Is PREEMPT_RT Enough? Validating Real-Time Performance on Jetson Orin
Embedded/Real-time

We validate with measured data whether PREEMPT_RT kernel and CPU isolation alone can enable 1kHz EtherCAT control on NVIDIA Jetson Orin. Under GPU, Storage, and EtherCAT load conditions, we achieved max latency below 20us.

WRWIM Robotics Team
Tags: jetson · preempt-rt · real-time · ethercat · cpu-isolation

Engineers looking to deploy robot control systems on Jetson have likely asked themselves this question at some point:

"Is installing the PREEMPT_RT kernel enough for 1kHz control? Or do I need to build a custom kernel?"

In this article, we validate with measured data whether NVIDIA's official RT package + CPU isolation alone can meet the latency requirements for 1kHz EtherCAT control on a Jetson Orin AGX.

The conclusion upfront: PREEMPT_RT alone is not enough. Without CPU isolation, abnormal jitter spikes exceeding 100us frequently occurred under GPU, Storage, and EtherCAT loads. PREEMPT_RT + CPU isolation together were required to achieve max latency below 20us under all load conditions.

1kHz Control and Latency Requirements

Why 100us?

A 1kHz control loop operates at a 1ms (1000us) period. The following tasks must execute sequentially in each cycle:

  1. Wakeup - RT task is awakened by the scheduler
  2. Read - Receive EtherCAT frame, read sensor data
  3. Compute - Control algorithm computation (PID, inverse kinematics, etc.)
  4. Write - Send motor commands

All these tasks must complete within 1ms to not miss the next cycle. Generally, wakeup latency should be 10% or less of the period to ensure sufficient time for remaining tasks. Therefore, the wakeup latency target for 1kHz control is 100us.

Why EtherCAT Communication is Vulnerable

The EtherCAT master periodically sends and receives frames. Problems that occur during this process:

  • Interrupt interference: Interrupts from NVMe, GPU, network, etc. preempt the RT task and delay frame transmission timing
  • Scheduling delays: If kernel threads (ksoftirqd, kworker, etc.) run before the RT task, its wakeup is delayed
  • DC synchronization issues: EtherCAT Distributed Clocks compensate for master jitter to some extent, but a severe delay makes the next cycle's frame miss its transmission deadline, so the control loop misses a data update. DC provides sub-1us synchronization between slaves, but master-side timing must be managed separately

This can ultimately lead to motor torque output delays, increased position tracking errors, and in severe cases, control loop divergence.

Reference Research

Research measuring EtherCAT control performance in ROS2 environments (ROS2 Performance Study, 2023) reported average 2us, maximum 82us latency. These figures meet the 1kHz control criterion (100us).

Test Goals and Environment

Test Goals

As explained earlier, EtherCAT masters are vulnerable to interrupts and scheduling. The PREEMPT_RT kernel alone cannot completely block such interference. CPU isolation secures dedicated cores for RT tasks, ensuring most system activities (GPU rendering, NVMe I/O, general kernel threads, etc.) do not execute on those cores. However, some exceptions remain such as IPI (inter-processor interrupts), local timers, and per-CPU kernel threads.

Core question: Is NVIDIA OTA RT package + CPU isolation sufficient for 1kHz control loops?

This test validates whether wakeup latency remains stable without spikes exceeding 100us even when various system loads occur with CPUs 8-11 isolated.

Test Environment

| Item | Specification |
| --- | --- |
| Hardware | Jetson Orin AGX 64GB |
| Kernel | NVIDIA RT Package (OTA) |
| Power Mode | MAXN + jetson_clocks |
| Test Duration | 10 minutes (600,000 samples) |
| Measurement Tool | cyclictest -p 80 -i 1000 -l 600000 -m -a 8-11 |
| RT Throttling | Disabled (sched_rt_runtime_us=-1) |
| Success Criterion | All tests Max latency < 100us |

Effect of RT Kernel

First, let's verify the effect of the RT kernel itself. Under combined load (CPU + I/O + Memory):

| Kernel | Max Latency | Note |
| --- | --- | --- |
| non-RT (stock) | 318us | Exceeds 100us criterion |
| RT (PREEMPT_RT) | 56us | 82% improvement |

PREEMPT_RT alone improves from 318us to 56us - an 82% improvement. While this meets the 100us criterion for simple stress-ng combined loads, we cannot conclude "real-time control is possible" from this alone.

In actual operating environments, various loads occur including GPU rendering, NVMe I/O, and EtherCAT communication. Testing under these real load conditions reveals frequent abnormal jitter spikes with PREEMPT_RT alone.

How PREEMPT_RT Works

The PREEMPT_RT patch transforms the standard Linux kernel into a fully preemptible kernel. Key changes:

Threaded IRQ (Interrupt Threading)

In the standard kernel, interrupt handlers execute in interrupt context where the scheduler cannot intervene. PREEMPT_RT converts most interrupt handlers into kernel threads (irq/N-name), making them scheduling targets.

IRQ threads run by default at SCHED_FIFO priority 50. Therefore, RT tasks with higher priority (e.g., 80) can be scheduled before IRQ threads. However, top-half (hardirq) still executes in interrupt context, and interrupts marked with IRQF_NO_THREAD (timers, IPI, etc.) are not threaded.

Priority Inheritance

Priority inversion is a classic problem in real-time systems:

  1. Low-priority task (L) acquires a lock
  2. High-priority RT task (H) requests the same lock and waits
  3. Medium-priority task (M) preempts L - H runs later than M

PREEMPT_RT's RT-Mutex implements priority inheritance. When L holds a lock and H waits, L's priority temporarily matches H's, preventing M from preempting L. L reverts to its original priority when releasing the lock.

This mechanism became essential for real-time systems after the Mars Pathfinder mission in 1997 experienced system resets due to priority inversion.

Spinlock to RT-Mutex Conversion

The standard kernel's spin_lock is implemented with busy-waiting. It occupies the CPU while waiting for lock release, preventing other tasks from executing.

PREEMPT_RT converts most spin_locks to sleepable RT-mutexes:

| Aspect | Standard Kernel (spinlock) | PREEMPT_RT (RT-mutex) |
| --- | --- | --- |
| Wait Method | Busy-waiting (CPU occupied) | Sleep (CPU yielded) |
| Preemptible | No | Yes |
| Priority Inheritance | No | Yes |

However, raw_spin_lock remains unconverted as original spinlocks. These are used in minimal sections requiring interrupt disabling (hardware register access, etc.).

This enables preemption during kernel critical sections protected by RT-mutex, greatly reducing the time high-priority RT tasks are blocked by low-priority kernel work.

PREEMPT_RT Limitations

PREEMPT_RT is powerful but does not solve all jitter causes:

| Jitter Cause | PREEMPT_RT Solution | Additional Action Needed |
| --- | --- | --- |
| Interrupt handler delay | Threaded IRQ | - |
| Priority inversion | Priority Inheritance | - |
| Kernel critical section | RT-Mutex | - |
| Interrupt location | No | irqaffinity needed |
| Cache pollution | No | isolcpus needed |
| Kernel thread contention | No | kthread_cpus needed |
| Timer ticks | No | nohz_full needed (not in OTA) |

PREEMPT_RT guarantees "if an RT task can run, it runs immediately" but does not guarantee "interference is prevented from accessing the CPU where the RT task runs." This is the fundamental reason why CPU isolation is needed.

Why CPU Isolation is Needed

Even with the RT kernel installed, latency spikes occur under certain load conditions without CPU isolation. To understand this phenomenon, we need to examine the OS's interrupt handling, cache architecture, and scheduling mechanisms.

Root Causes of Jitter

In real-time control, "jitter" refers to execution time variation of periodic tasks. Even with low average latency, intermittent abnormal spikes mean proper real-time control is not possible.

Key reasons jitter spikes occur even with PREEMPT_RT kernel:

1. Interrupt Interference

The Linux kernel divides interrupt processing into two stages:

  • Top-half (hardirq): Executes immediately upon IRQ occurrence. Performs minimal processing like hardware ACK with interrupts disabled
  • Bottom-half: Actual data processing. In PREEMPT_RT, executes as kernel thread (irq/N-name) and is preemptible

The problem is that the top-half still executes in interrupt context. When an NVMe I/O, GPU, or network interrupt fires, the RT task must wait, however briefly, until the top-half completes.

Thanks to PREEMPT_RT's threaded IRQ, bottom-half becomes a scheduling target, but which CPU processes interrupts depends on IRQ affinity settings. Without isolation, top-half can execute on the core where RT tasks are running.

2. Cache Pollution

Modern CPUs use multi-level caches (L1/L2/L3) to speed up memory access. When an RT task is preempted:

  1. Another process/kernel thread executes on the CPU
  2. That process's data is loaded into cache - RT task's cache lines are evicted
  3. When the RT task resumes, cache miss occurs
  4. Main memory access needed - tens to hundreds of cycles delay

This same phenomenon occurs with TLB (Translation Lookaside Buffer). TLB misses trigger page table walks, causing additional delays.

Cache Pollution Effect

Hot Cache (~10ns) -> Cache Eviction -> Cold Cache (~100ns): 10x delay from cache miss causes jitter spikes

3. Kernel Thread Contention

The Linux kernel handles various background tasks through kernel threads:

| Kernel Thread | Role | Impact on RT Tasks |
| --- | --- | --- |
| ksoftirqd | Soft interrupt processing (network, timers) | Batch processing delays when softirq delayed |
| kworker | Async kernel work queue processing | Executes at unpredictable times |
| rcu_preempt | RCU callback processing | Runs periodically on all CPUs |
| migration | Task movement between CPUs | Migrated tasks start with cold cache |

These threads can run on any CPU by default and compete with RT tasks.

4. Timer Tick Overhead

The default Linux kernel generates periodic timer interrupts (ticks) on all CPUs. Each tick involves:

  1. Interrupt handler execution
  2. Scheduler invocation (runqueue check)
  3. Time-related statistics updates
  4. RCU callback checks

This overhead is typically a few microseconds, but combined with other factors, it can lead to spikes. The nohz_full option removes these ticks on isolated CPUs, but is not included in NVIDIA's OTA RT package.

How CPU Isolation Works

CPU isolation acts on several kernel subsystems simultaneously. Let's examine how each parameter works at the kernel level.

isolcpus: Exclusion from Scheduler Domains

isolcpus=domain,8-11 excludes CPUs 8-11 from the kernel scheduler's load balancing targets:

# Check /sys/devices/system/cpu/cpu8/domain*
# Isolated CPUs have their own single-CPU domain

The kernel scheduler periodically redistributes tasks between CPUs (load balancing). Isolated CPUs are excluded from this process:

  • Regular processes are not automatically placed on them
  • RT tasks' L1/L2 cache and TLB are not polluted (LLC/L3 is shared and can still be affected)
  • Only tasks explicitly specified with taskset or sched_setaffinity() execute

isolcpus vs cpuset

isolcpus has the constraint that it cannot be changed after boot, so cgroup/cpuset is recommended for environments needing runtime flexibility. However, in real-time/embedded systems, isolcpus provides more reliable isolation. NVIDIA's official RT kernel documentation also recommends isolcpus=managed_irq,domain.

irqaffinity: Interrupt Routing Control

irqaffinity=0-7 sets the default affinity of all IRQs to CPUs 0-7 at kernel boot:

# Check after boot
cat /proc/irq/*/smp_affinity_list
# Most device IRQs set to 0-7 (per-CPU/managed IRQs are exceptions)

This setting changes the default value of /proc/irq/<irq>/smp_affinity. However, userspace drivers or irqbalance can change this at runtime, so in production environments, disabling irqbalance or locking settings is recommended.

kthread_cpus: Kernel Thread Isolation

kthread_cpus=0-7 restricts default CPU affinity when creating kernel threads:

# Check after boot
ps -eo pid,comm,psr | grep -E 'ksoftirqd|kworker|rcu'
# Most kernel threads run on CPU 0-7

However, not all kernel threads follow this setting. Some per-CPU kernel threads (e.g., migration/N, cpuhp/N) are bound to specific CPUs and cannot be moved. These threads still run on isolated CPUs, but their execution frequency is low, so actual impact is minimal.

Synergy of the Three Parameters

| Layer | Parameter | Isolation Target |
| --- | --- | --- |
| Scheduler | isolcpus=domain | Regular processes |
| Interrupts | irqaffinity | Hardware interrupts |
| Kernel | kthread_cpus | Kernel background threads |

All three parameters must be used together for complete isolation. Setting only isolcpus isolates processes but interrupts and kernel threads can still invade RT cores.

Average vs Worst-Case Latency

What matters in real-time systems is worst-case, not average latency.

| Metric | Meaning | Impact on 1kHz Control |
| --- | --- | --- |
| Average Latency | Performance in most cases | General control quality |
| Worst-Case Latency | Performance in worst case | Control failure if exceeded even once |

In robot control, jitter spikes lead to motor torque output delays, missed sensor feedback, and control loop instability.

Results from 10-minute tests (600,000 samples) without isolation:

| Load | Avg Latency | Max Latency | Verdict |
| --- | --- | --- | --- |
| Idle | 2.4us | 22us | PASS |
| GPU (glmark2) | 3.6us | 159us | FAIL |
| EtherCAT (1kHz DC) | 3.8us | 113us | FAIL |
| Storage (fio) | 5.3us | 145us | FAIL |
| System (stress-ng) | 6.1us | 47us | PASS |

Average latency is good at under 10us for all, but Max latency spikes exceeding 100us occurred under GPU, EtherCAT, and Storage loads. Repeated spikes lead to degraded control quality and are unacceptable in applications requiring precision control.

CPU Isolation Configuration

We applied the following boot parameters to isolate CPUs 8-11 for RT use:

| Parameter | Role |
| --- | --- |
| isolcpus=managed_irq,domain,8-11 | Scheduler isolation + managed IRQ isolation |
| irqaffinity=0-7 | Restrict general IRQs to CPU 0-7 |
| kthread_cpus=0-7 | Restrict kernel threads to CPU 0-7 |

isolcpus alone is incomplete. The managed_irq flag only isolates IRQs automatically managed by the kernel (MSI-X, NVMe, etc.), so irqaffinity and kthread_cpus must be used together for complete isolation of general IRQs and kernel threads.
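For reference, on JetPack these kernel parameters typically go on the APPEND line of /boot/extlinux/extlinux.conf. The excerpt below is a sketch: keep whatever is already on your board's APPEND line (including ${cbootargs}) and add the isolation parameters to it.

```text
# /boot/extlinux/extlinux.conf (excerpt; existing entries vary by board)
LABEL primary
      MENU LABEL primary kernel
      LINUX /boot/Image
      APPEND ${cbootargs} isolcpus=managed_irq,domain,8-11 irqaffinity=0-7 kthread_cpus=0-7
```

After rebooting, /proc/cmdline should show the added parameters, and the checks shown in the previous section can confirm that IRQs and kernel threads stay on CPUs 0-7.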

Application-Level Optimization

Kernel settings alone are insufficient. The RT task itself must also be properly configured to achieve real-time performance.

RT Scheduling Policy Configuration

Linux provides several scheduling policies:

| Policy | Priority | Characteristics |
| --- | --- | --- |
| SCHED_OTHER | (none) | Default policy, always preempted by RT tasks |
| SCHED_FIFO | 1-99 | RT policy, higher priority preempts lower |
| SCHED_RR | 1-99 | RT policy, FIFO + time-slicing among same priority |

SCHED_FIFO is suitable for 1kHz control tasks. Without time-slicing, the task runs until it yields (sched_yield) or is preempted by higher priority, making behavior deterministic.

Three Elements of RT Task Configuration

For an RT task to operate deterministically, three things must be configured:

| Setting | API / Command | Purpose |
| --- | --- | --- |
| Memory Lock | mlockall() | Prevent ms-level delays from page faults |
| CPU Affinity | sched_setaffinity() or taskset | Execute only on isolated CPUs |
| RT Scheduling | sched_setscheduler() | Apply SCHED_FIFO policy |

Order matters:

  1. mlockall() - Process page faults before RT scheduling is applied
  2. CPU affinity - Bind to isolated CPUs
  3. sched_setscheduler() - Apply RT policy last

mlockall() is called first because faulting pages into memory takes time; once RT scheduling is applied, any such delay feeds directly into the control loop.

mlockall() Considerations
  • Stack extension preparation: mlockall() locks currently allocated stack, but stack extension can cause page faults on new pages. Prefault the stack by declaring a sufficiently large local array before entering the RT loop.
  • Memory capacity check: Since all memory is pinned to physical RAM, the OOM killer may terminate other processes if system memory is insufficient.

Results After CPU Isolation

Re-measurement under the same load conditions after CPU isolation:

| Load | Before (Max) | After (Max) | Improvement |
| --- | --- | --- | --- |
| Idle | 22us | 8us | 64% |
| GPU | 159us (FAIL) | 15us (PASS) | 91% |
| EtherCAT | 113us (FAIL) | 6us (PASS) | 95% |
| Storage | 145us (FAIL) | 7us (PASS) | 95% |
| System | 47us | 15us | 68% |

The three tests that FAILED before isolation (GPU, EtherCAT, Storage) all converted to PASS. No spikes exceeding 100us were observed during the test period (10 minutes/600k samples), and all tests achieved max latency below 20us. These figures have sufficient margin from the 1kHz control criterion (100us).

CPU Isolation Effect

Red bars: Exceeds 100us (FAIL), Green bars: Below 100us (PASS), Dotted line: 100us criterion

Time Series Distribution Comparison

Comparing latency distributions measured over 10 minutes makes the isolation effect clearer:

Time series comparison

Without isolation (left), spikes occasionally exceed 100us. While average latency looks good, multiple such spikes were observed even in 10-minute tests.

With isolation (right), all measured samples remained stable below 15us. This is the essence of "real-time control" - managing worst-case, not average.

Custom Kernel Exploration Process

The results shown so far (referred to below as the Q-tests) used the NVIDIA OTA RT package. However, we also explored custom kernel builds to verify whether the nohz_full and rcu_nocbs options recommended by industry references are truly necessary.

Why We Considered Custom Builds

Linux Kernel official documentation, Intel ECI SDK, Red Hat RT and others recommend the following options for real-time systems:

  • CONFIG_NO_HZ_FULL: Remove timer ticks on isolated CPUs
  • CONFIG_RCU_NOCB_CPU: Offload RCU callbacks to other CPUs

The NVIDIA OTA RT package does not include these options, so the boot parameters nohz_full=8-11 and rcu_nocbs=8-11 are silently ignored.

Custom Kernel Build Tests (M-tests)

We built a custom kernel (CONFIG_NO_HZ_FULL=y, CONFIG_RCU_NOCB_CPU=y) and tested under Idle and combined load (L2: CPU + I/O + Memory) conditions:

| Condition | Kernel | Isolation | Load | Max Latency | Status |
| --- | --- | --- | --- | --- | --- |
| M1 | non-RT | No | Idle | 53us | PASS |
| M2 | non-RT | No | L2 | 318us | FAIL |
| M3 | RT | No | Idle | 23us | PASS |
| M4 | RT | No | L2 | 56us | PASS |
| M5 | RT | Yes | Idle | 26us | PASS |
| M6 | RT | Yes | L2 | 24us | PASS |

M-tests isolation boot parameters:

isolcpus=managed_irq,domain,8-11 nohz_full=8-11 rcu_nocbs=8-11 rcu_nocb_poll irqaffinity=0-7 kthread_cpus=0-7

Step-by-Step Improvement Effect

| Step | Change | Improvement |
| --- | --- | --- |
| non-RT -> RT (load) | 318us -> 56us | 82% reduction |
| RT -> RT+isolation (load) | 56us -> 24us | 57% reduction |
| Total (M2 -> M6) | 318us -> 24us | 92% reduction |

Significance of M-tests

Test Design Differences

M-tests and Q-tests cannot be directly compared:

  • M-tests: Exploratory tests, stress-ng L2 load only, 1 minute
  • Q-tests: Production validation, individual GPU/EtherCAT/Storage loads, 10 minutes (600,000 samples)

Therefore, comparing "M6 (24us) vs Q-tests (6-15us)" to conclude "OTA is better" is not valid.

What M-tests show:

  • PREEMPT_RT effect: 82% improvement from non-RT (318us) to RT (56us)
  • CPU isolation additional effect: 57% additional improvement from RT (56us) to RT+isolation (24us)
  • Custom kernel (NO_HZ_FULL) achieves 24us: Sufficiently meets 100us criterion

What Q-tests show:

  • OTA package + CPU isolation achieves 6-15us: Also meets 100us criterion
  • Validated under various real loads (GPU, Storage, EtherCAT)

Conclusion: Both approaches meet 1kHz control requirements (100us). However, the OTA package is easy to install, while custom builds are complex. Additional experiments are needed to verify performance differences under identical conditions.

When Custom Kernel Build is Necessary

Based on JetPack 6.2 (L4T 36.x), the OTA RT package does not include the following options:

  • CONFIG_NO_HZ_FULL (Full dynticks)
  • CONFIG_RCU_NOCB_CPU (RCU callback offloading)

These options are recommended by Linux Kernel official documentation for the following cases:

"Unless you are running realtime applications or certain types of HPC workloads, you will normally NOT want this option"

When custom build is needed:

  • When PREEMPT_RT's worst-case latency (~100us) is still insufficient
  • High-frequency trading (HFT), semiconductor equipment control, and other extreme latency requirements

Custom Build Complexity

When choosing custom kernel builds, consider the following:

  • Build time: 60-90 minutes required (OTA is 5-10 minutes)
  • NVMe rootfs boot caution: If NVMe driver is a module (=m) and not included in initrd, boot fails. Built-in (=y) is safest
  • initrd sync required: Manual /boot/initrd update needed after kernel build. Most common boot failure cause

The OTA package is sufficient for most robot control applications.

Q-tests achieved 6-15us max latency with OTA package + CPU isolation, and M-tests achieved 24us with custom kernel. While direct comparison is difficult due to different test conditions, both sufficiently meet the 100us criterion.

JetPack 7 Outlook

Currently (December 2025) JetPack 7 has been released exclusively for Jetson Thor, and Orin series is not yet officially supported. JetPack 6 for Orin is the current production version. However, JetPack 7 official documentation mentions CONFIG_NO_HZ_FULL and CONFIG_RCU_NOCB_CPU related settings, suggesting these options may be included in the OTA RT package when JetPack 7 for Orin is released.

Custom Build Trade-offs

If considering custom builds, be aware of the following trade-offs:

| Aspect | Isolated CPUs | Entire System |
| --- | --- | --- |
| Latency | Improved | - |
| Housekeeping CPU load | - | Increased |
| Syscall overhead | Increased | Increased |
| Throughput | - | Decreased |

SUSE Labs analysis:

"The jitter-free power you gain on your set of isolated CPUs comes at the expense of more work for the other CPUs"

Key Takeaways

The answer to "Is PREEMPT_RT enough?":

No. PREEMPT_RT alone produces jitter spikes exceeding 100us under GPU, Storage, and EtherCAT loads. CPU isolation is mandatory, not optional.

Required configuration for 1kHz EtherCAT control:

  • NVIDIA official RT package (OTA)
  • CPU isolation: isolcpus + irqaffinity + kthread_cpus (mandatory)

Results with PREEMPT_RT + CPU isolation:

  • Max latency 6-15us achieved under all load conditions (sufficient margin from 100us criterion)
  • 91-95% jitter improvement under GPU, Storage, EtherCAT loads

Custom kernel builds are unnecessary for most robot control applications. However, if you install only PREEMPT_RT and operate without CPU isolation, abnormal jitter may occur under real load conditions.

References

  • Official Documentation
  • CPU Isolation Deep Dive
  • Jetson Kernel Build