What is the Real Cost of std::memory_order on ARM64? - Jetson Orin Benchmark
"On ARM, memory_order_seq_cst is very expensive, so use release/acquire whenever possible."
We directly measured whether this widely circulated advice in the C++ community is still valid on AArch64 (64-bit ARM).
Conclusion First
| Item | Conclusion |
|---|---|
| Is seq_cst expensive on ARMv7 (32-bit)? | Yes. Two DMB barriers are inserted, consuming tens to hundreds of cycles |
| Is it also expensive on ARMv8 AArch64 (64-bit)? | No. Both release/acquire and seq_cst compile to the same STLR/LDAR instructions |
| Is release/acquire more expensive than relaxed? | ~6ns pipeline stall when a store+load pair executes in the same thread. No difference for standalone stores or loads |
| Does it affect a 1kHz RT loop? | Virtually none. ~54ns total for ~25 atomic operations = 0.0054% of the 1ms period |
| Can we switch to relaxed? | No. Correctness cannot be guaranteed. A ~40ns saving is meaningless while the risk is critical |
Background: Why We Did This Analysis
Our team operates a 1kHz real-time robot control system on NVIDIA Jetson Orin (Cortex-A78AE). It is a hard real-time environment where EtherCAT communication, CiA 402 state machine, and PID torque control must all complete within the 1ms period.
Our codebase extensively uses std::atomic for communication between RT and non-RT threads. Seqlock pattern sequence counters, shutdown flags, and state machine transitions are all protected with memory_order_release/acquire.
We decided to directly measure whether this atomic usage could be a performance bottleneck in a 1kHz RT loop.
Instruction Mapping by ARM Architecture Generation
The key point is that ARMv7 (32-bit) and ARMv8 AArch64 (64-bit) use completely different instruction sets.
ARMv7 (32-bit) - Barrier Based
ARMv7 has no dedicated atomic store/load instructions. Compilers insert DMB (Data Memory Barrier) instructions with regular STR/LDR to guarantee ordering.
Store release: DMB ISH -> STR (1 barrier)
Store seq_cst: DMB ISH -> STR -> DMB ISH (2 barriers!)
Load acquire: LDR -> DMB ISH (1 barrier)
Load seq_cst: LDR -> DMB ISH (1 barrier)
Each DMB consumes tens to hundreds of cycles. Since seq_cst store has 2 DMBs, the cost can be twice that of release.
ARMv8 AArch64 (64-bit) - Dedicated Instructions
AArch64 has acquire/release semantics built into the instructions themselves.
Store release: STLR (Store-Release, single instruction)
Store seq_cst: STLR (same instruction!)
Load acquire: LDAR (Load-Acquire, single instruction)
Load seq_cst: LDAR (same instruction!)
DMB is completely eliminated. And crucially, release and seq_cst compile to the same instruction.
ARMv8.3+ FEAT_LRCPC - Subtle Differences Emerge
FEAT_LRCPC, introduced as optional in ARMv8.2 and made mandatory in ARMv8.3, added the LDAPR instruction.
Load acquire: LDAPR (Does not wait for previous STLR completion)
Load seq_cst: LDAR (Waits for previous STLR drain)
| Operation | C++ memory_order | ARMv7 | AArch64 | AArch64 + LRCPC |
|---|---|---|---|---|
| Store | relaxed | STR | STR | STR |
| Store | release | DMB + STR | STLR | STLR |
| Store | seq_cst | DMB + STR + DMB | STLR | STLR |
| Load | relaxed | LDR | LDR | LDR |
| Load | acquire | LDR + DMB | LDAR | LDAPR |
| Load | seq_cst | LDR + DMB | LDAR | LDAR |
Only on processors with FEAT_LRCPC is there a measurable difference between acquire and seq_cst loads. Jetson Orin's Cortex-A78AE is an ARMv8.2 core that implements the (there optional) FEAT_LRCPC extension.
Jetson Orin Benchmark Results
Benchmark Environment
| Item | Value |
|---|---|
| SoC | NVIDIA Jetson Orin (Cortex-A78AE) |
| ISA | ARMv8.2-A + FEAT_LRCPC |
| Counter frequency | 31.2 MHz |
| Compiler | g++ -O2 -std=c++17 -march=native |
| Iterations | 10,000,000 (plus 1,000,000 warmup) |
Benchmark Code (Core Section)
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
static std::atomic<uint64_t> g_counter{0};
constexpr int ITERATIONS = 10'000'000;
constexpr int WARMUP = 1'000'000;
// 1. Store-only benchmark
template <std::memory_order Order>
double bench_store_only() {
auto start = std::chrono::steady_clock::now();
for (int i = 0; i < ITERATIONS; ++i) {
g_counter.store(i, Order);
}
auto end = std::chrono::steady_clock::now();
double ns = std::chrono::duration<double, std::nano>(end - start).count();
return ns / ITERATIONS;
}
// 2. Load-only benchmark
template <std::memory_order Order>
double bench_load_only() {
volatile uint64_t sink = 0;
auto start = std::chrono::steady_clock::now();
for (int i = 0; i < ITERATIONS; ++i) {
sink = g_counter.load(Order);
}
auto end = std::chrono::steady_clock::now();
double ns = std::chrono::duration<double, std::nano>(end - start).count();
return ns / ITERATIONS;
}
// 3. Store+Load pair benchmark (same thread)
template <std::memory_order StoreOrder, std::memory_order LoadOrder>
double bench_store_load_pair() {
volatile uint64_t sink = 0;
auto start = std::chrono::steady_clock::now();
for (int i = 0; i < ITERATIONS; ++i) {
g_counter.store(i, StoreOrder);
sink = g_counter.load(LoadOrder);
}
auto end = std::chrono::steady_clock::now();
double ns = std::chrono::duration<double, std::nano>(end - start).count();
return ns / ITERATIONS;
}
// 4. Cross-thread benchmark
template <std::memory_order StoreOrder, std::memory_order LoadOrder>
double bench_cross_thread() {
std::atomic<bool> stop{false};
std::atomic<uint64_t> read_count{0};
// Reader thread
std::thread reader([&] {
uint64_t count = 0;
volatile uint64_t sink = 0;
while (!stop.load(std::memory_order_relaxed)) {
sink = g_counter.load(LoadOrder);
++count;
}
read_count.store(count, std::memory_order_relaxed);
});
// Writer (this thread)
auto start = std::chrono::steady_clock::now();
for (int i = 0; i < ITERATIONS; ++i) {
g_counter.store(i, StoreOrder);
}
auto end = std::chrono::steady_clock::now();
stop.store(true, std::memory_order_relaxed);
reader.join();
double ns = std::chrono::duration<double, std::nano>(end - start).count();
return ns / ITERATIONS;
}
Measurement Results
Store Only
| memory_order | Instruction | Time | vs relaxed |
|---|---|---|---|
| relaxed | STR | 0.46 ns/op | baseline |
| release | STLR | 0.46 ns/op | +0.00 ns (+0.2%) |
| seq_cst | STLR | 0.46 ns/op | +0.00 ns |
Load Only
| memory_order | Instruction | Time | vs relaxed |
|---|---|---|---|
| relaxed | LDR | 0.46 ns/op | baseline |
| acquire | LDAPR | 0.46 ns/op | -0.00 ns (-0.3%) |
| seq_cst | LDAR | 0.46 ns/op | +0.00 ns |
For standalone store or load, the cost difference by memory ordering is too small to measure.
Store+Load Pair (Same Thread)
| memory_order | Time | vs relaxed |
|---|---|---|
| relaxed/relaxed | 0.93 ns/pair | baseline |
| release/acquire | 7.09 ns/pair | +6.16 ns (+661%) |
| seq_cst/seq_cst | 7.09 ns/pair | +6.16 ns (+661%) |
This is where the real cost originates. Executing LDAR/LDAPR immediately after STLR causes a pipeline stall. However, the cost of release/acquire and seq_cst is exactly the same.
fetch_add (Read-Modify-Write)
| memory_order | Time |
|---|---|
| relaxed | 6.01 ns/op |
| acq_rel | 6.00 ns/op |
| seq_cst | 5.99 ns/op |
RMW operations have the same cost regardless of ordering. They are implemented either as LDXR/STXR exclusive-access loops or, where FEAT_LSE is available (as on Cortex-A78AE with -march=native), as single atomic instructions such as LDADD; in both cases the ordering variants are equally expensive here.
Cross-Thread Store/Load
| memory_order | Writer Time | vs release/acquire |
|---|---|---|
| relaxed/relaxed | 0.49 ns/write | -2.31 ns |
| release/acquire | 2.80 ns/write | baseline |
| seq_cst/seq_cst | 3.14 ns/write | +0.34 ns (+12%) |
In cross-thread scenarios, seq_cst is 0.34ns more expensive than release/acquire. This is where the difference between LDAPR (acquire) and LDAR (seq_cst) shows up.
Result Analysis: Where Does the Cost Come From?
STLR to LDAR Pipeline Stall
The reason the store+load pair cost jumps from 0.93ns to 7.09ns in the same thread is due to the characteristics of the STLR instruction. To provide the guarantee that "this store must be observed after all previous memory operations", STLR delays subsequent LDAR while draining the store buffer.
Time ->
relaxed: STR --- LDR ------ (0.93ns, passes through pipeline)
release: STLR --- wait drain --- LDAPR -- (7.09ns, waits for store buffer drain)
seq_cst: STLR --- wait drain --- LDAR --- (7.09ns, same stall)
Key point: This cost occurs equally with memory_order_release. "Downgrading" seq_cst to release/acquire does not reduce the cost of the store+load pattern within the same thread.
LDAPR vs LDAR (Cross-Thread)
FEAT_LRCPC's LDAPR carries the weaker RCpc semantics: it does not have to wait for a previous STLR to fully drain from the store buffer. That relaxation appears as the 0.34ns gap in the cross-thread benchmark.
However, this difference is negligible in absolute terms.
Actual RT Loop Impact Calculation
Atomic operation profile for a 1kHz control loop:
| Operation Type | Approx Count/cycle | Cost (relaxed) | Cost (release/acquire) |
|---|---|---|---|
| Seqlock sequence store | 2 | 0.92 ns | 0.92 ns |
| Seqlock sequence load | 4 | 1.84 ns | 1.84 ns |
| Status flag load | ~10 | 4.60 ns | 4.60 ns |
| Store+Load pair | ~5 | 4.65 ns | 35.45 ns |
| Cross-thread load | ~4 | 1.96 ns | 11.20 ns |
| Total | ~25 | ~14 ns | ~54 ns |
54ns / 1,000,000ns (1ms) = 0.0054% of the period budget is used.
Even with extremely generous calculations, it is less than 100ns, not even reaching 0.01% of the 1ms budget. Compared to PID calculation (~10us), dynamics computation (~30us), and EtherCAT PDO communication (~50us), it is completely at noise level.
Why You Should Not Switch to Relaxed
The temptation might arise: "If the cost is minimal anyway, can't we just use relaxed?" But ARM64 is a weakly-ordered architecture.
No Visibility Guarantee with Relaxed
// Thread 1 (RT controller)
data_.store(new_value, std::memory_order_relaxed);
running_.store(false, std::memory_order_relaxed);
// Thread 2 (shutdown handler)
while (running_.load(std::memory_order_relaxed)) {
// On ARM64 this loop could be delayed indefinitely!
// relaxed does not guarantee visibility timing.
}
memory_order_relaxed does not guarantee visibility timing to other cores. The C++ standard recommends that it "should" become visible within a reasonable time, but this is not a requirement (shall). On x86, stores propagate relatively quickly due to the TSO (Total Store Order) model, but on ARM64, values can stay in the store buffer for unpredictable durations.
Guarantees Provided by release/acquire
// Thread 1
data_.store(new_value, std::memory_order_relaxed);
running_.store(false, std::memory_order_release); // Guarantees data_ write completes first
// Thread 2
while (running_.load(std::memory_order_acquire)) { // acquire paired with release
// happens-before relationship established
// If running_ == false is observed, data_'s new_value is also guaranteed
}
// Here data_.load(relaxed) must return new_value
release/acquire establishes a happens-before relationship. It is not simply "becomes visible quickly" but a formal guarantee of "must be visible in the correct order".
Risk vs Cost
| Choice | Savings | Risk |
|---|---|---|
| Use relaxed | ~40ns/cycle (0.004%) | Visibility delay -> shutdown failure, data inconsistency, data races on dependent non-atomic state (undefined behavior) |
| Use release/acquire | 0 | 0 (correctness guaranteed) |
There is no reason to write code where the robot arm could indefinitely ignore shutdown signals to save 40ns.
Practical Guidelines
When to Use Which memory_order
Need correctness?
|
+-- No (counters, statistics, etc.)
|      -> relaxed
|
+-- Yes
     |
     +-- Single-variable visibility / publish-consume ordering
     |      -> release/acquire
     |         (AArch64 cost: ~3ns cross-thread; x86-64 cost: ~0ns, TSO)
     |
     +-- Need a global order across multiple atomic variables
            -> seq_cst
When seq_cst is needed:
- Need global ordering across multiple atomic variables
- Algorithms like Peterson's lock
- Default when unsure
Practical Rules for AArch64
- relaxed is only for pure statistics/counters. Do not use it for any flag or state variable that affects logic.
- release/acquire is the default choice. On AArch64, the cost is nearly identical to seq_cst while expressing intent more clearly.
- There is no need to avoid seq_cst. On AArch64, choosing release/acquire over seq_cst only saves ~0.34ns in cross-thread scenarios.
- Avoid a load immediately after a store in the same thread. This is the source of the ~7ns stall. If possible, interleave other operations between the store and the load, or reconsider the design.
- RMW (fetch_add, etc.) is ordering-independent. The cost is the same regardless of memory_order, so choose the safe option (acq_rel or seq_cst).
Validity by Architecture Generation
The advice "seq_cst is expensive on ARM" is true for ARMv7 (32-bit) but does not apply to AArch64 (64-bit). Always verify the target architecture when referencing ARM memory ordering articles.
| Architecture | seq_cst Additional Cost | Effect of Avoiding seq_cst |
|---|---|---|
| ARMv7 (32-bit) | 1 additional DMB (tens to hundreds of ns) | Significant |
| AArch64 (64-bit) | 0ns (same instruction) | None |
| AArch64 + LRCPC | ~0.3ns on load | Negligible |
Key Takeaways
- ARMv7 and AArch64 are different. The advice "seq_cst is expensive on ARM" only applies to 32-bit ARMv7. On AArch64, release/acquire and seq_cst compile to the same instructions (STLR/LDAR).
- The real source of cost is the pipeline stall from store+load pairs (~7ns). This cost occurs equally with release/acquire and does not decrease by avoiding seq_cst.
- Atomics are not a bottleneck in a 1kHz RT loop. The total cost of ~25 atomic operations is ~54ns, which is 0.005% of the 1ms budget. Compared to dynamics computation (~30us) or EtherCAT communication (~50us), it is at noise level.
- Do not switch to relaxed. ARM64 is a weakly-ordered architecture. Giving up visibility guarantees to save ~40ns creates critical risks like shutdown failure and data inconsistency.
- Correct code always comes before fast code. And on AArch64, correct code is also fast code.