What is the Real Cost of std::memory_order on ARM64? - Jetson Orin Benchmark
"On ARM, memory_order_seq_cst is very expensive, so use release/acquire whenever possible."
We directly measured whether this widely circulated advice in the C++ community is still valid on AArch64 (64-bit ARM).
Conclusion First
| Item | Conclusion |
|---|---|
| Is seq_cst expensive on ARMv7 (32-bit)? | Yes. Two DMB barriers are inserted, consuming tens to hundreds of cycles |
| Is it also expensive on ARMv8 AArch64 (64-bit)? | No. Both release/acquire and seq_cst compile to the same STLR/LDAR instructions |
| Is release/acquire more expensive than relaxed? | ~6ns pipeline stall when a store+load pair executes in the same thread. No difference for standalone stores or loads |
| Does it affect a 1kHz RT loop? | Virtually none. ~54ns total for ~25 atomic operations = 0.0054% of the 1ms period |
| Can we switch to relaxed? | No. Correctness cannot be guaranteed. A ~40ns saving is meaningless while the risk is critical |
Background: Why We Did This Analysis
Our team operates a 1kHz real-time robot control system on NVIDIA Jetson Orin (Cortex-A78AE). It is a hard real-time environment where EtherCAT communication, CiA 402 state machine, and PID torque control must all complete within the 1ms period.
Our codebase extensively uses std::atomic for communication between RT and non-RT threads. Seqlock pattern sequence counters, shutdown flags, and state machine transitions are all protected with memory_order_release/acquire.
We decided to directly measure whether this atomic usage could be a performance bottleneck in a 1kHz RT loop.
Instruction Mapping by ARM Architecture Generation
The key point is that ARMv7 (32-bit) and ARMv8 AArch64 (64-bit) use completely different instruction sets.
ARMv7 (32-bit) - Barrier Based
ARMv7 has no dedicated atomic store/load instructions. Compilers insert DMB (Data Memory Barrier) instructions with regular STR/LDR to guarantee ordering.
Store release: DMB ISH -> STR (1 barrier)
Store seq_cst: DMB ISH -> STR -> DMB ISH (2 barriers!)
Load acquire: LDR -> DMB ISH (1 barrier)
Load seq_cst: LDR -> DMB ISH (1 barrier)
Each DMB consumes tens to hundreds of cycles. Since seq_cst store has 2 DMBs, the cost can be twice that of release.
ARMv8 AArch64 (64-bit) - Dedicated Instructions
AArch64 has acquire/release semantics built into the instructions themselves.
Store release: STLR (Store-Release, single instruction)
Store seq_cst: STLR (same instruction!)
Load acquire: LDAR (Load-Acquire, single instruction)
Load seq_cst: LDAR (same instruction!)
DMB is completely eliminated. And crucially, release and seq_cst compile to the same instruction.
ARMv8.3+ FEAT_LRCPC - Subtle Differences Emerge
FEAT_LRCPC, introduced as optional in ARMv8.2 and made mandatory in ARMv8.3, added the LDAPR instruction.
Load acquire: LDAPR (Does not wait for previous STLR completion)
Load seq_cst: LDAR (Waits for previous STLR drain)
| Operation | C++ memory_order | ARMv7 | AArch64 | AArch64 + LRCPC |
|---|---|---|---|---|
| Store | relaxed | STR | STR | STR |
| Store | release | DMB + STR | STLR | STLR |
| Store | seq_cst | DMB + STR + DMB | STLR | STLR |
| Load | relaxed | LDR | LDR | LDR |
| Load | acquire | LDR + DMB | LDAR | LDAPR |
| Load | seq_cst | LDR + DMB | LDAR | LDAR |
Only on processors with FEAT_LRCPC is there a measurable difference between acquire and seq_cst loads. Jetson Orin's Cortex-A78AE is an ARMv8.2 core that implements the (there optional) FEAT_LRCPC extension.
Jetson Orin Benchmark Results
Benchmark Environment
| Item | Value |
|---|---|
| SoC | NVIDIA Jetson Orin (Cortex-A78AE) |
| ISA | ARMv8.2-A + FEAT_LRCPC |
| Counter frequency | 31.2 MHz |
| Compiler | g++ -O2 -std=c++17 -march=native |
| Iterations | 10,000,000 (plus 1,000,000 warmup) |
Benchmark Code (Core Section)
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
static std::atomic<uint64_t> g_counter{0};
constexpr int ITERATIONS = 10'000'000;
constexpr int WARMUP = 1'000'000;
// 1. Store-only benchmark
template <std::memory_order Order>
double bench_store_only() {
auto start = std::chrono::steady_clock::now();
for (int i = 0; i < ITERATIONS; ++i) {
g_counter.store(i, Order);
}
auto end = std::chrono::steady_clock::now();
double ns = std::chrono::duration<double, std::nano>(end - start).count();
return ns / ITERATIONS;
}
// 2. Load-only benchmark
template <std::memory_order Order>
double bench_load_only() {
volatile uint64_t sink = 0;
auto start = std::chrono::steady_clock::now();
for (int i = 0; i < ITERATIONS; ++i) {
sink = g_counter.load(Order);
}
auto end = std::chrono::steady_clock::now();
double ns = std::chrono::duration<double, std::nano>(end - start).count();
return ns / ITERATIONS;
}
// 3. Store+Load pair benchmark (same thread)
template <std::memory_order StoreOrder, std::memory_order LoadOrder>
double bench_store_load_pair() {
volatile uint64_t sink = 0;
auto start = std::chrono::steady_clock::now();
for (int i = 0; i < ITERATIONS; ++i) {
g_counter.store(i, StoreOrder);
sink = g_counter.load(LoadOrder);
}
auto end = std::chrono::steady_clock::now();
double ns = std::chrono::duration<double, std::nano>(end - start).count();
return ns / ITERATIONS;
}
// 4. Cross-thread benchmark
template <std::memory_order StoreOrder, std::memory_order LoadOrder>
double bench_cross_thread() {
std::atomic<bool> stop{false};
std::atomic<uint64_t> read_count{0};
// Reader thread
std::thread reader([&] {
uint64_t count = 0;
volatile uint64_t sink = 0;
while (!stop.load(std::memory_order_relaxed)) {
sink = g_counter.load(LoadOrder);
++count;
}
read_count.store(count, std::memory_order_relaxed);
});
// Writer (this thread)
auto start = std::chrono::steady_clock::now();
for (int i = 0; i < ITERATIONS; ++i) {
g_counter.store(i, StoreOrder);
}
auto end = std::chrono::steady_clock::now();
stop.store(true, std::memory_order_relaxed);
reader.join();
double ns = std::chrono::duration<double, std::nano>(end - start).count();
return ns / ITERATIONS;
}
Measurement Results
Store Only
| memory_order | Instruction | Time | vs relaxed |
|---|---|---|---|
| relaxed | STR | 0.46 ns/op | baseline |
| release | STLR | 0.46 ns/op | +0.00 ns (+0.2%) |
| seq_cst | STLR | 0.46 ns/op | +0.00 ns |
Load Only
| memory_order | Instruction | Time | vs relaxed |
|---|---|---|---|
| relaxed | LDR | 0.46 ns/op | baseline |
| acquire | LDAPR | 0.46 ns/op | -0.00 ns (-0.3%) |
| seq_cst | LDAR | 0.46 ns/op | +0.00 ns |
For standalone store or load, the cost difference by memory ordering is too small to measure.
Store+Load Pair (Same Thread)
| memory_order | Time | vs relaxed |
|---|---|---|
| relaxed/relaxed | 0.93 ns/pair | baseline |
| release/acquire | 7.09 ns/pair | +6.16 ns (+661%) |
| seq_cst/seq_cst | 7.09 ns/pair | +6.16 ns (+661%) |
This is where the real cost originates. Executing LDAR/LDAPR immediately after STLR causes a pipeline stall. However, the cost of release/acquire and seq_cst is exactly the same.
fetch_add (Read-Modify-Write)
| memory_order | Time |
|---|---|
| relaxed | 6.01 ns/op |
| acq_rel | 6.00 ns/op |
| seq_cst | 5.99 ns/op |
RMW operations have the same cost regardless of ordering. They are implemented either as LDXR/STXR exclusive-access loops or, where FEAT_LSE is available (as on Cortex-A78AE with -march=native), as single atomic instructions such as LDADD; in both cases the ordering variants are equally expensive here.
Cross-Thread Store/Load
| memory_order | Writer Time | vs release/acquire |
|---|---|---|
| relaxed/relaxed | 0.49 ns/write | -2.31 ns |
| release/acquire | 2.80 ns/write | baseline |
| seq_cst/seq_cst | 3.14 ns/write | +0.34 ns (+12%) |
In cross-thread scenarios, seq_cst is 0.34ns more expensive than release/acquire. This is where the difference between LDAPR (acquire) and LDAR (seq_cst) shows up.
Result Analysis: Where Does the Cost Come From?
STLR to LDAR Pipeline Stall
The reason the store+load pair cost jumps from 0.93ns to 7.09ns in the same thread is due to the characteristics of the STLR instruction. To provide the guarantee that "this store must be observed after all previous memory operations", STLR delays subsequent LDAR while draining the store buffer.
Time ->
relaxed: STR --- LDR ------ (0.93ns, passes through pipeline)
release: STLR --- wait drain --- LDAPR -- (7.09ns, waits for store buffer drain)
seq_cst: STLR --- wait drain --- LDAR --- (7.09ns, same stall)
Key point: This cost occurs equally with memory_order_release. "Downgrading" seq_cst to release/acquire does not reduce the cost of the store+load pattern within the same thread.
LDAPR vs LDAR (Cross-Thread)
FEAT_LRCPC's LDAPR carries the weaker RCpc semantics: it does not have to wait for a previous STLR to fully drain from the store buffer. That relaxation appears as the 0.34ns gap in the cross-thread benchmark.
However, this difference is negligible in absolute terms.
Actual RT Loop Impact Calculation
Atomic operation profile for a 1kHz control loop:
| Operation Type | Approx Count/cycle | Cost (relaxed) | Cost (release/acquire) |
|---|---|---|---|
| Seqlock sequence store | 2 | 0.92 ns | 0.92 ns |
| Seqlock sequence load | 4 | 1.84 ns | 1.84 ns |
| Status flag load | ~10 | 4.60 ns | 4.60 ns |
| Store+Load pair | ~5 | 4.65 ns | 35.45 ns |
| Cross-thread load | ~4 | 1.96 ns | 11.20 ns |
| Total | ~25 | ~14 ns | ~54 ns |
54ns / 1,000,000ns (1ms) = 0.0054% of the period budget is used.
Even with extremely generous calculations, it is less than 100ns, not even reaching 0.01% of the 1ms budget. Compared to PID calculation (~10us), dynamics computation (~30us), and EtherCAT PDO communication (~50us), it is completely at noise level.
Why You Should Not Switch to Relaxed
The temptation might arise: "If the cost is minimal anyway, can't we just use relaxed?" But ARM64 is a weakly-ordered architecture.
No Visibility Guarantee with Relaxed
// Thread 1 (RT controller)
data_.store(new_value, std::memory_order_relaxed);
running_.store(false, std::memory_order_relaxed);
// Thread 2 (shutdown handler)
while (running_.load(std::memory_order_relaxed)) {
// On ARM64 this loop could be delayed indefinitely!
// relaxed does not guarantee visibility timing.
}
memory_order_relaxed does not guarantee visibility timing to other cores. The C++ standard recommends that it "should" become visible within a reasonable time, but this is not a requirement (shall). On x86, stores propagate relatively quickly due to the TSO (Total Store Order) model, but on ARM64, values can stay in the store buffer for unpredictable durations.
Guarantees Provided by release/acquire
// Thread 1
data_.store(new_value, std::memory_order_relaxed);
running_.store(false, std::memory_order_release); // Guarantees data_ write completes first
// Thread 2
while (running_.load(std::memory_order_acquire)) { // acquire paired with release
// happens-before relationship established
// If running_ == false is observed, data_'s new_value is also guaranteed
}
// Here data_.load(relaxed) must return new_value
release/acquire establishes a happens-before relationship. It is not simply "becomes visible quickly" but a formal guarantee of "must be visible in the correct order".
Risk vs Cost
| Choice | Savings | Risk |
|---|---|---|
| Use relaxed | ~40ns/cycle (0.004%) | Visibility delay -> shutdown failure, data inconsistency, data races on dependent non-atomic state (undefined behavior) |
| Use release/acquire | 0 | 0 (correctness guaranteed) |
There is no reason to write code where the robot arm could indefinitely ignore shutdown signals to save 40ns.
Practical Guidelines
When to Use Which memory_order
Need correctness?
|
+-- No (counters, statistics, etc.)
|      -> relaxed
|
+-- Yes
     |
     +-- Single-variable visibility / publish-consume ordering
     |      -> release/acquire
     |         (AArch64 cost: ~3ns cross-thread; x86-64 cost: ~0ns, TSO)
     |
     +-- Need a global order across multiple atomic variables
            -> seq_cst
When seq_cst is needed:
- Need global ordering across multiple atomic variables
- Algorithms like Peterson's lock
- Default when unsure
Practical Rules for AArch64
- relaxed is only for pure statistics/counters. Do not use it for any flag or state variable that affects logic.
- release/acquire is the default choice. On AArch64, the cost is nearly identical to seq_cst while expressing intent more clearly.
- There is no need to avoid seq_cst. On AArch64, choosing release/acquire over seq_cst only saves ~0.34ns in cross-thread scenarios.
- Avoid a load immediately after a store in the same thread. This is the source of the ~7ns stall. If possible, interleave other operations between the store and the load, or reconsider the design.
- RMW (fetch_add, etc.) is ordering-independent. The cost is the same regardless of memory_order, so choose the safe option (acq_rel or seq_cst).
Validity by Architecture Generation
The advice "seq_cst is expensive on ARM" is true for ARMv7 (32-bit) but does not apply to AArch64 (64-bit). Always verify the target architecture when referencing ARM memory ordering articles.
| Architecture | seq_cst Additional Cost | Effect of Avoiding seq_cst |
|---|---|---|
| ARMv7 (32-bit) | 1 additional DMB (tens to hundreds of ns) | Significant |
| AArch64 (64-bit) | 0ns (same instruction) | None |
| AArch64 + LRCPC | ~0.3ns on load | Negligible |
Key Takeaways
- ARMv7 and AArch64 are different. The advice "seq_cst is expensive on ARM" only applies to 32-bit ARMv7. On AArch64, release/acquire and seq_cst compile to the same instructions (STLR/LDAR).
- The real source of cost is the pipeline stall from store+load pairs (~7ns). This cost occurs equally with release/acquire and does not decrease by avoiding seq_cst.
- Atomics are not a bottleneck in a 1kHz RT loop. The total cost of ~25 atomic operations is ~54ns, which is 0.005% of the 1ms budget. Compared to dynamics computation (~30us) or EtherCAT communication (~50us), it is at noise level.
- Do not switch to relaxed. ARM64 is a weakly-ordered architecture. Giving up visibility guarantees to save ~40ns creates critical risks like shutdown failure and data inconsistency.
- Correct code always comes before fast code. And on AArch64, correct code is also fast code.