Lock-free vs Mutex: Robot Control System IPC Performance Benchmark
The intuition that "in-process communication is always faster than inter-process communication" is wrong.
What determines performance is the synchronization mechanism, not the process boundary.
In this article, we compare the performance of Mutex-based and Lock-free synchronization in a 4-stage pipeline simulating a robot control system.
Conclusions First
| IPC Method | Before Optimization (Mutex) | After Optimization (Lock-free) | Improvement |
|---|---|---|---|
| in_process | 78-103 us | 0.74-0.82 us | ~100x |
| shared_memory | 0.7-0.8 us | 0.75-0.78 us | No change |
| pipe | 33-106 us | 68-87 us | No change |
After applying Lock-free, in_process and shared_memory achieve identical performance.
Test Environment
4-Stage Pipeline (Robot Control Simulation)
Planner -> IK Solver -> EtherCAT Master -> Mock Hardware
| Stage | Processing Content |
|---|---|
| Planner | Cartesian trajectory generation |
| IK Solver | Joint angle calculation |
| EtherCAT Master | EtherCAT frame generation |
| Mock Hardware | Actuator response simulation |
Test Conditions
| Parameter | Value |
|---|---|
| Period | 1ms, 2ms, 4ms, 10ms |
| Test Duration | 10 seconds/test |
| Warmup | 100 iterations |
| DOF | 6 DOF |
| Message Size | Fixed (no dynamic allocation) |
Measured Metrics
- Cycle Latency: Full pipeline round-trip time
- Jitter: Latency variation
- Percentiles: P50, P95, P99, P99.9
- Deadline Misses: Number of period time overruns
Why is Mutex Slow?
Hidden Costs of Mutex
| Operation | Latency |
|---|---|
| Uncontended mutex lock | ~25-75 ns |
| Contended mutex lock | ~1-15 us (kernel futex call) |
| Condition variable notify | ~2-10 us |
| Thread wakeup | ~10-50 us (context switching) |
| Total per communication | ~50-100 us |
Mutex Communication Flow
Thread A                     Kernel                     Thread B
   |                            |                          |
   +-- mutex.lock() ----------->|                          |
   |       (futex syscall)      |                          |
   |<-- acquired ---------------|                          |
   |   [critical section]       |                          |
   +-- cv.notify() ------------>|----- wakeup ------------>|
   |       (futex syscall)      |     (scheduling)         |
   |                            |                          |
   +----------------- Total ~50-100 us -------------------+
Core Problem: Kernel intervention (futex syscall) occurs under contention.
The Principle of Lock-free
Lock-free Costs
| Operation | Latency |
|---|---|
| atomic_store (release) | ~10-20 ns |
| atomic_load (acquire) | ~10-20 ns |
| std::this_thread::yield() | ~100-500 ns |
| Kernel intervention | None |
| Total per communication | ~0.7-0.8 us |
Lock-free Communication Flow
Thread A Thread B
| |
+-- atomic_store() ------------------>|
| (single CPU instruction, ~10ns) |
| +-- atomic_load()
| | (single CPU instruction)
| |
+---------- Total ~0.7-0.8us ---------+
Key Difference: Synchronization using only CPU instructions without kernel intervention.
x86 Memory Ordering
x86 has a strong memory model (TSO), so:
- memory_order_acquire loads and memory_order_release stores compile to plain loads and stores, with no additional fence instructions
- Effectively zero overhead
Detailed Benchmark Results
After Optimization (Lock-free)
| IPC Method | Period | Mean (us) | P99 (us) | Jitter (us) | Misses |
|---|---|---|---|---|---|
| in_process | 1ms | 0.79 | 1.90 | 999.21 | 0 |
| in_process | 2ms | 0.74 | 1.34 | 1999.26 | 0 |
| in_process | 4ms | 0.76 | 1.41 | 3999.24 | 0 |
| in_process | 10ms | 0.82 | 2.30 | 9999.18 | 0 |
| shared_memory | 1ms | 0.78 | 1.29 | 999.22 | 0 |
| shared_memory | 2ms | 0.76 | 1.79 | 1999.24 | 0 |
| shared_memory | 4ms | 0.75 | 1.45 | 3999.25 | 0 |
| shared_memory | 10ms | 0.77 | 1.82 | 9999.23 | 0 |
| pipe | 1ms | 68.92 | 174.00 | 931.08 | 0 |
| pipe | 2ms | 86.62 | 179.07 | 1913.38 | 0 |
| pipe | 4ms | 79.47 | 173.84 | 3920.53 | 0 |
| pipe | 10ms | 82.06 | 174.79 | 9917.94 | 0 |

Note: the "Jitter" values in this table equal the period minus the mean latency (i.e., the headroom remaining before the deadline), not the latency variation defined earlier.
Final Performance Ranking
- in_process (Lock-free): 0.74 us
- shared_memory (Lock-free): 0.75 us
- pipe (Kernel IPC): 68.92 us (~90x slower)
Why In-Process = Shared Memory with Lock-free?
Both use the same Lock-free Atomic pattern:
- Same spin-wait mechanism
- Only difference: memory location (heap vs shared memory)
Memory location does not affect performance.
Lock-free Implementation Checklist
- Cache Line Alignment (Prevent False Sharing)

```cpp
alignas(64) std::atomic<size_t> writeIdx_;  // 64-byte alignment
alignas(64) std::atomic<size_t> readIdx_;
```

- Correct Memory Ordering (release/acquire semantics)

```cpp
buffer_[writeIdx].store(data, std::memory_order_release);
auto data = buffer_[readIdx].load(std::memory_order_acquire);
```

- Power-of-2 Buffer Size (Fast modulo operation)

```cpp
constexpr size_t BUFFER_SIZE = 8192;              // 2^13
size_t next = (current + 1) & (BUFFER_SIZE - 1);  // & instead of %
```

- Fixed-size Messages (No dynamic allocation)
Common Implementation Mistakes
| Mistake | Problem |
|---|---|
| Using non-atomic operations | Data races |
| memory_order_relaxed | No ordering guarantees |
| Missing cache line alignment | False Sharing |
Architecture Choice: In-Process vs Inter-Process
| Criterion | In-Process (Lock-free) | Inter-Process |
|---|---|---|
| Latency | ~0.7 us | ~50-200 us |
| Determinism | Very high | Kernel scheduler dependent |
| Fault Isolation | None | Process-level isolation |
| Fault Behavior | Full stop (Fail-Stop) | Partial operation (Dangerous) |
| Recovery Method | Full restart | Individual process restart |
| Memory Protection | None (bug corrupts everything) | Address space separation |
| Debugging | Easy | Distributed tracing required |
Why Fail-Stop Matters in Robot Control
Scenario: IK Solver process crashes
| Architecture | Behavior | Result |
|---|---|---|
| Inter-Process | Other processes continue running | Motors continue with stale commands -> Uncontrolled state |
| In-Process | Entire process stops | Motor brakes automatically engage -> Safe state |
"Partial failure is more dangerous than complete stop."
Recommended Architecture: Hybrid
+----------------------------------------------------------+
| Real-time Control Core (In-Process, Lock-free) |
| Planner -> IK -> EtherCAT -> Motor Driver |
| - 1kHz+ control loop |
| - Fail-Stop on failure |
| - Lock-free Atomic synchronization |
+----------------------------------------------------------+
| IPC (Shared Memory / Socket) - Non-realtime |
+----------------------------------------------------------+
| [Camera] [UI Server] [Logging] [Monitoring] |
| - Individual process failures tolerated |
| - Independent restart possible |
| - No impact on real-time core |
+----------------------------------------------------------+
Recommendations by Use Case
| Scenario | Recommended Approach | Reason |
|---|---|---|
| 1kHz+ real-time control | In-Process + Lock-free | Lowest latency, Fail-Stop |
| Multi-machine distributed | Inter-Process + Socket | Network communication required |
| Process isolation needed | Inter-Process + SHM | Security/stability |
| Rapid prototyping | In-Process + Mutex | Simple implementation |
| Production robots | Hybrid | Real-time core + non-real-time services |
Synchronization Mechanism Selection Criteria
| Requirement | Recommendation |
|---|---|
| < 10us latency needed | Lock-free Atomic (~0.7-1 us, high complexity) |
| > 100us latency acceptable | Mutex-based queue (~50-100 us, simple implementation) |
Key Takeaways
- "Performance is determined by the synchronization mechanism," not process boundaries.
  - Mutex-based in_process: 78-103 us
  - Lock-free shared_memory: 0.7-0.8 us
- With Lock-free, in_process and shared_memory have identical performance: both ~0.7 us.
- Hidden cost of Mutex: kernel futex calls cause ~50-100 us of delay under contention.
- Key to Lock-free: synchronization uses only CPU atomic instructions, with no kernel intervention.
- Fail-Stop matters in robot control: partial failure is more dangerous than a complete stop.
- Hybrid architecture recommended: In-Process + Lock-free for the real-time core, Inter-Process for non-real-time services.