Lock-free vs Mutex: Robot Control System IPC Performance Benchmark
The intuition that "in-process communication is always faster than inter-process communication" is wrong.
What determines performance is the synchronization mechanism, not the process boundary.
In this article, we compare the performance of Mutex-based and Lock-free synchronization in a 4-stage pipeline simulating a robot control system.
Conclusions First
| IPC Method | Before Optimization (Mutex) | After Optimization (Lock-free) | Improvement |
|---|---|---|---|
| in_process | 78-103 us | 0.74-0.82 us | ~100x |
| shared_memory | 0.7-0.8 us | 0.75-0.78 us | No change |
| pipe | 33-106 us | 68-87 us | No change |
After applying Lock-free, in_process and shared_memory achieve identical performance.
Test Environment
4-Stage Pipeline (Robot Control Simulation)
Planner -> IK Solver -> EtherCAT Master -> Mock Hardware
| Stage | Processing Content |
|---|---|
| Planner | Cartesian trajectory generation |
| IK Solver | Joint angle calculation |
| EtherCAT Master | EtherCAT frame generation |
| Mock Hardware | Actuator response simulation |
Test Conditions
| Parameter | Value |
|---|---|
| Period | 1ms, 2ms, 4ms, 10ms |
| Test Duration | 10 seconds/test |
| Warmup | 100 iterations |
| DOF | 6 DOF |
| Message Size | Fixed (no dynamic allocation) |
Measured Metrics
- Cycle Latency: Full pipeline round-trip time
- Jitter: Latency variation
- Percentiles: P50, P95, P99, P99.9
- Deadline Misses: Number of period time overruns
Why is Mutex Slow?
Hidden Costs of Mutex
| Operation | Latency |
|---|---|
| Uncontended mutex lock | ~25-75 ns |
| Contended mutex lock | ~1-15 us (kernel futex call) |
| Condition variable notify | ~2-10 us |
| Thread wakeup | ~10-50 us (context switching) |
| Total per communication | ~50-100 us |
Mutex Communication Flow
Thread A                     Kernel                     Thread B
   |                            |                          |
   +-- mutex.lock() ----------->|                          |
   |       (futex syscall)      |                          |
   |<-- acquired ---------------|                          |
   |   [critical section]       |                          |
   +-- cv.notify() ------------>|----- wakeup ------------>|
   |       (futex syscall)      |     (scheduling)         |
   |                            |                          |
   +----------------- Total ~50-100 us -------------------+
Core Problem: Kernel intervention (futex syscall) occurs under contention.
The Principle of Lock-free
Lock-free Costs
| Operation | Latency |
|---|---|
| atomic_store (release) | ~10-20 ns |
| atomic_load (acquire) | ~10-20 ns |
| std::this_thread::yield() | ~100-500 ns |
| Kernel intervention | None |
| Total per communication | ~0.7-0.8 us |
Lock-free Communication Flow
Thread A Thread B
| |
+-- atomic_store() ------------------>|
| (single CPU instruction, ~10ns) |
| +-- atomic_load()
| | (single CPU instruction)
| |
+---------- Total ~0.7-0.8us ---------+
Key Difference: Synchronization using only CPU instructions without kernel intervention.
x86 Memory Ordering
x86 has a strong memory model (TSO), so:
- memory_order_acquire loads and memory_order_release stores compile to plain loads and stores, with no additional fence instructions
- Effectively zero overhead
Detailed Benchmark Results
After Optimization (Lock-free)
| IPC Method | Period | Mean (us) | P99 (us) | Jitter (us) | Misses |
|---|---|---|---|---|---|
| in_process | 1ms | 0.79 | 1.90 | 999.21 | 0 |
| in_process | 2ms | 0.74 | 1.34 | 1999.26 | 0 |
| in_process | 4ms | 0.76 | 1.41 | 3999.24 | 0 |
| in_process | 10ms | 0.82 | 2.30 | 9999.18 | 0 |
| shared_memory | 1ms | 0.78 | 1.29 | 999.22 | 0 |
| shared_memory | 2ms | 0.76 | 1.79 | 1999.24 | 0 |
| shared_memory | 4ms | 0.75 | 1.45 | 3999.25 | 0 |
| shared_memory | 10ms | 0.77 | 1.82 | 9999.23 | 0 |
| pipe | 1ms | 68.92 | 174.00 | 931.08 | 0 |
| pipe | 2ms | 86.62 | 179.07 | 1913.38 | 0 |
| pipe | 4ms | 79.47 | 173.84 | 3920.53 | 0 |
| pipe | 10ms | 82.06 | 174.79 | 9917.94 | 0 |

Note: the "Jitter" values in this table equal the period minus the mean latency (i.e., the headroom remaining before the deadline), not the latency variation defined earlier.
Final Performance Ranking
- in_process (Lock-free): 0.74 us
- shared_memory (Lock-free): 0.75 us
- pipe (Kernel IPC): 68.92 us (~90x slower)
Why In-Process = Shared Memory with Lock-free?
Both use the same Lock-free Atomic pattern:
- Same spin-wait mechanism
- Only difference: memory location (heap vs shared memory)
Memory location does not affect performance.
Lock-free Implementation Checklist
- Cache Line Alignment (Prevent False Sharing)

```cpp
alignas(64) std::atomic<size_t> writeIdx_;  // 64-byte alignment
alignas(64) std::atomic<size_t> readIdx_;
```

- Correct Memory Ordering (release/acquire semantics)

```cpp
buffer_[writeIdx].store(data, std::memory_order_release);
auto data = buffer_[readIdx].load(std::memory_order_acquire);
```

- Power-of-2 Buffer Size (Fast modulo operation)

```cpp
constexpr size_t BUFFER_SIZE = 8192;              // 2^13
size_t next = (current + 1) & (BUFFER_SIZE - 1);  // & instead of %
```

- Fixed-size Messages (No dynamic allocation)
Common Implementation Mistakes
| Mistake | Problem |
|---|---|
| Using non-atomic operations | Data races |
| memory_order_relaxed | No ordering guarantees |
| Missing cache line alignment | False Sharing |
Architecture Choice: In-Process vs Inter-Process
| Criterion | In-Process (Lock-free) | Inter-Process |
|---|---|---|
| Latency | ~0.7 us | ~50-200 us |
| Determinism | Very high | Kernel scheduler dependent |
| Fault Isolation | None | Process-level isolation |
| Fault Behavior | Full stop (Fail-Stop) | Partial operation (Dangerous) |
| Recovery Method | Full restart | Individual process restart |
| Memory Protection | None (bug corrupts everything) | Address space separation |
| Debugging | Easy | Distributed tracing required |
Why Fail-Stop Matters in Robot Control
Scenario: IK Solver process crashes
| Architecture | Behavior | Result |
|---|---|---|
| Inter-Process | Other processes continue running | Motors continue with stale commands -> Uncontrolled state |
| In-Process | Entire process stops | Motor brakes automatically engage -> Safe state |
"Partial failure is more dangerous than complete stop."
Recommended Architecture: Hybrid
+----------------------------------------------------------+
| Real-time Control Core (In-Process, Lock-free) |
| Planner -> IK -> EtherCAT -> Motor Driver |
| - 1kHz+ control loop |
| - Fail-Stop on failure |
| - Lock-free Atomic synchronization |
+----------------------------------------------------------+
| IPC (Shared Memory / Socket) - Non-realtime |
+----------------------------------------------------------+
| [Camera] [UI Server] [Logging] [Monitoring] |
| - Individual process failures tolerated |
| - Independent restart possible |
| - No impact on real-time core |
+----------------------------------------------------------+
Recommendations by Use Case
| Scenario | Recommended Approach | Reason |
|---|---|---|
| 1kHz+ real-time control | In-Process + Lock-free | Lowest latency, Fail-Stop |
| Multi-machine distributed | Inter-Process + Socket | Network communication required |
| Process isolation needed | Inter-Process + SHM | Security/stability |
| Rapid prototyping | In-Process + Mutex | Simple implementation |
| Production robots | Hybrid | Real-time core + non-real-time services |
Synchronization Mechanism Selection Criteria
| Requirement | Recommendation |
|---|---|
| < 10us latency needed | Lock-free Atomic (~0.7-1 us, high complexity) |
| > 100us latency acceptable | Mutex-based queue (~50-100 us, simple implementation) |
Key Takeaways
- "Performance is determined by the synchronization mechanism," not process boundaries.
  - Mutex-based in_process: 78-103 us
  - Lock-free shared_memory: 0.7-0.8 us
- With Lock-free, in_process and shared_memory have identical performance: both ~0.7 us.
- Hidden cost of Mutex: kernel futex calls cause ~50-100 us of delay under contention.
- Key to Lock-free: synchronization uses only CPU atomic instructions, with no kernel intervention.
- Fail-Stop matters in robot control: partial failure is more dangerous than a complete stop.
- Hybrid architecture recommended: In-Process + Lock-free for the real-time core, Inter-Process for non-real-time services.