
Lock-free vs Mutex: Robot Control System IPC Performance Benchmark

The deciding factor in robot control system IPC performance is the synchronization mechanism, not process boundaries. We share a case study achieving a ~100x improvement: from 78-103 us with Mutex to 0.74-0.82 us with Lock-free.

WRWIM Robotics Team
Tags: robotics · ipc · lock-free · mutex · real-time · performance


The intuition that "In-Process communication is faster than Inter-Process communication" is wrong.

What determines performance is the synchronization mechanism, not process boundaries.

In this article, we compare the performance of Mutex-based and Lock-free synchronization in a 4-stage pipeline simulating a robot control system.

Conclusions First

| IPC Method | Before Optimization (Mutex) | After Optimization (Lock-free) | Improvement |
|---|---|---|---|
| in_process | 78-103 us | 0.74-0.82 us | ~100x |
| shared_memory | 0.7-0.8 us | 0.75-0.78 us | No change |
| pipe | 33-106 us | 68-87 us | No change |

After applying Lock-free, in_process and shared_memory achieve identical performance.

Test Environment

4-Stage Pipeline (Robot Control Simulation)

Planner -> IK Solver -> EtherCAT Master -> Mock Hardware
| Stage | Processing Content |
|---|---|
| Planner | Cartesian trajectory generation |
| IK Solver | Joint angle calculation |
| EtherCAT Master | EtherCAT frame generation |
| Mock Hardware | Actuator response simulation |

Test Conditions

| Parameter | Value |
|---|---|
| Period | 1 ms, 2 ms, 4 ms, 10 ms |
| Test Duration | 10 seconds/test |
| Warmup | 100 iterations |
| DOF | 6 DOF |
| Message Size | Fixed (no dynamic allocation) |

Measured Metrics

  • Cycle Latency: Full pipeline round-trip time
  • Jitter: Latency variation
  • Percentiles: P50, P95, P99, P99.9
  • Deadline Misses: Number of period time overruns

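
For reference, the statistics side of these metrics is straightforward; below is a minimal hypothetical sketch (nearest-rank percentiles and a max-min spread as one common jitter definition; the actual timestamp capture and the benchmark's own jitter definition are omitted):

```cpp
#include <algorithm>
#include <vector>

// Nearest-rank percentile over a set of per-cycle latency samples.
// In the real harness each sample would be one pipeline round-trip,
// timestamped with a monotonic clock before send and after receive.
double percentile(std::vector<double> samples, double p) {
    std::sort(samples.begin(), samples.end());
    size_t idx = static_cast<size_t>(p / 100.0 * (samples.size() - 1));
    return samples[idx];
}

// One common jitter definition: max - min spread of the sample set.
double jitter(const std::vector<double>& samples) {
    auto [lo, hi] = std::minmax_element(samples.begin(), samples.end());
    return *hi - *lo;
}
```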
Why is Mutex Slow?

Hidden Costs of Mutex

| Operation | Latency |
|---|---|
| Uncontended mutex lock | ~25-75 ns |
| Contended mutex lock | ~1-15 us (kernel futex call) |
| Condition variable notify | ~2-10 us |
| Thread wakeup | ~10-50 us (context switch) |
| Total per communication | ~50-100 us |

Mutex Communication Flow

```
Thread A                 Kernel             Thread B
   |                       |                    |
   |---- mutex.lock() ---->|                    |
   |    (futex syscall)    |                    |
   |<----- acquired -------|                    |
   |   [critical section]  |                    |
   |---- cv.notify() ----->|------ wakeup ----->|
   |    (futex syscall)    |    (scheduling)    |
   |                       |                    |
   +------------- Total ~50-100 us ------------+
```

Core Problem: Kernel intervention (futex syscall) occurs under contention.
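
To illustrate where those costs sit, here is a minimal mutex + condition-variable channel (a hypothetical sketch, not the benchmark's actual implementation):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Minimal mutex + condition-variable channel. The costs from the table
// live in two places: notify_one() issues a futex syscall when the
// receiver is parked, and the receiver's wakeup pays kernel scheduling
// (context-switch) latency on top of that.
template <typename T>
class MutexChannel {
public:
    void send(const T& v) {
        {
            std::lock_guard<std::mutex> lk(m_);  // uncontended: ~25-75 ns
            q_.push(v);
        }
        cv_.notify_one();                        // futex syscall: ~2-10 us
    }
    T recv() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });  // sleep; wakeup ~10-50 us
        T v = q_.front();
        q_.pop();
        return v;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};
```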

The Principle of Lock-free

Lock-free Costs

| Operation | Latency |
|---|---|
| atomic_store (release) | ~10-20 ns |
| atomic_load (acquire) | ~10-20 ns |
| std::this_thread::yield() | ~100-500 ns |
| Kernel intervention | None |
| Total per communication | ~0.7-0.8 us |

Lock-free Communication Flow

```
Thread A                              Thread B
   |                                     |
   |--- atomic_store() ----------------->|
   |   (single CPU instruction, ~10 ns)  |
   |                                     |--- atomic_load()
   |                                     |   (single CPU instruction)
   |                                     |
   +--------- Total ~0.7-0.8 us ---------+
```

Key Difference: Synchronization using only CPU instructions without kernel intervention.
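
A minimal sketch of this handoff pattern (hypothetical, single-slot; the benchmark's actual transport is a ring buffer):

```cpp
#include <atomic>
#include <thread>

// Single-slot handoff using only CPU atomic instructions. The producer
// writes the payload, then publishes it with a release store; the
// consumer spins on an acquire load, so the payload write is guaranteed
// visible once the flag is observed. No syscalls on either side.
struct Handoff {
    std::atomic<bool> ready{false};
    int payload{0};

    void produce(int value) {
        payload = value;                               // plain write first
        ready.store(true, std::memory_order_release);  // publish: ~10-20 ns
    }
    int consume() {
        while (!ready.load(std::memory_order_acquire)) // spin: no kernel
            std::this_thread::yield();                 // ~100-500 ns per yield
        return payload;                                // visible after acquire
    }
};
```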

x86 Memory Ordering

x86 architecture has a strong memory model, so:

  • No additional instructions needed for memory_order_acquire/release
  • Effectively zero overhead
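
This property is worth asserting at build time; a small check (the `static_assert` guards against a platform where `size_t` atomics would silently fall back to a lock):

```cpp
#include <atomic>
#include <cstddef>

// On x86-64 an acquire load and a release store each compile to a plain
// mov; only seq_cst stores need an extra fence/xchg. Verify at compile
// time that the index type is lock-free (no hidden mutex inside).
static_assert(std::atomic<std::size_t>::is_always_lock_free,
              "size_t atomics must be lock-free for this design");

constexpr bool kIndexAtomicsAreLockFree =
    std::atomic<std::size_t>::is_always_lock_free;
```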

Detailed Benchmark Results

After Optimization (Lock-free)

| IPC Method | Period | Mean (us) | P99 (us) | Jitter (us) | Misses |
|---|---|---|---|---|---|
| in_process | 1ms | 0.79 | 1.90 | 999.21 | 0 |
| in_process | 2ms | 0.74 | 1.34 | 1999.26 | 0 |
| in_process | 4ms | 0.76 | 1.41 | 3999.24 | 0 |
| in_process | 10ms | 0.82 | 2.30 | 9999.18 | 0 |
| shared_memory | 1ms | 0.78 | 1.29 | 999.22 | 0 |
| shared_memory | 2ms | 0.76 | 1.79 | 1999.24 | 0 |
| shared_memory | 4ms | 0.75 | 1.45 | 3999.25 | 0 |
| shared_memory | 10ms | 0.77 | 1.82 | 9999.23 | 0 |
| pipe | 1ms | 68.92 | 174.00 | 931.08 | 0 |
| pipe | 2ms | 86.62 | 179.07 | 1913.38 | 0 |
| pipe | 4ms | 79.47 | 173.84 | 3920.53 | 0 |
| pipe | 10ms | 82.06 | 174.79 | 9917.94 | 0 |

Final Performance Ranking

  1. in_process (Lock-free): 0.74 us
  2. shared_memory (Lock-free): 0.75 us
  3. pipe (Kernel IPC): 68.92 us (~90x slower)

Why In-Process = Shared Memory with Lock-free?

Both use the same Lock-free Atomic pattern:

  • Same spin-wait mechanism
  • Only difference: memory location (heap vs shared memory)

Memory location does not affect performance.
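
A sketch of the point: the same slot type can be constructed on the heap (the in_process case) or placement-new'd into a mapped region (the shared_memory case). Here `region` is a static buffer standing in for an `shm_open` + `mmap` mapping; the atomic operations are identical either way:

```cpp
#include <atomic>
#include <cstddef>
#include <new>

// Fixed-size, lock-free slot: only its address differs between the
// in-process and shared-memory deployments.
struct Slot {
    alignas(64) std::atomic<std::size_t> seq{0};
    double joints[6]{};  // fixed-size payload (6 DOF)
};

// Stand-in for a shared mapping: any suitably aligned region works.
alignas(64) static unsigned char region[sizeof(Slot)];

inline Slot* make_heap_slot()   { return new Slot; }          // in_process
inline Slot* make_mapped_slot() { return new (region) Slot; } // shared_memory
```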

Lock-free Implementation Checklist

  1. Cache Line Alignment (prevent false sharing)

```cpp
alignas(64) std::atomic<size_t> writeIdx_;  // producer's index: own 64-byte line
alignas(64) std::atomic<size_t> readIdx_;   // consumer's index: separate line
```

  2. Correct Memory Ordering (release/acquire semantics)

```cpp
// Producer: fill the slot first, then publish the index with release.
buffer_[w & (BUFFER_SIZE - 1)] = data;
writeIdx_.store(w + 1, std::memory_order_release);

// Consumer: acquire the index; the slot contents are then guaranteed visible.
size_t w = writeIdx_.load(std::memory_order_acquire);
```

  3. Power-of-2 Buffer Size (fast modulo via bitmask)

```cpp
constexpr size_t BUFFER_SIZE = 8192;              // 2^13
size_t next = (current + 1) & (BUFFER_SIZE - 1);  // & instead of %
```

  4. Fixed-size Messages (no dynamic allocation on the hot path)
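
Putting the four items together, a minimal SPSC ring buffer sketch (hypothetical, bounded, fixed-size slots; not the benchmark's exact code):

```cpp
#include <atomic>
#include <cstddef>

// Single-producer single-consumer ring buffer applying the checklist:
// cache-line-aligned indices, release/acquire publication, power-of-2
// capacity (bitmask wrap), fixed-size slots.
template <typename T, std::size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of 2");
public:
    bool push(const T& v) {
        std::size_t w = writeIdx_.load(std::memory_order_relaxed);
        std::size_t r = readIdx_.load(std::memory_order_acquire);
        if (w - r == N) return false;                        // full
        buf_[w & (N - 1)] = v;                               // fill slot first...
        writeIdx_.store(w + 1, std::memory_order_release);   // ...then publish
        return true;
    }
    bool pop(T& out) {
        std::size_t r = readIdx_.load(std::memory_order_relaxed);
        std::size_t w = writeIdx_.load(std::memory_order_acquire);
        if (r == w) return false;                            // empty
        out = buf_[r & (N - 1)];
        readIdx_.store(r + 1, std::memory_order_release);    // release slot
        return true;
    }
private:
    alignas(64) std::atomic<std::size_t> writeIdx_{0};  // producer-owned
    alignas(64) std::atomic<std::size_t> readIdx_{0};   // consumer-owned
    T buf_[N];                                          // fixed-size slots
};
```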

Common Implementation Mistakes

| Mistake | Problem |
|---|---|
| Using non-atomic operations | Data races |
| Using memory_order_relaxed for publication | No ordering guarantees |
| Missing cache line alignment | False sharing |

Architecture Choice: In-Process vs Inter-Process

| Criterion | In-Process (Lock-free) | Inter-Process |
|---|---|---|
| Latency | ~0.7 us | ~50-200 us |
| Determinism | Very high | Kernel scheduler dependent |
| Fault Isolation | None | Process-level isolation |
| Fault Behavior | Full stop (Fail-Stop) | Partial operation (dangerous) |
| Recovery Method | Full restart | Individual process restart |
| Memory Protection | None (a bug corrupts everything) | Address space separation |
| Debugging | Easy | Distributed tracing required |

Why Fail-Stop Matters in Robot Control

Scenario: IK Solver process crashes

| Architecture | Behavior | Result |
|---|---|---|
| Inter-Process | Other processes continue running | Motors continue with stale commands -> Uncontrolled state |
| In-Process | Entire process stops | Motor brakes automatically engage -> Safe state |

"Partial failure is more dangerous than complete stop."

```
+----------------------------------------------------------+
| Real-time Control Core (In-Process, Lock-free)           |
| Planner -> IK -> EtherCAT -> Motor Driver                |
|  - 1kHz+ control loop                                    |
|  - Fail-Stop on failure                                  |
|  - Lock-free Atomic synchronization                      |
+----------------------------------------------------------+
| IPC (Shared Memory / Socket) - Non-realtime              |
+----------------------------------------------------------+
| [Camera] [UI Server] [Logging] [Monitoring]              |
|  - Individual process failures tolerated                 |
|  - Independent restart possible                          |
|  - No impact on real-time core                           |
+----------------------------------------------------------+
```

Recommendations by Use Case

| Scenario | Recommended Approach | Reason |
|---|---|---|
| 1kHz+ real-time control | In-Process + Lock-free | Lowest latency, Fail-Stop |
| Multi-machine distributed | Inter-Process + Socket | Network communication required |
| Process isolation needed | Inter-Process + SHM | Security/stability |
| Rapid prototyping | In-Process + Mutex | Simple implementation |
| Production robots | Hybrid | Real-time core + non-real-time services |

Synchronization Mechanism Selection Criteria

| Requirement | Recommendation |
|---|---|
| < 10 us latency needed | Lock-free Atomic (~0.7-1 us, high complexity) |
| > 100 us latency acceptable | Mutex-based queue (~50-100 us, simple implementation) |

Key Takeaways

  1. "Performance is determined by the synchronization mechanism." Not process boundaries.

    • Mutex-based in_process: 78-103 us
    • Lock-free shared_memory: 0.7-0.8 us
  2. With Lock-free, in_process and shared_memory have identical performance. Both ~0.7 us.

  3. Hidden cost of Mutex: Kernel futex calls cause ~50-100 us delay under contention.

  4. Key to Lock-free: Uses only CPU atomic instructions without kernel intervention.

  5. Fail-Stop matters in robot control: Partial failure is more dangerous than complete stop.

  6. Hybrid architecture recommended: In-Process + Lock-free for real-time core, Inter-Process for non-real-time services.