1kHz Real-Time Robot Control System Monitoring Architecture
Embedded/Real-time

Design a monitoring architecture that safely collects, stores, and visualizes performance data from EtherCAT-based 1kHz real-time control loops. Maintain RT determinism while capturing debugging data using Lock-free SPSC Queues and ROS2 topic layering.

WRWIM Robotics Team
Tags: robotics, real-time, monitoring, ros2, ethercat, lock-free

When collecting performance data from a 1kHz real-time control loop, you face a dilemma:

| RT Loop Requirements | Monitoring Requirements |
| --- | --- |
| Deterministic execution | Data storage (memory/disk I/O) |
| No memory allocation | Network transmission (ROS2 publishing) |
| No blocking calls | Statistics calculation |
| No exception handling | Visualization |

This article designs a monitoring architecture that collects high-resolution performance data while maintaining RT determinism.

Architecture Overview

Core Pattern: RT/Non-RT Producer-Consumer

  • Producer: RT thread pushes data to lock-free queue (no blocking)
  • Consumer: Non-RT thread polls queue and publishes to ROS2 topics

Why Thread Separation? (Instead of Process Separation)

| Aspect | Thread Separation | Process Separation |
| --- | --- | --- |
| Memory sharing | Direct heap memory access | IPC required |
| Latency | Microsecond level | Additional IPC overhead |
| Lifecycle management | Single process | Multiple processes |

Data Structures

RtSample (216 bytes)

Complete RT cycle data for each 1ms period:

| Category | Data | Bytes | Purpose |
| --- | --- | --- | --- |
| Timing | monotonic_ns, sequence | 16 | Sequence tracking |
| Loop Performance | loop_exec_us, loop_period_us, loop_jitter_us, deadline_miss | 17 | RT performance analysis |
| Joint State | 6-axis actual/cmd (position, velocity, torque) | 144 | Control quality evaluation |
| Drive State | CiA 402 status_word, control_word, op_mode | 30 | Servo diagnostics |
| Fieldbus | wkc, wkc_mismatch, link_error | 4 | Communication stability |

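The listed fields total 211 bytes; the remaining 5 bytes of the 216-byte struct are presumably alignment padding.

Constraint: RtSample must be trivially copyable (std::is_trivially_copyable) so that it can be copied into the lock-free queue with no allocation and no non-trivial constructor or destructor work.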
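One way to lay out a struct that fits this byte budget is sketched below. The source does not show the actual definition, so the field types, the field order, and the per-axis array names are assumptions chosen only to reproduce the per-category sizes in the table. The two static_asserts encode the stated constraints; the size check assumes a typical 64-bit Linux ABI.

```cpp
#include <cstdint>
#include <type_traits>

constexpr int kAxes = 6;

// Hypothetical layout: types are guesses that reproduce the table's byte counts.
struct RtSample {
    // Timing (16 bytes)
    std::uint64_t monotonic_ns;
    std::uint64_t sequence;
    // Loop performance (17 bytes)
    std::uint32_t loop_exec_us;
    std::uint32_t loop_period_us;
    std::int64_t  loop_jitter_us;
    std::uint8_t  deadline_miss;
    // Joint state (144 bytes): 6 axes x actual/cmd x position/velocity/torque
    float actual_pos[kAxes];
    float actual_vel[kAxes];
    float actual_torque[kAxes];
    float cmd_pos[kAxes];
    float cmd_vel[kAxes];
    float cmd_torque[kAxes];
    // Drive state (30 bytes): CiA 402 words per axis
    std::uint16_t status_word[kAxes];
    std::uint16_t control_word[kAxes];
    std::int8_t   op_mode[kAxes];
    // Fieldbus (4 bytes)
    std::uint16_t wkc;
    std::uint8_t  wkc_mismatch;
    std::uint8_t  link_error;
};

// Fail the build if a later edit breaks either constraint.
static_assert(std::is_trivially_copyable<RtSample>::value,
              "RtSample must be trivially copyable");
static_assert(sizeof(RtSample) == 216,
              "RtSample is expected to occupy 216 bytes including padding");
```

Keeping the checks next to the struct means a field added during debugging cannot silently change the queue's memory footprint.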

Event (64 bytes, Cache-line aligned)

Structure for event-based notifications:

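// Assumes EventType and EventSeverity are enums with a 1-byte underlying type.
struct alignas(64) Event {
    // Classification (4 bytes)
    EventType type;
    uint8_t source_id;
    EventSeverity severity;
    uint8_t joint_id;
    // 4 bytes of implicit padding so the uint64_t fields are 8-byte aligned

    // Timing (24 bytes)
    uint64_t monotonic_ns;
    uint64_t event_sequence;
    uint64_t ref_sample_seq;

    // Data (26 bytes)
    int32_t error_code;
    uint8_t extra_len;
    uint8_t extra[21];
    // 2 bytes of implicit padding so value is 4-byte aligned

    // Numeric (4 bytes)
    float value;
};
static_assert(sizeof(Event) == 64, "Event must fill exactly one cache line");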

Lock-Free SPSC Queue

Implementation Choice: rigtorp::SPSCQueue

  • Wait-free: Producer performs at most 2 atomic loads on 64-bit Linux
  • False sharing prevention: 64-byte cache line alignment

// Cache line alignment to prevent False Sharing
static constexpr size_t kCacheLineSize = 64;

alignas(kCacheLineSize) std::atomic<size_t> writeIdx_ = {0}; // Only Producer writes
alignas(kCacheLineSize) size_t readIdxCache_ = 0; // Producer local cache
alignas(kCacheLineSize) std::atomic<size_t> readIdx_ = {0}; // Only Consumer writes
alignas(kCacheLineSize) size_t writeIdxCache_ = 0; // Consumer local cache
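To make the role of these four members concrete, here is a minimal single-file sketch of a ring buffer that uses the same index layout. It is an illustrative stand-in, not the rigtorp::SPSCQueue source; the class name SpscRing and its try_push/try_pop methods are assumptions made for this example. It requires C++17 and allocates its storage in the constructor, so it should be built before the RT loop starts.

```cpp
#include <atomic>
#include <cstddef>
#include <optional>
#include <vector>

// Illustrative single-producer/single-consumer ring (hypothetical class).
template <typename T>
class SpscRing {
public:
    // One extra slot stays unused so "full" and "empty" are distinguishable.
    explicit SpscRing(std::size_t capacity) : slots_(capacity + 1) {}

    // Producer side: never blocks; returns false when the queue is full.
    bool try_push(const T& value) {
        const std::size_t write = writeIdx_.load(std::memory_order_relaxed);
        const std::size_t next = (write + 1 == slots_.size()) ? 0 : write + 1;
        if (next == readIdxCache_) {
            // The cached view says "full": refresh it from the consumer's index.
            readIdxCache_ = readIdx_.load(std::memory_order_acquire);
            if (next == readIdxCache_) return false;
        }
        slots_[write] = value;
        writeIdx_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side: returns std::nullopt when the queue is empty.
    std::optional<T> try_pop() {
        const std::size_t read = readIdx_.load(std::memory_order_relaxed);
        if (read == writeIdxCache_) {
            // The cached view says "empty": refresh it from the producer's index.
            writeIdxCache_ = writeIdx_.load(std::memory_order_acquire);
            if (read == writeIdxCache_) return std::nullopt;
        }
        T value = slots_[read];
        const std::size_t next = (read + 1 == slots_.size()) ? 0 : read + 1;
        readIdx_.store(next, std::memory_order_release);
        return value;
    }

private:
    static constexpr std::size_t kCacheLineSize = 64;
    std::vector<T> slots_;
    alignas(kCacheLineSize) std::atomic<std::size_t> writeIdx_{0};  // only producer writes
    alignas(kCacheLineSize) std::size_t readIdxCache_ = 0;          // producer local cache
    alignas(kCacheLineSize) std::atomic<std::size_t> readIdx_{0};   // only consumer writes
    alignas(kCacheLineSize) std::size_t writeIdxCache_ = 0;         // consumer local cache
};
```

The cached indices let each side skip the cross-core atomic load unless its own stale view suggests the queue is full or empty. In the RT thread, a failed try_push would typically just bump an overflow counter (the rt_overflow_delta metric) rather than retry.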

Performance Benchmark (100,000 iterations, Release -O2)

| Operation | Mean | P99 | Notes |
| --- | --- | --- | --- |
| RtSample creation + queue push | 0.3 us | 0.7 us | 216B struct |
| Event creation + queue push | ~0.1 us | 0.3 us | 64B struct |
| Memory usage (queue) | 1.8 MB | - | 8192 samples + 512 events |

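Total monitoring overhead: < 0.05% of the 1ms period on average. Even in a cycle that pushes both a sample and an event, the mean cost is about 0.4 us (0.04%); at P99 (0.7 us + 0.3 us) it is about 0.1%.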
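The figures in the table can be cross-checked with a few compile-time calculations. The sketch below assumes the queues hold exactly the payload sizes and slot counts listed (per-queue bookkeeping is ignored); the constant and function names are made up for this example.

```cpp
#include <cstddef>

constexpr std::size_t kSampleBytes = 216;   // RtSample size
constexpr std::size_t kEventBytes  = 64;    // Event size
constexpr std::size_t kSampleSlots = 8192;  // sample queue depth
constexpr std::size_t kEventSlots  = 512;   // event queue depth

// Payload bytes held by both queues (excludes queue bookkeeping).
constexpr std::size_t queue_payload_bytes() {
    return kSampleSlots * kSampleBytes + kEventSlots * kEventBytes;
}

// Share of one control period spent on monitoring, in percent.
constexpr double overhead_percent(double cost_us, double period_us) {
    return 100.0 * cost_us / period_us;
}

// 8192 * 216 + 512 * 64 = 1,802,240 bytes, i.e. about 1.8 MB.
static_assert(queue_payload_bytes() == 1802240, "queue payload is about 1.8 MB");
// A 0.3 us mean push against a 1000 us period stays under the 0.05% budget.
static_assert(overhead_percent(0.3, 1000.0) < 0.05, "mean overhead within budget");
```

At 1kHz, 8192 sample slots also correspond to roughly 8 seconds of buffering before the consumer must catch up.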

ROS2 Topic Design (3-Tier)

| Topic | Frequency | QoS | Purpose |
| --- | --- | --- | --- |
| /rt_raw | 1kHz | best_effort, depth=200 | Full recording, post-analysis |
| /rt_events | On event | reliable + transient_local, depth=50 | Event notifications |
| /rt_monitor_stats | 10Hz | reliable, depth=20 | Real-time health dashboard |

Decimation (1kHz to 10Hz)

// /rt_raw publishes every sample (1kHz)
rt_raw_pub_->publish(to_rt_raw(sample, now));
sample_count_++;

// /rt_monitor_stats uses 100:1 decimation (10Hz)
if (sample_count_ % 100 == 0) {
    publish_stats();
}
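The modulo counter decides when the 10Hz message goes out; what it summarizes is not shown in the source. As one possibility, the sketch below accumulates loop jitter over a fixed window and hands back the mean and maximum when the window closes. JitterStats, JitterWindow, and their fields are hypothetical names, not the article's actual message definition.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Hypothetical summary emitted once per decimation window.
struct JitterStats {
    double mean_us;
    double max_us;
    std::uint32_t count;
};

class JitterWindow {
public:
    explicit JitterWindow(std::uint32_t window) : window_(window) {}

    // Feed one sample; returns true and fills `out` when the window closes.
    bool add(double jitter_us, JitterStats& out) {
        sum_ += jitter_us;
        max_ = std::max(max_, jitter_us);
        if (++count_ < window_) return false;
        out = JitterStats{sum_ / count_, max_, count_};
        // Reset so the next window starts from a clean state.
        sum_ = 0.0;
        max_ = std::numeric_limits<double>::lowest();
        count_ = 0;
        return true;
    }

private:
    std::uint32_t window_;
    std::uint32_t count_ = 0;
    double sum_ = 0.0;
    double max_ = std::numeric_limits<double>::lowest();
};
```

With a window of 100, the consumer would call add() for every sample popped from the queue and publish only when it returns true, which keeps the statistics work entirely on the non-RT side.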

Edge Detection (Preventing Duplicate Events)

// Rising edge detection: emit event only when current=true && previous=false
const bool faulted = servo.faulted;
if (faulted && !prev_faulted_) {
    emit(EventType::SERVO_FAULT, ...);
}
prev_faulted_ = faulted;
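When several signals need the same treatment (one fault flag per axis, link errors, and so on), the pattern can be wrapped in a small helper. This is a sketch; the RisingEdge class name is an assumption and does not appear in the original code.

```cpp
// Remembers the previous level and reports only false -> true transitions.
class RisingEdge {
public:
    // Returns true exactly once per transition from false to true.
    bool update(bool current) {
        const bool rose = current && !previous_;
        previous_ = current;
        return rose;
    }

private:
    bool previous_ = false;
};
```

A fault that stays asserted for 500 cycles then produces one event instead of 500, and a new event fires only after the flag has cleared and risen again.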

Cooldown Mechanism (Preventing Event Storms)

A persistent condition such as repeated deadline misses can raise up to 1,000 events per second at 1kHz. To prevent this, apply a 100ms cooldown per event type, which caps each type at roughly 10 events per second:

// Ignore same event type if it occurs within 100ms
constexpr auto kEventCooldown = std::chrono::milliseconds(100);

if (now - last_event_time_[type] > kEventCooldown) {
    emit(type, ...);
    last_event_time_[type] = now;
}
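A self-contained version of this filter might look like the following sketch. It assumes event types can be mapped to small array indices and adds an explicit "has fired" flag so that the very first event of each type is always allowed, whatever the clock's epoch is. The EventCooldown class and its allow() method are hypothetical names.

```cpp
#include <array>
#include <chrono>
#include <cstddef>

// Per-type cooldown filter with fixed storage (no allocation after construction).
template <std::size_t NumTypes>
class EventCooldown {
public:
    using Clock = std::chrono::steady_clock;

    explicit EventCooldown(Clock::duration cooldown) : cooldown_(cooldown) {}

    // Returns true when an event of this type may be emitted at time `now`.
    bool allow(std::size_t type, Clock::time_point now) {
        if (type >= NumTypes) return false;  // unknown type: drop
        if (fired_[type] && now - last_[type] <= cooldown_) return false;
        fired_[type] = true;
        last_[type] = now;
        return true;
    }

private:
    Clock::duration cooldown_;
    std::array<Clock::time_point, NumTypes> last_{};
    std::array<bool, NumTypes> fired_{};
};
```

Passing `now` in as a parameter, rather than reading the clock inside, keeps the filter deterministic and lets it reuse the timestamp already taken for the current cycle.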

Storage and Visualization

MCAP Format

  • Native PlotJuggler support
  • Efficient time-based indexing
  • Compression options (zstd, lz4)

rosbag2 QoS Compatibility Note

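The /rt_raw topic is published with best_effort QoS. A subscription that requests reliable delivery is incompatible with a best_effort publisher, so if rosbag2 subscribes with reliable QoS it records nothing from /rt_raw. Pin the recorder's QoS with an override file: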

# qos_override.yaml
/rt_raw:
  reliability: best_effort
  history: keep_last
  depth: 200

ros2 bag record /rt_raw /rt_events --qos-profile-overrides-path qos_override.yaml

Storage Capacity Calculation

| Topic | Calculation | Hourly Storage |
| --- | --- | --- |
| /rt_raw | 1kHz x ~250B x 3600s | ~0.9 GB |
| /rt_events | Variable based on event frequency | 10-50 MB |
| /rt_monitor_stats | 10Hz x 80B x 3600s | ~3 MB |
| Total | | ~0.9-1.0 GB (before container overhead) |
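The fixed-rate rows follow from rate x message size x 3600 seconds. As a quick sanity check, the sketch below computes the raw payload per hour; it uses the approximate message sizes quoted in the table and ignores MCAP indexing, chunk headers, and compression, so real files will differ somewhat. The function name is made up for this example.

```cpp
#include <cstdint>

// Payload bytes written per hour by a topic published at a fixed rate.
constexpr std::uint64_t bytes_per_hour(std::uint64_t rate_hz, std::uint64_t msg_bytes) {
    return rate_hz * msg_bytes * 3600u;
}

// /rt_raw: 1000 Hz x 250 B x 3600 s = 900,000,000 B (about 0.9 GB).
static_assert(bytes_per_hour(1000, 250) == 900000000ull, "/rt_raw payload per hour");
// /rt_monitor_stats: 10 Hz x 80 B x 3600 s = 2,880,000 B (about 3 MB).
static_assert(bytes_per_hour(10, 80) == 2880000ull, "/rt_monitor_stats payload per hour");
```

The same function gives a lower bound on the disk space a retention window needs: a 60-minute window of /rt_raw alone holds about 0.9 GB of payload.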

Rolling Retention (External Script)

A two-stage cleanup policy keeps disk usage bounded during continuous, unattended operation:

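# Time-based: delete files older than RETENTION_MIN minutes (60 in this setup)
cleanup_old_files() {
    find "$BAG_DIR" -name "*.mcap" -mmin +"$RETENTION_MIN" -delete
}

# Capacity-based: FIFO deletion while disk usage exceeds 70%
# (sketch; assumes GNU df and that the newest file is the one being written)
cleanup_disk_space() {
    local usage oldest
    usage=$(df --output=pcent "$BAG_DIR" | tail -n 1 | tr -dc '0-9')
    while [ "$usage" -gt 70 ]; do
        # Delete oldest files first
        oldest=$(ls -1tr "$BAG_DIR"/*.mcap 2>/dev/null | head -n 1)
        [ -n "$oldest" ] || break
        rm -f -- "$oldest"
        usage=$(df --output=pcent "$BAG_DIR" | tail -n 1 | tr -dc '0-9')
    done
}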

PlotJuggler Configuration

Figure: PlotJuggler monitoring screen. Real-time visualization of 6-axis joint actual_pos vs cmd_pos using PlotJuggler.

Streaming Mode Settings:

  • Mode: ROS2 Topic Subscriber
  • Time Window: 180 seconds (3-minute rolling)
  • Max Points: 20,000

Exclude uint64 Fields

PlotJuggler may report truncation errors for uint64 fields such as monotonic_ns and sequence, because integers above 2^53 cannot be represented exactly as double-precision values. Exclude these fields from visualization.
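The underlying limit is easy to demonstrate, assuming the plotting tool stores every value as a double: a double has a 53-bit significand, so not every integer above 2^53 survives the conversion. The helper name below is made up for this example.

```cpp
#include <cstdint>

// True when `v` survives a round trip through double unchanged.
constexpr bool exact_as_double(std::uint64_t v) {
    // Values that round up to 2^64 cannot be cast back without overflow.
    return static_cast<double>(v) < 18446744073709551616.0 &&
           static_cast<std::uint64_t>(static_cast<double>(v)) == v;
}

constexpr std::uint64_t kTwoPow53 = 1ull << 53;

static_assert(exact_as_double(kTwoPow53), "2^53 itself is representable");
static_assert(!exact_as_double(kTwoPow53 + 1), "2^53 + 1 rounds to a neighbour");
```

For scale, 2^53 nanoseconds is about 104 days, so a nanosecond timestamp can cross the limit in practice, while a 1kHz sequence counter stays far below it.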

Monitoring Health Metrics

Self-Monitoring

| Field | Meaning | Warning Threshold |
| --- | --- | --- |
| rt_queue_fill_pct | SPSC queue utilization | > 70% |
| rt_overflow_delta | Queue overflow count | > 0 |
| publisher_lag_ms | RT to Non-RT propagation delay | > 50ms |
| seq_gap_count_delta | Lost sample count | > 0 |
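Two of these fields can be derived on the consumer side from data it already has. The sketch below shows one plausible way to do it, assuming the sequence number in each sample increments by one per RT cycle; SeqGapCounter and fill_pct are hypothetical names, not the article's implementation.

```cpp
#include <cstdint>

// Counts samples lost between consecutive sequence numbers seen by the consumer.
class SeqGapCounter {
public:
    // Returns the number of samples missing before `seq` (0 when contiguous).
    std::uint64_t observe(std::uint64_t seq) {
        std::uint64_t lost = 0;
        if (has_last_ && seq > last_ + 1) lost = seq - last_ - 1;
        has_last_ = true;
        last_ = seq;
        total_ += lost;
        return lost;
    }

    std::uint64_t total() const { return total_; }

private:
    bool has_last_ = false;
    std::uint64_t last_ = 0;
    std::uint64_t total_ = 0;
};

// Queue utilization in percent for a queue holding `size` of `capacity` items.
constexpr double fill_pct(std::uint64_t size, std::uint64_t capacity) {
    return capacity == 0
               ? 0.0
               : 100.0 * static_cast<double>(size) / static_cast<double>(capacity);
}
```

A non-zero gap on the consumer side means the producer dropped samples because the queue was full, so it should line up with the overflow counter reported by the RT thread.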

Operational Thresholds

| Metric | Normal | Warning | Critical |
| --- | --- | --- | --- |
| rt_queue_fill_pct | < 50% | > 70% | > 90% |
| publisher_lag_ms | < 10ms | > 50ms | > 100ms |
| loop_jitter_us (P99) | < 50us | > 100us | > 200us |
| seq_gap_count | 0 | > 0 | - |
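These thresholds map naturally onto a small classification function. The sketch below is one way to write it; the Health enum and classify function are hypothetical names, and values that fall between the Normal and Warning columns (for example a queue fill of 60%) are treated as Normal here, which is an assumption the table leaves open.

```cpp
enum class Health { Normal, Warning, Critical };

// Maps a metric to a level using "greater than" thresholds, as in the table.
constexpr Health classify(double value, double warning_above, double critical_above) {
    return value > critical_above ? Health::Critical
         : value > warning_above  ? Health::Warning
                                  : Health::Normal;
}

// rt_queue_fill_pct: warning above 70%, critical above 90%.
static_assert(classify(95.0, 70.0, 90.0) == Health::Critical, "queue nearly full");
// loop_jitter_us (P99): warning above 100 us, critical above 200 us.
static_assert(classify(150.0, 100.0, 200.0) == Health::Warning, "jitter elevated");
```

Checking the critical threshold first keeps the function correct even if a caller passes a value above both limits.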

Operational Recommendations

Development/Debugging Purpose

This monitoring system is designed for development and debugging purposes.

| Situation | Recommendation |
| --- | --- |
| Development environment | Enable PlotJuggler real-time visualization |
| Production | Disable 1kHz topic publishing or enable only when needed |
| When issues occur | Enable monitoring to analyze root cause |

Note: PlotJuggler real-time visualization consumes significant CPU resources, and publishing a topic at 1kHz adds system load of its own.

Key Takeaways

  1. RT/Non-RT Separation: A lock-free SPSC queue preserves RT thread determinism while transferring data to the non-RT thread.

  2. 3-Tier Topic Design: Optimize QoS for each purpose.
    • /rt_raw (1kHz): Full recording
    • /rt_events: Event notifications
    • /rt_monitor_stats (10Hz): Dashboard

  3. Lock-free Queue Performance: An RtSample push takes 0.3us on average, under 0.05% of the 1ms period.

  4. MCAP Rolling Retention: External scripts for time-based and capacity-based cleanup support continuous operation.

  5. Self-Monitoring: Track queue utilization, latency, and lost samples to verify the health of the monitoring system itself.

  6. Development Tool: Use PlotJuggler and 1kHz publishing primarily in development environments; enable them in production only when needed.