YOLO Model Optimization: Achieving 2x Faster Inference on Jetson Orin AGX with TensorRT and DeepStream

We reduced PyTorch-based YOLO model inference time from 15ms to 7ms, boosting FPS from 67 to 142. This article shares our optimization journey using TensorRT and DeepStream for an automated recycled plastic sorting system.

The Problem: Racing Against Milliseconds on the Conveyor

We needed to classify plastics moving across a conveyor belt at 0.5m/s. Three types: PP (polypropylene), PS (polystyrene), and PE (polyethylene). The entire process from detection to delta robot pickup had to complete within 50ms.

We started with a PyTorch + OpenCV combination. Inference time was 15ms at 67 FPS. While this met the 30 FPS requirement, detection misses occurred when objects were densely packed. We needed faster inference.

Item	Requirement	Initial Result
Target FPS	30+ FPS	67 FPS
Max Latency	Under 50ms	15ms
Detection Accuracy	Over 99%	99.2%
Conveyor Speed	0.4-0.6m/s	-
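
To make the timing budget concrete, the short calculation below converts latency into belt travel. It is a back-of-the-envelope sketch: the function name `travel_mm` is ours for illustration, and it assumes a constant belt speed.

```python
def travel_mm(speed_m_per_s: float, latency_ms: float) -> float:
    """Distance in millimetres an object moves on the belt during latency_ms."""
    # 1 m/s equals 1 mm/ms, so m/s multiplied by ms gives mm directly.
    return speed_m_per_s * latency_ms


if __name__ == "__main__":
    for label, ms in [("inference (15 ms)", 15.0), ("full budget (50 ms)", 50.0)]:
        print(f"{label}: {travel_mm(0.5, ms):.1f} mm at 0.5 m/s")
```

At 0.5m/s an object moves 7.5 mm during a 15ms inference and 25 mm over the whole 50ms budget, so every millisecond saved reduces the positional error the robot has to compensate for.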

Why Jetson Orin AGX?

Essential Requirements for Edge AI

Cloud inference was not an option. Network latency alone could exceed 50ms, and network failures would halt the entire line. Edge AI with on-site inference was essential.

Why we chose Jetson Orin AGX:

Item	Jetson Orin AGX Specifications
GPU	NVIDIA Ampere architecture, 2048 CUDA cores, 64 Tensor cores
AI Performance	Up to 275 TOPS (INT8)
CPU	12-core Arm Cortex-A78AE
Memory	64GB LPDDR5 (unified memory)
Power Consumption	15W - 60W (configurable)
Size	Compact module form factor

Industrial Environment Suitability

Requirement	Jetson Orin AGX
Real-time Processing	High-speed inference with 275 TOPS AI performance
Compact Installation	Small form factor for easy on-site installation
Heat/Power	Fanless or small heatsink operation, low power
Reliability	Industrial temperature range support (-25C to 80C)
Network Independence	Local inference unaffected by network failures

Advantages of Unified Memory Architecture

Desktop systems with a discrete GPU keep CPU and GPU memory separate. Processing data on the GPU requires copying it from CPU memory to GPU memory over PCIe, which can become a bottleneck.

Jetson uses Unified Memory Architecture. The CPU and GPU share the same physical memory, enabling direct access without data copying.

[Figure: Memory Architecture Comparison]

This architecture enables:

Zero-Copy Data Transfer: Camera frames can be processed directly on GPU without copying
Memory Efficiency: No need for duplicate data storage across CPU/GPU
Low Latency: No PCIe transfer delays

TensorRT: Inference-Optimized Engine

PyTorch Limitations

Using PyTorch-trained models directly for inference introduces overhead:

Dynamic computation graph construction overhead
Python GIL (Global Interpreter Lock) impact
Dynamic memory allocation

TensorRT exploits the fact that the model structure is fixed at inference time. Optimizations are performed once, when the engine is built, so the runtime only executes the precompiled computation.

Key Optimization Techniques

1. Layer Fusion

Without fusion, each layer in a deep learning model typically runs as one or more separate CUDA kernels. Each kernel reads its input from and writes its output to GPU memory, and each incurs kernel launch overhead.

TensorRT fuses consecutive layers into a single kernel:

[Figure: Layer Fusion Optimization]

The Convolution-BatchNorm-SiLU pattern repeats dozens of times in YOLO models. Fusing all these patterns significantly reduces kernel execution count.
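
One part of this fusion, folding BatchNorm into the preceding convolution, can be checked with plain arithmetic. The sketch below uses a single channel with a scalar weight (effectively a 1x1 convolution) and made-up parameter values. It only illustrates the algebra. The fusions TensorRT actually applies are chosen internally at build time.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm statistics into the conv weight and bias (one channel)."""
    scale = gamma / math.sqrt(var + eps)
    return weight * scale, (bias - mean) * scale + beta


def unfused(x, weight, bias, gamma, beta, mean, var, eps=1e-5):
    y = weight * x + bias                                  # step 1: convolution
    y = gamma * (y - mean) / math.sqrt(var + eps) + beta   # step 2: BatchNorm
    return silu(y)                                         # step 3: SiLU


def fused(x, folded_weight, folded_bias):
    # One multiply-add followed by the activation, with no intermediate tensors.
    return silu(folded_weight * x + folded_bias)


if __name__ == "__main__":
    params = dict(weight=0.8, bias=0.1, gamma=1.5, beta=-0.2, mean=0.05, var=0.9)
    w_f, b_f = fold_batchnorm(**params)
    print(unfused(2.0, **params), fused(2.0, w_f, b_f))
```

Both paths produce the same value up to floating-point rounding, but the fused path needs one pass over the data instead of three.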

2. Precision Calibration

Training uses FP32, but inference often works well with lower precision.

Jetson Orin's Ampere GPU accelerates FP16 and INT8 operations at the hardware level through Tensor Cores.

Mode	Advantages	Disadvantages	Recommended Use Case
FP32	No accuracy loss	No Tensor Core usage, slow	Debugging or accuracy verification
FP16	Tensor Core usage, 2x+ faster, half memory	Very rare accuracy loss	Jetson default recommendation
INT8	Maximum performance, optimal Tensor Core usage	Calibration required	Speed-critical applications

INT8 requires a Calibration process. Representative datasets are passed through the model to analyze activation value distributions at each layer and determine optimal scaling factors.
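
The idea of a scaling factor can be shown with the simplest possible scheme, symmetric max calibration, where the largest absolute activation seen during calibration maps to 127. This is a sketch for intuition only: TensorRT ships its own calibrators (entropy-based and min-max, among others), and the function names and sample values below are hypothetical.

```python
def calibrate_scale(activations):
    """Symmetric max calibration: map the largest |activation| to 127."""
    amax = max(abs(a) for a in activations)
    return amax / 127.0 if amax > 0 else 1.0


def quantize(x, scale):
    """Round to the nearest INT8 step and clamp to [-127, 127]."""
    q = round(x / scale)
    return max(-127, min(127, q))


def dequantize(q, scale):
    return q * scale


if __name__ == "__main__":
    samples = [-2.54, 0.3, 1.0, 2.1]   # stand-in for one layer's activations
    scale = calibrate_scale(samples)
    for x in samples:
        q = quantize(x, scale)
        print(f"{x:+.3f} -> {q:+4d} -> {dequantize(q, scale):+.3f}")
```

Values outside the calibrated range are clamped, which is why the calibration data must be representative of what the model sees in production.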

3. Kernel Auto-Tuning

For the same operation, the optimal CUDA kernel implementation varies with GPU architecture, input size, and batch size. During the engine build, TensorRT times candidate kernel implementations on the actual device and selects the fastest one for that environment.

Important: TensorRT engines are environment-specific. The engine must be built directly on the Jetson Orin AGX.
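
The build-time selection described above can be pictured as a small benchmark loop. The sketch below times two interchangeable pure-Python functions as stand-ins for candidate kernels and keeps the fastest. It illustrates the principle only and is not how TensorRT is implemented. All names are hypothetical.

```python
import time


def sum_loop(values):
    total = 0
    for v in values:
        total += v
    return total


def sum_builtin(values):
    return sum(values)


def pick_fastest(candidates, data, repeats=5):
    """Time each candidate on the given input; return (name, best_seconds) of the fastest."""
    timings = {}
    for name, fn in candidates.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn(data)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    winner = min(timings, key=timings.get)
    return winner, timings[winner]


if __name__ == "__main__":
    candidates = {"loop": sum_loop, "builtin": sum_builtin}
    name, seconds = pick_fastest(candidates, list(range(100_000)))
    print(f"selected '{name}' ({seconds * 1e3:.3f} ms)")
```

Because the winner depends on the machine that ran the benchmark, the result is only valid for that machine, which is the same reason an engine has to be built on the target device.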

4. Dynamic Tensor Memory

Intermediate tensors during inference can have their memory freed once they are no longer needed by subsequent layers. TensorRT analyzes tensor lifecycles to efficiently reuse memory. This optimization is particularly important in Jetson's constrained memory environment.
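
A toy version of lifetime-based reuse is sketched below. Each tensor has a first and last layer index at which it is needed, and a buffer is handed to a new tensor once its previous occupant is no longer used. This is a simplification under stated assumptions: it ignores tensor sizes and alignment, and the names are hypothetical rather than TensorRT internals.

```python
def assign_buffers(lifetimes):
    """Greedy buffer reuse.

    lifetimes: list of (name, first_use, last_use) in layer-index units.
    Returns {name: buffer_id}. A buffer is reused only when its previous
    tensor's last use is strictly before the new tensor's first use.
    """
    busy_until = []   # busy_until[i] = last_use of the tensor currently in buffer i
    assignment = {}
    for name, first, last in sorted(lifetimes, key=lambda t: t[1]):
        for buf_id, until in enumerate(busy_until):
            if until < first:
                busy_until[buf_id] = last
                assignment[name] = buf_id
                break
        else:
            # No free buffer: allocate a new one.
            busy_until.append(last)
            assignment[name] = len(busy_until) - 1
    return assignment


if __name__ == "__main__":
    # A simple chain: each tensor is produced by one layer and consumed by the next.
    chain = [("t0", 0, 1), ("t1", 1, 2), ("t2", 2, 3), ("t3", 3, 4)]
    print(assign_buffers(chain))
```

For the four-tensor chain in the example, two buffers are enough because only a layer's input and output are alive at the same time.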

TensorRT Conversion Workflow

ONNX files are platform-independent and can be exported on a server, but TensorRT engine building must be performed on the target device (Jetson).
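
The on-device build step can be scripted. The helper below only assembles a `trtexec` command line from flags that `trtexec` provides (`--onnx`, `--saveEngine`, `--fp16`, `--useDLACore`, `--allowGPUFallback`). The helper name and the file names are hypothetical, and the binary path assumes the default JetPack install location.

```python
import shlex

TRTEXEC = "/usr/src/tensorrt/bin/trtexec"  # assumed default JetPack location


def build_trtexec_cmd(onnx_path, engine_path, fp16=True, dla_core=None):
    """Return the trtexec argument list for building and saving an engine."""
    cmd = [TRTEXEC, f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")
    if dla_core is not None:
        # Layers the DLA cannot run fall back to the GPU.
        cmd += [f"--useDLACore={dla_core}", "--allowGPUFallback"]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so this works off-device too.
    print(shlex.join(build_trtexec_cmd("yolo.onnx", "yolo_fp16.engine")))
```

On the Jetson itself, the returned list can be passed to `subprocess.run(cmd, check=True)` to perform the build.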

DeepStream: End-to-End Video Processing Pipeline

Bottlenecks in Traditional Approach

A Python + OpenCV + PyTorch pipeline has several problems that DeepStream addresses:

[Figure: Pipeline Comparison]

Issues with the traditional approach on Jetson:

CPU Decoding Bottleneck: Software video decoding on the Arm CPU is slower than on a desktop CPU
Python Overhead: GIL and interpreter overhead
Inefficient Memory Usage: Fails to leverage unified memory advantages
Power Waste: CPU-intensive operations reduce power efficiency

DeepStream's Solution

DeepStream is designed around Jetson's unified memory and hardware accelerators. Decoding, preprocessing, and inference run on the GPU and dedicated hardware engines, and buffers pass between pipeline stages without extra memory copies.

Jetson-Specific Hardware Accelerator Utilization

Jetson Orin AGX has several built-in hardware accelerators beyond the GPU:

Hardware	Role	DeepStream Usage
NVDEC	Video decoding	H.264/H.265 hardware decoding
NVENC	Video encoding	Result video saving/streaming
VIC	Vision Image Compositor	Image resize, color conversion
DLA	Deep Learning Accelerator	Additional AI inference (GPU offload)

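DeepStream's plugins use NVDEC, NVENC, and VIC for the corresponding pipeline stages, which takes that work off the GPU. DLA is used only when it is enabled in the inference configuration.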

Multi-Stream Processing

A single DeepStream pipeline can process multiple camera streams simultaneously. Combined with Jetson's power efficiency, you can build multi-camera systems with low power consumption.

DLA (Deep Learning Accelerator) Utilization

Jetson Orin AGX has 2 built-in DLAs. DLA operates independently of the GPU as an AI accelerator. Running certain layers on DLA frees GPU resources for other tasks.

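[Figure: GPU + DLA Parallel Utilization]

To check how much of the model can run on DLA, build it with trtexec on DLA core 0 and allow unsupported layers to fall back to the GPU: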

/usr/src/tensorrt/bin/trtexec \
    --onnx=yolo.onnx \
    --useDLACore=0 \
    --allowGPUFallback \
    --fp16

DeepStream Pipeline Structure

DeepStream is built on the GStreamer plugin system:

[Figure: DeepStream Pipeline Architecture]

Element	Role	Jetson-Specific Features
Source	Input source	Direct CSI camera connection support (nvarguscamerasrc)
Decoder	Video decoding	NVDEC hardware acceleration (nvv4l2decoder)
Streammux	Stream multiplexer	Efficient batch composition in unified memory
PGIE	Primary GPU Inference	TensorRT + DLA selectable
SGIE	Secondary Inference	DLA available for secondary classification
Tracker	Object tracking	NvDCF (GPU-accelerated), IOU, etc.
OSD	On-Screen Display	GPU-based overlay rendering
Sink	Output	nv3dsink or nveglglessink (nvoverlaysink on older JetPack releases)
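
The PGIE stage is driven by a Gst-nvinfer configuration file. The sketch below generates a minimal `[property]` section with Python's `configparser`. The keys are standard Gst-nvinfer settings, but the file names and values are hypothetical, and a YOLO model additionally needs a custom bounding-box parser (`parse-bbox-func-name`, `custom-lib-path`), which is omitted here.

```python
import configparser
import io


def make_pgie_config(onnx_file, engine_file, num_classes, use_dla=False):
    """Return the text of a minimal Gst-nvinfer config (illustrative values)."""
    cfg = configparser.ConfigParser()
    cfg["property"] = {
        "gpu-id": "0",
        "onnx-file": onnx_file,
        "model-engine-file": engine_file,
        "batch-size": "1",
        "network-mode": "2",   # 0 = FP32, 1 = INT8, 2 = FP16
        "num-detected-classes": str(num_classes),
        "gie-unique-id": "1",
    }
    if use_dla:
        cfg["property"]["enable-dla"] = "1"
        cfg["property"]["use-dla-core"] = "0"
    buf = io.StringIO()
    cfg.write(buf, space_around_delimiters=False)
    return buf.getvalue()


if __name__ == "__main__":
    # Three classes: PP, PS, PE.
    print(make_pgie_config("yolo.onnx", "yolo_fp16.engine", num_classes=3))
```

Swapping models then amounts to changing the two file entries and, if needed, the class count.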

Case Study: Plastic Sorting System

Technical Challenges

We needed to classify plastics in real-time on a conveyor belt at a recycling facility and send signals to an automated sorting system. Based on classification results, a delta robot separates plastics into appropriate collection bins.

Low Latency: Fast conveyor speed requires minimal delay from detection to robot action
High Throughput: All objects must be detected even when densely packed
Lighting Variation Handling: Industrial lighting conditions are inconsistent
Similar Appearance: PP, PS, and PE are often difficult to distinguish visually
Edge Resource Constraints: Target performance must be achieved within limited computational resources

Performance Comparison Results

Experimental Environment

Item	Specification
Device	NVIDIA Jetson Orin AGX 64GB
JetPack	6.0
TensorRT	8.6.2
DeepStream	6.4
Model	YOLOv8s
Input Resolution	640x480
Power Mode	MAXN (60W)

Inference Speed Comparison

Environment	Inference Time	FPS	Speed Improvement
PyTorch (Python, FP32)	15ms	67 fps	Baseline
TensorRT (Python, FP16)	7ms	142 fps	2.1x
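
The FPS and speedup columns follow from the latencies. The sketch below reproduces the arithmetic, assuming one frame is processed at a time with no batching or pipelining. The function names are ours for illustration.

```python
def fps_from_latency(latency_ms):
    """Frames per second when each frame takes latency_ms, processed serially."""
    return 1000.0 / latency_ms


def summarize(baseline_ms, optimized_ms):
    return {
        "baseline_fps": fps_from_latency(baseline_ms),
        "optimized_fps": fps_from_latency(optimized_ms),
        "speedup": baseline_ms / optimized_ms,
        "latency_reduction_pct": 100.0 * (baseline_ms - optimized_ms) / baseline_ms,
    }


if __name__ == "__main__":
    for key, value in summarize(15.0, 7.0).items():
        print(f"{key}: {value:.1f}")
```

With 15ms and 7ms this gives roughly 67 fps and 143 fps, a speedup of about 2.1x, and a latency reduction of about 53%. The reported FPS figures are rounded, so they match this calculation only approximately.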

Accuracy Comparison

Accuracy changes with precision conversion:

Environment	mAP@0.5	mAP@0.5:0.95	mAP@0.5 Change
PyTorch (FP32)	99.2%	87.5%	Baseline
TensorRT (FP16)	99.1%	87.3%	-0.1 pt
TensorRT (INT8)	98.9%	86.1%	-0.3 pt

FP16 conversion costs only 0.1 percentage points of mAP@0.5 (0.2 points of mAP@0.5:0.95), which is practically negligible.

Per-Class Detection Performance

Class	Precision	Recall	AP@0.5
PP (Polypropylene)	99.3%	98.9%	99.1%
PS (Polystyrene)	98.8%	99.2%	99.0%
PE (Polyethylene)	99.1%	99.4%	99.3%

Test Conditions

Test Period: 2 weeks continuous operation
Training Images: Approximately 800,000
Conveyor Conditions: Approximately 0.5m/s

Insights from the Optimization Process

TensorRT Conversion Considerations

1. Always Build the Engine on Jetson

TensorRT engines are hardware-dependent. Engines built on a server will not work on Jetson.

2. Dynamic Shape vs Static Shape

In Jetson's constrained memory environment, Static Shape is more efficient. With a fixed input size, TensorRT can plan memory allocation and select kernels for exactly one shape instead of a range of shapes.

3. Consider DLA Utilization

Running some layers on DLA can save GPU resources. However, not all layers support DLA, so compatibility verification is necessary.

DeepStream Implementation Considerations

1. Jetson Power Mode Settings

Jetson Orin AGX supports multiple power modes:

# Maximum performance mode
sudo nvpmodel -m 0  # MAXN mode

# Check power mode
nvpmodel -q

2. Tracker Selection

Tracker	Characteristics	Jetson Suitability
IOU Tracker	Simple and fast	Highly suitable (low overhead)
NvDCF	GPU accelerated, high accuracy	Suitable (uses additional GPU resources)
DeepSORT	Re-ID based	Caution (requires additional model, resource intensive)

In conveyor environments, object movement direction is consistent, so IOU Tracker was sufficient.
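
The core of IOU-based tracking is small enough to sketch. The code below computes intersection over union and greedily pairs existing tracks with new detections. It is a simplified illustration, not DeepStream's IOU tracker implementation, and the 0.3 threshold is an arbitrary example value.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_tracks(tracks, detections, threshold=0.3):
    """Greedily pair each track id with its best unmatched detection index."""
    matches, used = {}, set()
    for track_id, box in tracks.items():
        best_idx, best_score = None, threshold
        for idx, det in enumerate(detections):
            if idx in used:
                continue
            score = iou(box, det)
            if score >= best_score:
                best_idx, best_score = idx, score
        if best_idx is not None:
            matches[track_id] = best_idx
            used.add(best_idx)
    return matches


if __name__ == "__main__":
    tracks = {1: (0, 0, 10, 10), 2: (100, 0, 110, 10)}
    detections = [(102, 0, 112, 10), (2, 0, 12, 10)]   # objects shifted along the belt
    print(match_tracks(tracks, detections))
```

Because objects on the belt move a short, predictable distance between frames, consecutive boxes overlap heavily and this kind of matching rarely confuses identities.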

Conclusion

Quantitative Results

Metric	PyTorch Baseline	After Optimization	Improvement
Inference Time	15ms	7ms	53% reduction
FPS	67 fps	142 fps	2.1x (112% increase)
Accuracy (mAP@0.5)	99.2%	99.1%	-0.1 pt (negligible)

Qualitative Results

Edge-Independent Operation: Fully autonomous on-site operation without network dependency
Improved Stability: Freedom from Python environment memory leaks and GIL issues
Power Efficiency: Low-power operation
Easy Maintenance: Configuration file-based pipeline makes model replacement simple

Key Takeaways

Jetson is not a server: Code and models that work on servers will not perform well as-is. Optimization for the Jetson environment is essential.
Leverage unified memory: Jetson's biggest advantage, unified memory, pairs well with zero-copy pipelines like DeepStream.
Utilize all hardware accelerators: Leverage not just the GPU, but also NVDEC, NVENC, and DLA for excellent end-to-end performance.
Set power mode appropriately: MAXN mode is not always optimal. Consider heat and power constraints. However, in our project, MAXN mode ran without issues even during extended operation in a hot waste processing facility during summer.

Jetson Orin AGX proved to be a suitable platform for industrial environments requiring real-time AI inference at the edge. Through TensorRT and DeepStream, we achieved near-server-level performance within constrained resources. Its value particularly shone in environments like conveyor plastic sorting where both latency and throughput are critical.
