Overview
ONNX Runtime provides extensive performance tuning options to optimize model inference and training. This guide covers the key configuration options and best practices for achieving optimal performance.
Session Configuration
Creating an Optimized Session
Use SessionOptions to configure performance settings:
import onnxruntime as ort
# Create session options
session_options = ort.SessionOptions()
# Set graph optimization level
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Enable profiling (adds overhead; turn it off for production runs)
session_options.enable_profiling = True
# Create inference session
session = ort.InferenceSession("model.onnx", session_options)
Graph Optimization Levels
ONNX Runtime provides different optimization levels:
- ORT_DISABLE_ALL: No optimizations applied
- ORT_ENABLE_BASIC: Basic optimizations like constant folding, redundant node elimination
- ORT_ENABLE_EXTENDED: Basic optimizations plus more complex node fusions
- ORT_ENABLE_ALL: All available optimizations, including layout optimizations (recommended for production)
// C++ API
#include <onnxruntime_cxx_api.h>

Ort::SessionOptions session_options;
session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
Execution Providers
Selecting Execution Providers
Execution providers enable hardware acceleration. In the Python API, pass them in priority order through the providers argument when creating the session:
# Providers are tried in list order; each entry is a name or a (name, options) tuple
providers = [
    # TensorRT acceleration (listed first so it takes precedence over CUDA)
    ('TensorrtExecutionProvider', {
        'device_id': 0,
        'trt_max_workspace_size': 2147483648,  # 2 GiB
        'trt_fp16_enable': True,
    }),
    # CUDA GPU acceleration
    ('CUDAExecutionProvider', {
        'device_id': 0,
        'arena_extend_strategy': 'kNextPowerOfTwo',
        'gpu_mem_limit': 2 * 1024 * 1024 * 1024,  # 2 GiB
        'cudnn_conv_algo_search': 'EXHAUSTIVE',
    }),
    # CPU fallback
    'CPUExecutionProvider',
]
session = ort.InferenceSession("model.onnx", session_options, providers=providers)
Common Execution Provider Options
CUDA Provider
- device_id: GPU device ID
- arena_extend_strategy: Memory allocation strategy
- gpu_mem_limit: Maximum GPU memory usage
- cudnn_conv_algo_search: Algorithm selection (DEFAULT, EXHAUSTIVE, HEURISTIC)
TensorRT Provider
- trt_fp16_enable: Enable FP16 precision
- trt_int8_enable: Enable INT8 quantization
- trt_max_workspace_size: Maximum workspace size for TensorRT
- trt_engine_cache_enable: Cache compiled engines
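A small helper can filter a preferred provider list down to what the installed build supports. The sketch below is illustrative only: `select_providers` is a hypothetical name, not part of ONNX Runtime, and the `available` list is a stand-in for the result of `ort.get_available_providers()`.

```python
def select_providers(available, preferred):
    """Return the preferred providers that are available, keeping priority order.

    `preferred` holds provider names or (name, options) tuples. The CPU
    provider is appended as a final fallback if it is not already selected.
    """
    selected = []
    for entry in preferred:
        name = entry[0] if isinstance(entry, tuple) else entry
        if name in available:
            selected.append(entry)
    names = [e[0] if isinstance(e, tuple) else e for e in selected]
    if "CPUExecutionProvider" not in names:
        selected.append("CPUExecutionProvider")
    return selected


preferred = [
    ("TensorrtExecutionProvider", {"device_id": 0, "trt_fp16_enable": True}),
    ("CUDAExecutionProvider", {"device_id": 0, "gpu_mem_limit": 2 * 1024**3}),
]

# Pretend only CUDA and CPU are present (stand-in for ort.get_available_providers()).
available = ["CUDAExecutionProvider", "CPUExecutionProvider"]
providers = select_providers(available, preferred)
print(providers)
```

With onnxruntime installed, the resulting list would typically be passed as `ort.InferenceSession("model.onnx", session_options, providers=providers)`.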
Intra-Op and Inter-Op Parallelism
Thread Configuration
Control parallelism for optimal CPU utilization:
# Intra-op threads: parallelism within a single operator
session_options.intra_op_num_threads = 4
# Inter-op threads: parallelism across operators (only used with ORT_PARALLEL)
session_options.inter_op_num_threads = 2
# Execution mode: sequential is the default
session_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
# or, to let independent operators run concurrently
session_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL
Execution Modes
- ORT_SEQUENTIAL: Operators are executed sequentially (lower overhead)
- ORT_PARALLEL: Operators can be executed in parallel (better for models with independent ops)
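The trade-off between the two modes can be captured in a small helper that proposes starting values. This is a sketch under stated assumptions: `suggest_thread_settings` is a hypothetical helper, and the core-splitting heuristic is an illustration to be validated by measurement, not an ONNX Runtime recommendation.

```python
import os


def suggest_thread_settings(logical_cores=None, parallel=False):
    """Return starting values for the thread-related session options.

    Heuristic (an assumption for illustration): in sequential mode all cores
    go to intra-op work; in parallel mode the cores are split between two
    inter-op workers.
    """
    cores = logical_cores or os.cpu_count() or 1
    if not parallel:
        return {
            "execution_mode": "ORT_SEQUENTIAL",
            "intra_op_num_threads": cores,
            "inter_op_num_threads": 1,
        }
    inter = min(2, cores)
    return {
        "execution_mode": "ORT_PARALLEL",
        "intra_op_num_threads": max(1, cores // inter),
        "inter_op_num_threads": inter,
    }


settings = suggest_thread_settings(logical_cores=8, parallel=True)
print(settings)
```

The returned mode name can be resolved with `getattr(ort.ExecutionMode, settings["execution_mode"])` before assigning it to `session_options.execution_mode`.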
Model Optimization
Offline Optimization
Save optimized models for faster startup:
session_options.optimized_model_filepath = "optimized_model.onnx"
session = ort.InferenceSession("model.onnx", session_options)
Optimization Configuration
Fine-tune specific optimizations:
# Pin the symbolic dimension "batch_size" to 1 so shape-dependent optimizations can apply
session_options.add_free_dimension_override_by_name("batch_size", 1)
# Enable model serialization after optimization
session_options.optimized_model_filepath = "optimized.onnx"
Memory Management
Memory Pattern Optimization
# Enable memory pattern optimization
session_options.enable_mem_pattern = True
# Enable CPU memory arena
session_options.enable_cpu_mem_arena = True
Arena Configuration
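// C++ API - Configure a memory arena through the underlying C API
const OrtApi& api = Ort::GetApi();
OrtArenaCfg* arena_cfg = nullptr;
// Arguments: max_mem (0 = default), arena_extend_strategy, initial_chunk_size_bytes,
// max_dead_bytes_per_chunk (-1 = default for each)
Ort::ThrowOnError(api.CreateArenaCfg(0, -1, -1, -1, &arena_cfg));
// Register an arena-backed allocator on the environment (env and mem_info are assumed to exist)
Ort::ThrowOnError(api.CreateAndRegisterAllocator(env, mem_info, arena_cfg));
api.ReleaseArenaCfg(arena_cfg);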
I/O Binding for Zero-Copy
Reduce memory copies with I/O binding:
import numpy as np
# Create I/O binding
io_binding = session.io_binding()
# Bind input
input_data = np.array([[1.0, 2.0]], dtype=np.float32)
io_binding.bind_cpu_input('input', input_data)
# Bind output
io_binding.bind_output('output')
# Run with binding
session.run_with_iobinding(io_binding)
outputs = io_binding.copy_outputs_to_cpu()
GPU I/O Binding
# Bind input on GPU
io_binding.bind_input(
    name='input',
    device_type='cuda',
    device_id=0,
    element_type=np.float32,
    shape=input_data.shape,
    buffer_ptr=input_ptr  # CUDA device pointer
)
# Bind output on GPU
io_binding.bind_output(
    name='output',
    device_type='cuda',
    device_id=0
)
Profiling and Analysis
Enable Profiling
session_options.enable_profiling = True
session = ort.InferenceSession("model.onnx", session_options)
# Run inference
session.run(None, {"input": input_data})
# Get profile file
profile_file = session.end_profiling()
print(f"Profile saved to: {profile_file}")
The profile file is a JSON trace that can be opened in a trace viewer such as chrome://tracing. It contains:
- Operator execution times
- Memory usage patterns
- Data transfer overhead
- Kernel launch times
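To find the most expensive operators, the trace can be aggregated per operator type. The sketch below assumes the profile is a JSON list of events in which node events carry `cat == "Node"`, a name ending in `_kernel_time`, a `dur` field in microseconds, and an `args.op_name` field; verify these field names against a profile from your own ONNX Runtime version. `summarize_profile` is a hypothetical helper, and the sample events are synthetic.

```python
import json
from collections import defaultdict


def summarize_profile(events):
    """Sum kernel durations (microseconds) per operator type, largest first."""
    totals = defaultdict(int)
    for event in events:
        # Assumed layout: node events have cat == "Node" and args["op_name"].
        if event.get("cat") != "Node":
            continue
        if not event.get("name", "").endswith("_kernel_time"):
            continue
        op_name = event.get("args", {}).get("op_name")
        if op_name is None:
            continue
        totals[op_name] += event.get("dur", 0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Synthetic trace standing in for the contents of the real profile file.
sample_json = """
[
  {"cat": "Session", "name": "model_run", "dur": 900},
  {"cat": "Node", "name": "conv1_kernel_time", "dur": 500, "args": {"op_name": "Conv"}},
  {"cat": "Node", "name": "relu1_kernel_time", "dur": 40, "args": {"op_name": "Relu"}},
  {"cat": "Node", "name": "conv2_kernel_time", "dur": 300, "args": {"op_name": "Conv"}}
]
"""
summary = summarize_profile(json.loads(sample_json))
print(summary)
```

For a real run, replace the synthetic trace with the parsed contents of the path returned by `session.end_profiling()`.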
Best Practices
1. Choose the Right Execution Provider
- Use GPU providers (CUDA, TensorRT, DirectML) for compute-intensive models
- Use CPU provider for smaller models or edge devices
- Test multiple providers to find the best fit
2. Optimize Thread Configuration
import os
# For CPU-bound workloads (os.cpu_count() reports logical cores and may return None)
num_cores = os.cpu_count() or 1
session_options.intra_op_num_threads = num_cores
session_options.inter_op_num_threads = 1
3. Use I/O Binding
- Reduces memory allocation overhead
- Enables zero-copy for GPU inference
- Best for high-throughput scenarios
4. Enable All Optimizations
# Maximum optimization
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session_options.enable_mem_pattern = True
session_options.enable_cpu_mem_arena = True
5. Warm Up the Session
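import time
# Run a few warm-up iterations (input_data doubles as the dummy input)
for _ in range(5):
    session.run(None, {"input": input_data})
# Now measure actual performance
start = time.perf_counter()
for _ in range(100):
    session.run(None, {"input": input_data})
end = time.perf_counter()
print(f"Average latency: {(end - start) / 100 * 1000:.2f} ms")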
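A single average can hide tail latency, so it can help to record each run separately. The following is a minimal sketch, assuming the workload is wrapped in a zero-argument callable; `benchmark` is a hypothetical helper, and the lambda is a stand-in for a real `session.run` call.

```python
import statistics
import time


def benchmark(run_once, warmup=5, iterations=100):
    """Time `run_once` after warm-up; return latency statistics in milliseconds."""
    for _ in range(warmup):
        run_once()
    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    latencies_ms.sort()
    # Nearest-rank style index for the 95th percentile, clamped to the last element.
    p95_index = min(len(latencies_ms) - 1, int(0.95 * len(latencies_ms)))
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": latencies_ms[p95_index],
        "runs": len(latencies_ms),
    }


# Stand-in workload; with a real session this would be
# lambda: session.run(None, {"input": input_data})
calls = []
stats = benchmark(lambda: calls.append(sum(range(1000))), warmup=5, iterations=20)
print(stats)
```

Warm-up runs are executed but excluded from the statistics, which mirrors the warm-up loop shown above.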
Issue: Slow First Inference
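Solution: Graph optimization runs when the session is created, and the first run adds one-time costs such as memory allocation and, on GPU, kernel and algorithm selection. Use warm-up iterations to absorb the first-run cost, and save optimized models to shorten session startup.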
Issue: High Memory Usage
Solution:
- Limit GPU memory with gpu_mem_limit
- Use smaller batch sizes
- Enable memory pattern optimization
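The first item can be expressed as a helper that converts a memory budget into CUDA provider options. This is a sketch, not an official recipe: `cuda_options_with_budget` is a hypothetical helper, and pairing the limit with `kSameAsRequested` (which grows the arena by the requested amount instead of rounding up to a power of two) is an assumption about what suits a tight budget.

```python
def cuda_options_with_budget(budget_gib, device_id=0):
    """Build CUDA provider options that cap the memory arena at `budget_gib` GiB."""
    if budget_gib <= 0:
        raise ValueError("budget_gib must be positive")
    return {
        "device_id": device_id,
        # gpu_mem_limit is expressed in bytes.
        "gpu_mem_limit": int(budget_gib * 1024**3),
        # Grow the arena only by what is requested (assumed preferable here).
        "arena_extend_strategy": "kSameAsRequested",
    }


options = cuda_options_with_budget(2)
print(options)
```

The resulting dictionary can be paired with the provider name as `('CUDAExecutionProvider', options)` in a providers list.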
Issue: Poor CPU Utilization
Solution:
- Adjust intra_op_num_threads and inter_op_num_threads
- Try different execution modes
- Build ONNX Runtime with OpenMP support
Advanced Configuration
Custom Execution Provider Configuration
# Advanced CUDA configuration
cuda_options = {
    'device_id': 0,
    'arena_extend_strategy': 'kSameAsRequested',
    'gpu_mem_limit': 4 * 1024 * 1024 * 1024,  # 4 GiB
    'cudnn_conv_algo_search': 'HEURISTIC',
    'do_copy_in_default_stream': True,
    'cudnn_conv_use_max_workspace': True,
}
# Provider options are passed as (name, options) tuples at session creation
session = ort.InferenceSession(
    "model.onnx",
    session_options,
    providers=[('CUDAExecutionProvider', cuda_options), 'CPUExecutionProvider'],
)
Session Configuration File
# Load configuration from file
import json
with open('session_config.json', 'r') as f:
    config = json.load(f)
session_options.intra_op_num_threads = config['intra_op_threads']
session_options.inter_op_num_threads = config['inter_op_threads']
session_options.graph_optimization_level = getattr(
    ort.GraphOptimizationLevel,
    config['optimization_level']  # e.g. "ORT_ENABLE_ALL"
)
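Because the loader above trusts whatever the file contains, a validation step before touching the session options can catch typos early. The sketch below assumes the same three keys as the snippet above. `parse_session_config` and the `DEFAULTS` table are hypothetical, and using 0 as the thread default relies on the assumption that 0 lets ONNX Runtime pick the thread count itself.

```python
import json

# Level names as listed in the Graph Optimization Levels section.
VALID_LEVELS = (
    "ORT_DISABLE_ALL",
    "ORT_ENABLE_BASIC",
    "ORT_ENABLE_EXTENDED",
    "ORT_ENABLE_ALL",
)
# Assumed defaults: 0 threads is taken to mean "let the runtime decide".
DEFAULTS = {
    "intra_op_threads": 0,
    "inter_op_threads": 0,
    "optimization_level": "ORT_ENABLE_ALL",
}


def parse_session_config(text):
    """Parse JSON text, fill in defaults, and validate the known keys."""
    config = {**DEFAULTS, **json.loads(text)}
    for key in ("intra_op_threads", "inter_op_threads"):
        value = config[key]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{key} must be a non-negative integer, got {value!r}")
    if config["optimization_level"] not in VALID_LEVELS:
        raise ValueError(f"unknown optimization_level: {config['optimization_level']!r}")
    return config


config = parse_session_config('{"intra_op_threads": 4}')
print(config)
```

The validated dictionary can then be applied to `session_options` exactly as in the loader above.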