Overview
ONNX Runtime provides flexible threading options to optimize performance on multi-core systems. This guide covers thread pool configuration, intra-op and inter-op parallelism, and best practices for concurrent execution.
Threading Architecture
ONNX Runtime supports two threading implementations:
- ORT Thread Pool: Custom thread pool implementation (default)
- OpenMP: Industry-standard parallel programming framework (opt-in at build time)
The choice is determined at build time using the --use_openmp flag.
Thread Pool Types
Intra-Op Thread Pool
Parallelism within a single operator:
import onnxruntime as ort
session_options = ort.SessionOptions()
# Set intra-op threads (parallelism within ops)
session_options.intra_op_num_threads = 4
session = ort.InferenceSession("model.onnx", session_options)
Use cases:
- Matrix multiplications
- Convolution operations
- Element-wise operations on large tensors
Inter-Op Thread Pool
Parallelism between independent operators:
# Set inter-op threads (parallelism between ops)
session_options.inter_op_num_threads = 2
Use cases:
- Models with parallel branches
- Independent operations in the graph
- Pipeline parallelism
Configuration Examples
CPU-Bound Workloads
import os
import onnxruntime as ort
# Get available CPU cores
num_cores = os.cpu_count()
session_options = ort.SessionOptions()
# Maximize intra-op parallelism
session_options.intra_op_num_threads = num_cores
# Minimize inter-op parallelism to reduce overhead
session_options.inter_op_num_threads = 1
# Use sequential execution for lower overhead
session_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
session = ort.InferenceSession("model.onnx", session_options)
Models with Parallel Branches
# Balance intra-op and inter-op parallelism
session_options.intra_op_num_threads = 2
session_options.inter_op_num_threads = 4
# Enable parallel execution
session_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL
High-Throughput Server
# Optimize for concurrent requests
session_options.intra_op_num_threads = 1 # Limit per-request threads
session_options.inter_op_num_threads = 1
# Handle concurrency at application level
# Create multiple sessions or use thread pool
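The three configurations above can be collected into named presets so that application code selects one by scenario. The following is a minimal sketch: `thread_preset` and `apply_preset` are hypothetical helpers, not part of ONNX Runtime, and a `SimpleNamespace` stands in for `ort.SessionOptions()` so the snippet runs without the library installed.

```python
import os
from types import SimpleNamespace


def thread_preset(scenario, num_cores=None):
    """Return the thread counts used in the scenarios above (hypothetical helper)."""
    cores = num_cores or os.cpu_count() or 1
    presets = {
        "cpu_bound": {"intra_op_num_threads": cores, "inter_op_num_threads": 1},
        "parallel_branches": {"intra_op_num_threads": 2, "inter_op_num_threads": 4},
        "high_throughput_server": {"intra_op_num_threads": 1, "inter_op_num_threads": 1},
    }
    return dict(presets[scenario])


def apply_preset(options, preset):
    """Copy each preset value onto an options object as an attribute."""
    for name, value in preset.items():
        setattr(options, name, value)
    return options


# Stand-in object; in real code pass ort.SessionOptions() instead.
options = apply_preset(SimpleNamespace(), thread_preset("cpu_bound", num_cores=8))
print(options.intra_op_num_threads, options.inter_op_num_threads)  # 8 1
```

The execution mode is left out of the presets on purpose; set `execution_mode` separately as shown in the examples above.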
Execution Modes
Sequential Execution
session_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
Characteristics:
- Lower scheduling overhead
- Operators execute one at a time
- Better for simple, linear graphs
- Default execution mode
Parallel Execution
session_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL
Characteristics:
- Higher parallelism between operators
- Better for complex graphs with independent paths
- Higher scheduling overhead
- Requires inter-op thread pool
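One rough way to judge whether a graph has enough independent paths to benefit from parallel execution is to count how many operators share the same dependency depth. The sketch below is an illustrative heuristic only, not how ONNX Runtime schedules operators; the graph is a plain dictionary (an assumed representation) mapping each node to the nodes it depends on.

```python
def max_parallel_width(graph):
    """Return the largest number of nodes that sit at the same dependency depth.

    graph maps each node name to the list of node names it depends on.
    Nodes at the same depth cannot depend on one another, so they could
    in principle run concurrently.
    """
    depth = {}

    def node_depth(node):
        if node not in depth:
            # A node with no dependencies has depth 0.
            depth[node] = 1 + max((node_depth(d) for d in graph[node]), default=-1)
        return depth[node]

    counts = {}
    for node in graph:
        d = node_depth(node)
        counts[d] = counts.get(d, 0) + 1
    return max(counts.values(), default=0)


linear = {"a": [], "b": ["a"], "c": ["b"]}
branched = {"in": [], "left": ["in"], "right": ["in"], "out": ["left", "right"]}
print(max_parallel_width(linear))    # 1
print(max_parallel_width(branched))  # 2
```

A width of 1 suggests sequential execution loses nothing; a larger width only indicates potential, so measure before switching modes.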
C++ API
Basic Configuration
#include <onnxruntime_cxx_api.h>
Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "threading_example");
Ort::SessionOptions session_options;
// Configure thread pools
session_options.SetIntraOpNumThreads(4);
session_options.SetInterOpNumThreads(2);
// Set execution mode
session_options.SetExecutionMode(ExecutionMode::ORT_SEQUENTIAL);
Ort::Session session(env, "model.onnx", session_options);
Custom Thread Pool
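// ONNX Runtime takes custom thread creation and join callbacks rather than a thread pool object.
// MyCreateThread, MyJoinThread, and my_thread_options are application-defined (hypothetical names).
session_options.SetCustomCreateThreadFn(MyCreateThread);
session_options.SetCustomThreadCreationOptions(&my_thread_options);
session_options.SetCustomJoinThreadFn(MyJoinThread);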
Threading Abstractions for Op Developers
ONNX Runtime provides abstractions for implementing parallel operators:
TryParallelFor
#include "core/platform/threadpool.h"
Status MyOp::Compute(OpKernelContext* context) const {
  auto* thread_pool = context->GetOperatorThreadPool();
  // Parallel loop: the callback receives a half-open range [begin, end)
  concurrency::ThreadPool::TryParallelFor(
      thread_pool,
      num_iterations,
      cost_per_iteration,
      [&](std::ptrdiff_t begin, std::ptrdiff_t end) {
        // Parallel work
        for (auto i = begin; i < end; ++i) {
          ProcessElement(i);
        }
      });
  return Status::OK();
}
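The `(begin, end)` callback contract is easier to see in a small model. The Python sketch below splits `[0, total)` into contiguous half-open ranges and hands each range to a worker. It illustrates the contract only: `partition_range` and `parallel_for_sketch` are hypothetical names, and the even split used here is an assumption, not ONNX Runtime's cost-based partitioning.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_range(total, num_blocks):
    """Split [0, total) into at most num_blocks contiguous (begin, end) ranges."""
    if total <= 0 or num_blocks <= 0:
        return []
    num_blocks = min(num_blocks, total)
    base, extra = divmod(total, num_blocks)
    ranges = []
    begin = 0
    for block in range(num_blocks):
        # The first `extra` blocks take one additional iteration each.
        end = begin + base + (1 if block < extra else 0)
        ranges.append((begin, end))
        begin = end
    return ranges


def parallel_for_sketch(total, num_blocks, fn):
    """Call fn(begin, end) once per range, using a pool of worker threads."""
    ranges = partition_range(total, num_blocks)
    if not ranges:
        return
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(lambda r: fn(*r), ranges))


squares = [0] * 10


def work(begin, end):
    for i in range(begin, end):
        squares[i] = i * i


parallel_for_sketch(len(squares), 4, work)
print(partition_range(10, 4))  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Because each range is disjoint, the workers write to separate elements and need no locking.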
TrySimpleParallelFor
Simplified version for uniform work:
concurrency::ThreadPool::TrySimpleParallelFor(
    thread_pool,
    num_iterations,
    [&](std::ptrdiff_t i) {
      ProcessElement(i);
    });
TryBatchParallelFor
Runs a per-item function, with the iterations grouped into batches:
concurrency::ThreadPool::TryBatchParallelFor(
    thread_pool,
    total,
    [&](std::ptrdiff_t i) {
      ProcessItem(i);
    },
    0  // num_batches: 0 lets the thread pool choose the batch count
);
ShouldParallelize
Check if parallelization is beneficial:
if (concurrency::ThreadPool::ShouldParallelize(thread_pool)) {
  // Use parallel implementation
  ParallelCompute();
} else {
  // Use sequential implementation
  SequentialCompute();
}
DegreeOfParallelism
Get available parallelism:
int num_threads = concurrency::ThreadPool::DegreeOfParallelism(thread_pool);
ParallelSection
Group multiple loops in a single parallel section:
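{
  // Loops issued while ps is in scope share one parallel section
  concurrency::ThreadPool::ParallelSection ps(thread_pool);
  // First parallel loop
  concurrency::ThreadPool::TryParallelFor(thread_pool, n1, cost1, work1);
  // Second parallel loop
  concurrency::ThreadPool::TryParallelFor(thread_pool, n2, cost2, work2);
}  // ps is destroyed here, ending the section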
This amortizes thread pool entry/exit costs.
OpenMP vs ORT Thread Pool
Building with OpenMP
# Build ONNX Runtime with OpenMP support
./build.sh --config Release --use_openmp
When to Use OpenMP
Advantages:
- Industry-standard parallelization
- Mature optimization
- Good for CPU-intensive ops
Considerations:
- May conflict with application-level OpenMP
- Less control over thread pool
- Build-time decision
When to Use ORT Thread Pool
Advantages:
- Full control over threading
- No conflicts with application threads
- Consistent behavior across platforms
- Runtime configuration
Use cases:
- Custom threading requirements
- Embedding in existing applications
- Fine-grained control needed
Best Practices
1. Match Thread Count to Hardware
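import os
# Physical cores are usually a better target than logical cores.
# Halving assumes 2-way SMT (hyper-threading), so this is only an approximation.
num_physical_cores = max(1, (os.cpu_count() or 1) // 2)
session_options.intra_op_num_threads = num_physical_cores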
2. Avoid Over-subscription
# Bad: Over-subscription
session_options.intra_op_num_threads = 32 # On 8-core CPU
# Good: Match available cores
session_options.intra_op_num_threads = 8
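Over-subscription can be estimated with simple arithmetic: total worker threads divided by available cores. The sketch below assumes each session owns its own intra-op pool, so thread counts add up across sessions; `oversubscription_factor` is a hypothetical helper, not an ONNX Runtime API.

```python
def oversubscription_factor(intra_op_threads, num_sessions, available_cores):
    """Return requested threads per available core (above 1.0 means over-subscribed)."""
    if available_cores <= 0:
        raise ValueError("available_cores must be positive")
    return (intra_op_threads * num_sessions) / available_cores


print(oversubscription_factor(32, 1, 8))  # 4.0 (the "bad" case above)
print(oversubscription_factor(8, 1, 8))   # 1.0 (the "good" case above)
print(oversubscription_factor(4, 4, 8))   # 2.0 (four sessions with 4 threads each)
```

The last line shows that a per-session count that looks reasonable can still over-subscribe the machine once several sessions run at the same time.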
3. Start with Sequential Mode
# Start simple
session_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
# Switch to parallel only if needed
if has_parallel_branches:
    session_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL
4. Tune for Your Workload
def find_optimal_threads(model_path, input_data):
    """Find optimal thread configuration."""
    results = {}
    for num_threads in [1, 2, 4, 8, 16]:
        session_options = ort.SessionOptions()
        session_options.intra_op_num_threads = num_threads
        session_options.inter_op_num_threads = 1
        session = ort.InferenceSession(model_path, session_options)
        # Benchmark (benchmark() is a user-supplied helper that returns latency)
        latency = benchmark(session, input_data)
        results[num_threads] = latency
    # Thread count with the lowest measured latency
    return min(results, key=results.get)
5. Set Environment Variables
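Control system-level threading. Set these before importing onnxruntime, because OpenMP and MKL runtimes typically read them once at initialization: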
import os
# Limit OpenMP threads (if OpenMP is used)
os.environ['OMP_NUM_THREADS'] = '4'
# Limit MKL threads (Intel MKL)
os.environ['MKL_NUM_THREADS'] = '4'
# Disable nested parallelism
os.environ['OMP_NESTED'] = 'FALSE'
6. Concurrent Inference
For concurrent requests, limit per-session threads:
import concurrent.futures
# Create session with limited threads
session_options.intra_op_num_threads = 1
session_options.inter_op_num_threads = 1
session = ort.InferenceSession("model.onnx", session_options)
# Handle concurrency at application level
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(session.run, None, {"input": data})
               for data in batch]
    results = [f.result() for f in futures]
Platform-Specific Considerations
Linux
# Use taskset to pin to specific cores
import subprocess
subprocess.run(["taskset", "-c", "0-3", "python", "inference.py"])
Windows
# Set processor affinity
import os
import psutil
process = psutil.Process(os.getpid())
process.cpu_affinity([0, 1, 2, 3]) # Pin to first 4 cores
macOS
# No direct affinity control, use thread count
session_options.intra_op_num_threads = os.cpu_count()
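When a process has been pinned to a subset of cores, `os.cpu_count()` still reports every CPU in the machine. A portable way to pick a thread count is to ask for the affinity mask where the platform exposes it and fall back otherwise. This is a minimal sketch; `usable_cpu_count` is a hypothetical helper name.

```python
import os


def usable_cpu_count():
    """Return how many CPUs this process may run on.

    os.sched_getaffinity exists only on some Unix platforms (such as Linux);
    elsewhere the total CPU count is the best available answer.
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


num_threads = usable_cpu_count()
print(num_threads)
```

The result can then be assigned to `session_options.intra_op_num_threads` in place of `os.cpu_count()`.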
Troubleshooting
Poor CPU Utilization
Symptoms: Low CPU usage during inference
Solutions:
- Increase intra-op threads
- Enable parallel execution mode
- Check for I/O bottlenecks
session_options.intra_op_num_threads = os.cpu_count()
session_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL
Thread Contention
Symptoms: Performance degrades with more threads
Solutions:
- Reduce thread count
- Use sequential execution
- Profile for lock contention
session_options.intra_op_num_threads = 4 # Reduce from higher value
session_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
Inconsistent Latency
Symptoms: High latency variance
Solutions:
- Fix thread count (don’t use default)
- Disable dynamic threading
- Pin to physical cores
os.environ['OMP_DYNAMIC'] = 'FALSE'
session_options.intra_op_num_threads = 4 # Fixed value
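Latency variance is easier to track with a few summary numbers than by eye. The sketch below reports the median, the 99th percentile (nearest-rank method), and the population standard deviation for a list of latency samples. `latency_summary` is a hypothetical helper, and the sample values are made up for illustration.

```python
import math
import statistics


def latency_summary(samples_ms):
    """Summarize latency samples: p50, p99 (nearest rank), and standard deviation."""
    if not samples_ms:
        raise ValueError("samples_ms must not be empty")
    ordered = sorted(samples_ms)

    def percentile(p):
        # Nearest-rank percentile: the smallest value covering p percent of samples.
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    return {
        "p50": percentile(50),
        "p99": percentile(99),
        "stdev": statistics.pstdev(ordered),
    }


# Illustrative samples in milliseconds with one outlier.
summary = latency_summary([10, 11, 10, 12, 50, 10, 11, 10, 11, 10])
print(summary["p50"], summary["p99"])  # 10 50
```

A large gap between p50 and p99 points at occasional stalls, which is the pattern that fixed thread counts and core pinning aim to reduce.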
Important Guidelines for Developers
Do not use #ifdef _OPENMP or #pragma omp directly in operator code. Always use the threading abstractions provided in:
- threadpool.h - ThreadPool class
- thread_utils.h - Threading utility functions
These abstractions handle both OpenMP and non-OpenMP builds automatically.
Example: Correct Approach
// Good: Use abstractions
#include "core/platform/threadpool.h"
concurrency::ThreadPool::TryParallelFor(thread_pool, n, cost,
    [&](std::ptrdiff_t begin, std::ptrdiff_t end) {
      for (auto i = begin; i < end; ++i) {
        Process(i);
      }
    });
Example: Incorrect Approach
// Bad: Direct OpenMP usage
#ifdef _OPENMP
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
  Process(i);
}
#endif
Example 1: Latency-Optimized
# Minimize latency for single request
session_options.intra_op_num_threads = os.cpu_count()
session_options.inter_op_num_threads = 1
session_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
Example 2: Throughput-Optimized
# Maximize throughput for batch processing
session_options.intra_op_num_threads = 4
session_options.inter_op_num_threads = 2
session_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL
Example 3: Server Deployment
# Balance multiple concurrent requests
session_options.intra_op_num_threads = 2
session_options.inter_op_num_threads = 1
# Use application-level concurrency control