Threading and Parallelism

Overview

ONNX Runtime provides flexible threading options to optimize performance on multi-core systems. This guide covers thread pool configuration, intra-op and inter-op parallelism, and best practices for concurrent execution.

Threading Architecture

ONNX Runtime supports two threading implementations:

ORT Thread Pool: Custom thread pool implementation (default)
OpenMP: Industry-standard parallel programming framework (opt-in at build time)

The choice is determined at build time using the --use_openmp flag.

Thread Pool Types

Intra-Op Thread Pool

Parallelism within a single operator:

import onnxruntime as ort

session_options = ort.SessionOptions()

# Set intra-op threads (parallelism within ops)
session_options.intra_op_num_threads = 4

session = ort.InferenceSession("model.onnx", session_options)

Use cases:

Matrix multiplications
Convolution operations
Element-wise operations on large tensors

Inter-Op Thread Pool

Parallelism between independent operators:

# Set inter-op threads (parallelism between ops)
session_options.inter_op_num_threads = 2

Use cases:

Models with parallel branches
Independent operations in the graph
Pipeline parallelism
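
To judge whether a model has enough independent operators to benefit from inter-op threads, it can help to count how many nodes could run at the same time. The following is a minimal sketch on a toy dependency map, not ONNX Runtime's scheduler. The function name `max_parallel_width` and the `deps` format are hypothetical, chosen only for illustration, and the graph is assumed to be acyclic.

```python
from collections import defaultdict


def max_parallel_width(deps):
    """Return the largest number of nodes that share a dependency level.

    deps maps each node name to the list of nodes it depends on.
    A node's level is 1 + the deepest level among its dependencies,
    so two nodes on the same level never depend on each other.
    """
    levels = {}

    def level_of(node):
        if node not in levels:
            parents = deps.get(node, [])
            levels[node] = 1 + max((level_of(p) for p in parents), default=-1)
        return levels[node]

    width = defaultdict(int)
    for node in deps:
        width[level_of(node)] += 1
    return max(width.values(), default=0)


# Toy graph: one input feeds two independent branches that are then merged.
toy_graph = {
    "input": [],
    "branch_a": ["input"],
    "branch_b": ["input"],
    "merge": ["branch_a", "branch_b"],
}
print(max_parallel_width(toy_graph))
```

A width of 1 suggests a linear graph where extra inter-op threads are unlikely to help. A larger width only indicates potential parallelism, so it is still worth measuring.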

Configuration Examples

CPU-Bound Workloads

import os
import onnxruntime as ort

# Get available CPU cores
num_cores = os.cpu_count()

session_options = ort.SessionOptions()

# Maximize intra-op parallelism
session_options.intra_op_num_threads = num_cores

# Minimize inter-op parallelism to reduce overhead
session_options.inter_op_num_threads = 1

# Use sequential execution for lower overhead
session_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

session = ort.InferenceSession("model.onnx", session_options)
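
One caveat with `os.cpu_count()` is that it reports every logical CPU on the machine, even when the process is restricted to a subset (for example by `taskset` or a cpuset). The sketch below, using a hypothetical helper named `usable_cpu_count`, prefers the affinity mask where the platform exposes it. It does not account for CPU time quotas, which are a separate mechanism.

```python
import os


def usable_cpu_count():
    """Return the number of CPUs this process may run on (at least 1)."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: respects taskset / cpuset restrictions on this process.
        return max(1, len(os.sched_getaffinity(0)))
    # Other platforms: fall back to the logical CPU count.
    return os.cpu_count() or 1


print(usable_cpu_count())
```

The returned value can be used in place of `num_cores` when setting `intra_op_num_threads`.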

Models with Parallel Branches

# Balance intra-op and inter-op parallelism
session_options.intra_op_num_threads = 2
session_options.inter_op_num_threads = 4

# Enable parallel execution
session_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL

High-Throughput Server

# Optimize for concurrent requests
session_options.intra_op_num_threads = 1  # Limit per-request threads
session_options.inter_op_num_threads = 1

# Handle concurrency at the application level, for example by calling
# session.run() from an application thread pool or by creating multiple sessions
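
When several requests run at once, the per-request thread count and the number of concurrent workers together determine how many threads compete for cores. A minimal sketch of that arithmetic follows. `intra_op_threads_per_worker` is a hypothetical helper, and an even split is an assumption rather than a rule.

```python
def intra_op_threads_per_worker(total_cores, num_workers):
    """Split a core budget evenly across concurrent workers (never below 1)."""
    if total_cores < 1 or num_workers < 1:
        raise ValueError("total_cores and num_workers must be positive")
    return max(1, total_cores // num_workers)


# 16 cores shared by 8 concurrent requests: 2 threads each.
print(intra_op_threads_per_worker(16, 8))
# More workers than cores: clamp to 1 thread each.
print(intra_op_threads_per_worker(4, 8))
```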

Execution Modes

Sequential Execution

session_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

Characteristics:

Lower scheduling overhead
Operators execute one at a time
Better for simple, linear graphs
Default execution mode

Parallel Execution

session_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL

Characteristics:

Higher parallelism between operators
Better for complex graphs with independent paths
Higher scheduling overhead
Uses the inter-op thread pool (inter_op_num_threads is ignored in sequential mode)

C++ API

Basic Configuration

#include <onnxruntime_cxx_api.h>

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "threading_example");
Ort::SessionOptions session_options;

// Configure thread pools
session_options.SetIntraOpNumThreads(4);
session_options.SetInterOpNumThreads(2);

// Set execution mode
session_options.SetExecutionMode(ExecutionMode::ORT_SEQUENTIAL);

Ort::Session session(env, "model.onnx", session_options);

Custom Thread Pool

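// Supply custom thread creation and join callbacks. CreateThreadCustomized and
// JoinThreadCustomized are user-defined functions matching OrtCustomCreateThreadFn
// and OrtCustomJoinThreadFn; custom_thread_creation_options is user-defined data.
session_options.SetCustomCreateThreadFn(CreateThreadCustomized);
session_options.SetCustomThreadCreationOptions(&custom_thread_creation_options);
session_options.SetCustomJoinThreadFn(JoinThreadCustomized);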

Threading Abstractions for Op Developers

ONNX Runtime provides abstractions for implementing parallel operators:

TryParallelFor

#include "core/platform/threadpool.h"

Status MyOp::Compute(OpKernelContext* context) const {
    // May be null; the static helpers then run the loop on the calling thread
    auto* thread_pool = context->GetOperatorThreadPool();

    // Parallel loop: each callback invocation handles the range [begin, end)
    concurrency::ThreadPool::TryParallelFor(
        thread_pool,
        num_iterations,
        cost_per_iteration,
        [&](std::ptrdiff_t begin, std::ptrdiff_t end) {
            for (auto i = begin; i < end; ++i) {
                ProcessElement(i);
            }
        }
    );
    return Status::OK();
}
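
The callback contract above (each invocation receives a half-open range `[begin, end)` and ranges do not overlap) can be illustrated with a small Python analogue. This is only a sketch of the idea using an even split. ONNX Runtime's own partitioning also takes the per-iteration cost estimate as an input, and `split_range` and `parallel_for` are hypothetical names, not part of any ONNX Runtime API.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(total, num_blocks):
    """Split [0, total) into at most num_blocks contiguous (begin, end) pairs."""
    num_blocks = max(1, min(num_blocks, total))
    base, extra = divmod(total, num_blocks)
    blocks, begin = [], 0
    for block in range(num_blocks):
        # The first `extra` blocks take one additional item each.
        end = begin + base + (1 if block < extra else 0)
        if end > begin:
            blocks.append((begin, end))
        begin = end
    return blocks


def parallel_for(total, num_blocks, fn):
    """Call fn(begin, end) once per block, each block on a pool thread."""
    blocks = split_range(total, num_blocks)
    if not blocks:
        return
    with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
        futures = [pool.submit(fn, begin, end) for begin, end in blocks]
        for future in futures:
            future.result()  # re-raise any exception from a worker


squares = [0] * 10


def fill_squares(begin, end):
    # Each call owns a disjoint slice, so no locking is needed.
    for i in range(begin, end):
        squares[i] = i * i


parallel_for(len(squares), 3, fill_squares)
print(squares)
```

Because every index belongs to exactly one range, the callback can write to its own slice of the output without synchronization, which is the same property the C++ lambda relies on.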

TrySimpleParallelFor

Simplified version for uniform work:

concurrency::ThreadPool::TrySimpleParallelFor(
    thread_pool,
    num_iterations,
    [&](std::ptrdiff_t i) {
        ProcessElement(i);
    }
);

TryBatchParallelFor

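For batched operations. The callback is invoked once per item index, and the indices are grouped into batches internally:

concurrency::ThreadPool::TryBatchParallelFor(
    thread_pool,
    num_items,
    [&](std::ptrdiff_t i) {
        ProcessItem(i);
    },
    0  // num_batches: 0 lets the thread pool choose based on its size
);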

ShouldParallelize

Check if parallelization is beneficial:

if (concurrency::ThreadPool::ShouldParallelize(thread_pool)) {
    // Use parallel implementation
    ParallelCompute();
} else {
    // Use sequential implementation
    SequentialCompute();
}

DegreeOfParallelism

Get available parallelism:

int num_threads = concurrency::ThreadPool::DegreeOfParallelism(thread_pool);

ParallelSection

Group multiple loops in a single parallel section:

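{
    // RAII: loops issued while ps is alive share one parallel section
    concurrency::ThreadPool::ParallelSection ps(thread_pool);

    // First parallel loop
    concurrency::ThreadPool::TryParallelFor(thread_pool, n1, cost1, work1);

    // Second parallel loop
    concurrency::ThreadPool::TryParallelFor(thread_pool, n2, cost2, work2);
}  // the section ends when ps goes out of scope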

This amortizes thread pool entry/exit costs.

OpenMP vs ORT Thread Pool

Building with OpenMP

# Build ONNX Runtime with OpenMP support
./build.sh --config Release --use_openmp

When to Use OpenMP

Advantages:

Industry-standard parallelization
Mature optimization
Good for CPU-intensive ops

Considerations:

May conflict with application-level OpenMP
Less control over thread pool
Build-time decision

When to Use ORT Thread Pool

Advantages:

Full control over threading
No conflicts with application threads
Consistent behavior across platforms
Runtime configuration

Use cases:

Custom threading requirements
Embedding in existing applications
Fine-grained control needed

Best Practices

1. Match Thread Count to Hardware

import os

# Physical cores are usually a better target than logical cores.
# Rough estimate: assumes two hardware threads per core, never below 1.
num_physical_cores = max(1, (os.cpu_count() or 2) // 2)

session_options.intra_op_num_threads = num_physical_cores
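
Halving the logical CPU count is only an approximation. On Linux, a closer estimate can often be read from `/proc/cpuinfo` by counting distinct (physical id, core id) pairs. The sketch below assumes those fields are present, which is typical on x86 but not guaranteed elsewhere, and falls back to the halving rule otherwise. `physical_core_count` is a hypothetical helper name.

```python
import os


def physical_core_count(cpuinfo_text=None):
    """Estimate physical cores from /proc/cpuinfo-style text.

    Falls back to half the logical CPU count when the text has no
    'physical id' / 'core id' fields (or the file is unavailable).
    """
    if cpuinfo_text is None:
        try:
            with open("/proc/cpuinfo", encoding="utf-8") as handle:
                cpuinfo_text = handle.read()
        except OSError:
            cpuinfo_text = ""

    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "physical id":
            physical_id = value
        elif key == "core id" and physical_id is not None:
            cores.add((physical_id, value))
    if cores:
        return len(cores)
    # Assumption: two hardware threads per core.
    return max(1, (os.cpu_count() or 2) // 2)


# Four logical CPUs that map onto two physical cores.
sample = """processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 1

processor : 2
physical id : 0
core id : 0

processor : 3
physical id : 0
core id : 1
"""
print(physical_core_count(sample))
```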

2. Avoid Over-subscription

# Bad: Over-subscription
session_options.intra_op_num_threads = 32  # On 8-core CPU

# Good: Match available cores
session_options.intra_op_num_threads = 8

3. Start with Sequential Mode

# Start simple
session_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

# Switch to parallel only if needed
if has_parallel_branches:
    session_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL

4. Tune for Your Workload

def find_optimal_threads(model_path, input_data):
    """Find optimal thread configuration."""
    results = {}
    
    for num_threads in [1, 2, 4, 8, 16]:
        session_options = ort.SessionOptions()
        session_options.intra_op_num_threads = num_threads
        session_options.inter_op_num_threads = 1
        
        session = ort.InferenceSession(model_path, session_options)
        
        # Benchmark (benchmark() is a user-supplied helper that returns a latency)
        latency = benchmark(session, input_data)
        results[num_threads] = latency
        
    return min(results, key=results.get)
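
The function above calls a `benchmark` helper that is not defined in this guide. One possible sketch is shown here, under the assumptions that `input_data` is a dictionary mapping input names to arrays and that the median of a few timed runs is an acceptable latency measure. The `FakeSession` class is a stand-in so that the example runs without a model file.

```python
import statistics
import time


def benchmark(session, input_data, warmup=3, runs=20):
    """Return the median latency in milliseconds of session.run()."""
    for _ in range(warmup):
        session.run(None, input_data)  # warm-up runs are not timed

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        session.run(None, input_data)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


class FakeSession:
    """Stand-in with the same run() shape as an inference session."""

    def __init__(self):
        self.calls = 0

    def run(self, output_names, input_feed):
        self.calls += 1
        return []


fake = FakeSession()
latency_ms = benchmark(fake, {"input": [1.0, 2.0]})
print(f"median latency: {latency_ms:.4f} ms over {fake.calls} calls")
```

Warm-up runs matter because the first calls typically include one-time costs such as memory allocation, which would otherwise skew the comparison between thread counts.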

5. Set Environment Variables

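Control system-level threading. Set these variables before importing onnxruntime, because the underlying runtimes typically read them once at initialization:

import os

# Limit OpenMP threads (only relevant if OpenMP is used)
os.environ['OMP_NUM_THREADS'] = '4'

# Limit MKL threads (Intel MKL)
os.environ['MKL_NUM_THREADS'] = '4'

# Disable nested parallelism
os.environ['OMP_NESTED'] = 'FALSE'

import onnxruntime as ort  # import after the environment is configured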

6. Concurrent Inference

For concurrent requests, limit per-session threads:

import concurrent.futures

# Create session with limited threads
session_options.intra_op_num_threads = 1
session_options.inter_op_num_threads = 1
session = ort.InferenceSession("model.onnx", session_options)

# Handle concurrency at application level
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(session.run, None, {"input": data}) 
               for data in batch]
    results = [f.result() for f in futures]

Platform-Specific Considerations

Linux

# Use taskset to pin to specific cores
import subprocess
subprocess.run(["taskset", "-c", "0-3", "python", "inference.py"])
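
As an alternative to launching a new process through `taskset`, Python on Linux can change the affinity of the current process directly with `os.sched_setaffinity`. The following is a minimal sketch. `pin_to_cpus` is a hypothetical helper, and it only narrows the set of CPUs the process is already allowed to use. Threads generally inherit the mask of the thread that creates them, so the call would normally come before the session is created.

```python
import os


def pin_to_cpus(requested):
    """Restrict this process to the requested CPUs it is already allowed to use.

    Returns the resulting CPU set, or None where the call is unavailable.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # e.g. macOS and Windows
    allowed = os.sched_getaffinity(0)
    target = allowed & set(requested)
    if not target:
        # None of the requested CPUs are available; leave affinity unchanged.
        return allowed
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    # Pin to the first four CPUs where they are available.
    print(pin_to_cpus(range(4)))
```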

Windows

# Set processor affinity
import os
import psutil

process = psutil.Process(os.getpid())
process.cpu_affinity([0, 1, 2, 3])  # Pin to first 4 cores

macOS

# No direct affinity control, use thread count
session_options.intra_op_num_threads = os.cpu_count()

Troubleshooting

Poor CPU Utilization

Symptoms: Low CPU usage during inference

Solutions:

Increase intra-op threads
Enable parallel execution mode
Check for I/O bottlenecks

session_options.intra_op_num_threads = os.cpu_count()
session_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL

Thread Contention

Symptoms: Performance degrades with more threads

Solutions:

Reduce thread count
Use sequential execution
Profile for lock contention

session_options.intra_op_num_threads = 4  # Reduce from higher value
session_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
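
When measurements show latency flattening or rising as threads are added, a simple way to pick a value is to take the smallest thread count whose latency is close to the best one observed. The sketch below assumes a dictionary of thread count to latency, and a 5% tolerance chosen arbitrarily for illustration. `smallest_sufficient_threads` is a hypothetical helper.

```python
def smallest_sufficient_threads(latency_by_threads, tolerance=0.05):
    """Return the fewest threads whose latency is within tolerance of the best."""
    if not latency_by_threads:
        raise ValueError("latency_by_threads must not be empty")
    best = min(latency_by_threads.values())
    limit = best * (1.0 + tolerance)
    return min(n for n, latency in latency_by_threads.items() if latency <= limit)


# Illustrative numbers only (milliseconds), not measurements.
measured = {1: 40.0, 2: 22.0, 4: 12.5, 8: 12.1, 16: 15.0}
print(smallest_sufficient_threads(measured))
```

With these made-up numbers the function returns 4: eight threads are marginally faster, but four are within the tolerance and leave more cores free for other work.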

Inconsistent Performance

Symptoms: High latency variance

Solutions:

Fix thread count (don’t use default)
Disable dynamic threading
Pin to physical cores

# Only affects OpenMP-based builds; set before importing onnxruntime
os.environ['OMP_DYNAMIC'] = 'FALSE'
session_options.intra_op_num_threads = 4  # Fixed value
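
To check whether a change actually reduces variance, it helps to quantify it first. The sketch below summarizes a list of latency samples with the median, a nearest-rank 99th percentile, and the coefficient of variation (standard deviation divided by mean). `latency_summary` is a hypothetical helper, and the choice of metrics is an assumption rather than an ONNX Runtime convention.

```python
import statistics


def latency_summary(samples_ms):
    """Summarize latency samples: median, p99, and coefficient of variation."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    ordered = sorted(samples_ms)
    # Nearest-rank p99 using integer arithmetic: rank = ceil(0.99 * n).
    rank = max(1, -(-99 * len(ordered) // 100))
    mean = statistics.fmean(ordered)
    return {
        "p50": statistics.median(ordered),
        "p99": ordered[rank - 1],
        "cv": statistics.stdev(ordered) / mean if mean else 0.0,
    }


# Illustrative samples: nine steady runs and one slow outlier.
samples = [10.0] * 9 + [30.0]
print(latency_summary(samples))
```

A large gap between the median and the 99th percentile, or a high coefficient of variation, indicates that a few slow runs dominate the tail.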

Important Guidelines for Developers

Do not use #ifdef _OPENMP or #pragma omp directly in operator code. Always use the threading abstractions provided in:

threadpool.h - ThreadPool class
thread_utils.h - Threading utility functions

These abstractions handle both OpenMP and non-OpenMP builds automatically.

Example: Correct Approach

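// Good: Use abstractions
#include "core/platform/threadpool.h"

concurrency::ThreadPool::TryParallelFor(thread_pool, n, cost,
    [&](std::ptrdiff_t begin, std::ptrdiff_t end) {
        // The callback receives a range, not a single index
        for (auto i = begin; i < end; ++i) {
            Process(i);
        }
    });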

Example: Incorrect Approach

// Bad: Direct OpenMP usage
#ifdef _OPENMP
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    Process(i);
}
#endif

Performance Tuning Examples

Example 1: Latency-Optimized

# Minimize latency for single request
session_options.intra_op_num_threads = os.cpu_count()
session_options.inter_op_num_threads = 1
session_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

Example 2: Throughput-Optimized

# Maximize throughput for batch processing
session_options.intra_op_num_threads = 4
session_options.inter_op_num_threads = 2
session_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL

Example 3: Server Deployment

# Balance multiple concurrent requests
session_options.intra_op_num_threads = 2
session_options.inter_op_num_threads = 1

# Use application-level concurrency control