Execution Providers - ONNX Runtime

Execution Providers (EPs) are the interfaces that enable ONNX Runtime to execute models on different hardware platforms. They provide hardware-specific optimizations and acceleration capabilities.

What are Execution Providers?

Execution Providers abstract the hardware-specific implementation details, allowing ONNX Runtime to:

Accelerate inference using specialized hardware (GPUs, NPUs, etc.)
Optimize operators for specific hardware architectures
Manage memory efficiently on target devices
Handle data transfer between different memory spaces

Think of Execution Providers as “backends” or “device drivers” for ONNX Runtime, similar to how TensorFlow has device placements or PyTorch has device types.

Architecture Overview

How EPs Work

Registration

Execution providers are registered with the session during initialization

Capability Query

Each EP reports which nodes/subgraphs it can execute via GetCapability()

Graph Partitioning

ONNX Runtime partitions the graph across available EPs based on their capabilities

Kernel Execution

Each node is executed by its assigned EP using hardware-specific kernels
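
The four steps above can be illustrated with a toy model in plain Python. This is a deliberately simplified sketch, not ONNX Runtime's partitioner: `ToyProvider`, `supported_ops`, and `partition` are hypothetical names, and real EPs claim whole subgraphs rather than single nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyProvider:
    """Hypothetical stand-in for an EP: a name plus the op types it supports."""
    name: str
    supported_ops: frozenset

    def get_capability(self, nodes):
        # Mirrors the idea of GetCapability(): report which nodes can be claimed.
        return [i for i, op in enumerate(nodes) if op in self.supported_ops]


def partition(nodes, providers):
    """Assign each node to the first provider (in priority order) that claims it."""
    assignment = {}
    for provider in providers:
        for index in provider.get_capability(nodes):
            # Earlier providers win; later ones only pick up unclaimed nodes.
            assignment.setdefault(index, provider.name)
    unclaimed = [nodes[i] for i in range(len(nodes)) if i not in assignment]
    if unclaimed:
        raise ValueError(f"no provider supports: {unclaimed}")
    return [assignment[i] for i in range(len(nodes))]


graph = ["Conv", "Relu", "NonMaxSuppression", "Softmax"]
gpu = ToyProvider("ToyGpu", frozenset({"Conv", "Relu", "Softmax"}))
cpu = ToyProvider("ToyCpu", frozenset(graph))
print(partition(graph, [gpu, cpu]))
# ['ToyGpu', 'ToyGpu', 'ToyCpu', 'ToyGpu']
```

In this toy graph the one unsupported op forces a hop to the CPU provider and back, which is the kind of split that registration order and operator coverage determine in practice.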

Available Execution Providers

CPUExecutionProvider

The default execution provider, always available:

Platforms

Windows, Linux, macOS
x86_64, ARM64, ARM32
WebAssembly

Features

Comprehensive operator coverage
SIMD optimizations (SSE, AVX, NEON)
Multi-threading support
Reference implementation

import onnxruntime as ort

# CPU is used by default
session = ort.InferenceSession("model.onnx")

# Explicit configuration
sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = 4
session = ort.InferenceSession(
    "model.onnx",
    sess_options,
    providers=['CPUExecutionProvider']
)

CUDAExecutionProvider

NVIDIA GPU acceleration using CUDA. The Python example below is followed by the equivalent C++ setup:

import onnxruntime as ort

# Use CUDA with default settings
session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

# Configure CUDA options
cuda_options = {
    'device_id': 0,
    'arena_extend_strategy': 'kNextPowerOfTwo',
    'gpu_mem_limit': 2 * 1024 * 1024 * 1024,  # 2GB
    'cudnn_conv_algo_search': 'EXHAUSTIVE',
    'do_copy_in_default_stream': True,
}

session = ort.InferenceSession(
    "model.onnx",
    providers=[('CUDAExecutionProvider', cuda_options),
               'CPUExecutionProvider']
)

#include <onnxruntime_cxx_api.h>

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "test");
Ort::SessionOptions session_options;

// Append CUDA execution provider
OrtCUDAProviderOptions cuda_options;
cuda_options.device_id = 0;
cuda_options.arena_extend_strategy = 0;  // 0 = kNextPowerOfTwo, 1 = kSameAsRequested
cuda_options.gpu_mem_limit = 2ULL * 1024 * 1024 * 1024;

session_options.AppendExecutionProvider_CUDA(cuda_options);

// ORT_TSTR yields a wide string on Windows and a narrow string elsewhere
Ort::Session session(env, ORT_TSTR("model.onnx"), session_options);

CUDA Provider Options

| Option | Description | Default |
| --- | --- | --- |
| `device_id` | GPU device ID | 0 |
| `gpu_mem_limit` | Maximum GPU memory usage (bytes) | SIZE_MAX (no limit) |
| `arena_extend_strategy` | Memory arena growth strategy | `kNextPowerOfTwo` |
| `cudnn_conv_algo_search` | cuDNN convolution algorithm search | `EXHAUSTIVE` |
| `do_copy_in_default_stream` | Use default CUDA stream for copies | True |
| `cudnn_conv_use_max_workspace` | Use maximum workspace for cuDNN | True |
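
Because `gpu_mem_limit` is given in bytes, it is easy to be off by a factor of 1024. The helper below is a minimal sketch that builds an options dictionary from a limit in GiB; `make_cuda_options` and its parameters are hypothetical names, and only option keys from the table above are used.

```python
GIB = 1024 ** 3  # bytes per gibibyte


def make_cuda_options(device_id=0, mem_limit_gib=None, same_as_requested=False):
    """Build a CUDA provider options dict (hypothetical helper, not an ORT API)."""
    if device_id < 0:
        raise ValueError("device_id must be non-negative")
    options = {"device_id": device_id}
    if mem_limit_gib is not None:
        if mem_limit_gib <= 0:
            raise ValueError("mem_limit_gib must be positive")
        # Convert GiB to the byte count that gpu_mem_limit expects.
        options["gpu_mem_limit"] = int(mem_limit_gib * GIB)
    options["arena_extend_strategy"] = (
        "kSameAsRequested" if same_as_requested else "kNextPowerOfTwo"
    )
    return options


print(make_cuda_options(device_id=0, mem_limit_gib=2))
```

The result can be passed as the second element of the `('CUDAExecutionProvider', options)` tuple shown earlier. Leaving `mem_limit_gib` unset omits the key, so the provider's own default applies.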

TensorRTExecutionProvider

Optimized inference using NVIDIA TensorRT. Note that the provider is registered under the name TensorrtExecutionProvider:

import onnxruntime as ort

trt_options = {
    'device_id': 0,
    'trt_max_workspace_size': 2 * 1024 * 1024 * 1024,  # 2GB
    'trt_fp16_enable': True,  # Enable FP16 precision
    'trt_int8_enable': False,
    'trt_engine_cache_enable': True,
    'trt_engine_cache_path': './trt_cache',
}

session = ort.InferenceSession(
    "model.onnx",
    providers=[('TensorrtExecutionProvider', trt_options),
               'CUDAExecutionProvider',
               'CPUExecutionProvider']
)

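TensorRT builds optimized engines at runtime, so session creation or the first inference run is slower while engines are built. With trt_engine_cache_enable set, later sessions can reuse the cached engines instead of rebuilding them.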
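
A small helper can keep the cache settings in one place. The sketch below is an assumption-laden convenience, not part of ONNX Runtime: `trt_options_with_cache` is a hypothetical name, and creating the directory up front is a defensive choice rather than a documented requirement.

```python
from pathlib import Path


def trt_options_with_cache(cache_dir, fp16=True):
    """Return TensorRT provider options with engine caching enabled."""
    cache_path = Path(cache_dir)
    # Create the directory (and parents) so the cache location always exists.
    cache_path.mkdir(parents=True, exist_ok=True)
    return {
        "trt_fp16_enable": fp16,
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": str(cache_path),
    }


options = trt_options_with_cache("./trt_cache")
print(options["trt_engine_cache_path"])
```

Cached engines are tied to the environment that built them, so it is safer to use a separate cache directory per model and per deployment target.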

DirectMLExecutionProvider

Hardware acceleration on Windows using DirectML. The provider is registered under the name DmlExecutionProvider:

import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=['DmlExecutionProvider', 'CPUExecutionProvider']
)

Advantages

Works with any DirectX 12 GPU
AMD, Intel, NVIDIA support
Built into Windows

Use Cases

Windows client applications
Cross-vendor GPU support
Integrated graphics

CoreMLExecutionProvider

Apple Silicon and iOS acceleration:

import onnxruntime as ort

coreml_options = {
    'MLComputeUnits': 'ALL',  # ALL, CPUOnly, CPUAndGPU, or CPUAndNeuralEngine
}

session = ort.InferenceSession(
    "model.onnx",
    providers=[('CoreMLExecutionProvider', coreml_options),
               'CPUExecutionProvider']
)

Additional Execution Providers

OpenVINO (Intel)

Intel CPU, GPU, VPU, and FPGA acceleration:

openvino_options = {
    'device_type': 'CPU_FP32',  # CPU_FP32, GPU_FP32, GPU_FP16, etc.
    'num_of_threads': 8,
}
session = ort.InferenceSession(
    "model.onnx",
    providers=[('OpenVINOExecutionProvider', openvino_options)]
)

NNAPI (Android)

Android Neural Networks API, available in ONNX Runtime builds that include NNAPI support:

session = ort.InferenceSession(
    "model.onnx",
    providers=['NnapiExecutionProvider', 'CPUExecutionProvider']
)

ACL (ARM)

ARM Compute Library for ARM CPUs:

session = ort.InferenceSession(
    "model.onnx",
    providers=['ACLExecutionProvider', 'CPUExecutionProvider']
)

ROCm (AMD)

AMD GPU acceleration:

rocm_options = {
    'device_id': 0,
    'gpu_mem_limit': 2 * 1024 * 1024 * 1024,
}
session = ort.InferenceSession(
    "model.onnx",
    providers=[('ROCMExecutionProvider', rocm_options)]
)

EP Selection and Fallback

Provider Priority

Execution providers are tried in the order specified:

# TensorRT tried first, then CUDA, then CPU
session = ort.InferenceSession(
    "model.onnx",
    providers=[
        'TensorrtExecutionProvider',
        'CUDAExecutionProvider',
        'CPUExecutionProvider'
    ]
)

Nodes are assigned when the session is created: any node that a provider does not claim is offered to the next provider in the list. CPU is typically the last fallback.
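
When the same code runs on machines with different hardware, the priority list can be filtered against what is actually installed. The function below is a minimal sketch; `select_providers` is a hypothetical helper, and `available` is expected to be the list returned by `ort.get_available_providers()`.

```python
def select_providers(preferred, available):
    """Keep available providers in preferred order and put CPU last.

    Entries may be plain names or (name, options) tuples. Any CPU entry in
    `preferred` is replaced by a plain 'CPUExecutionProvider' at the end.
    """
    available = set(available)
    chosen = []
    for entry in preferred:
        name = entry[0] if isinstance(entry, tuple) else entry
        if name == "CPUExecutionProvider":
            continue  # appended once at the end as the final fallback
        if name in available:
            chosen.append(entry)
    chosen.append("CPUExecutionProvider")
    return chosen


wanted = [
    ("TensorrtExecutionProvider", {"trt_fp16_enable": True}),
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]
# Simulated result of ort.get_available_providers() on a CUDA-only machine
print(select_providers(wanted, ["CUDAExecutionProvider", "CPUExecutionProvider"]))
# ['CUDAExecutionProvider', 'CPUExecutionProvider']
```

The returned list can be passed directly as the `providers` argument of `ort.InferenceSession`.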

Checking Active Providers

import onnxruntime as ort

# Check available providers
print("Available providers:", ort.get_available_providers())

# Check session providers
session = ort.InferenceSession("model.onnx")
print("Session providers:", session.get_providers())
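
Comparing the requested list with the list reported by the session is a quick way to notice that a provider was not registered. This is a sketch with a hypothetical helper name, `missing_providers`. It only compares names, so it shows which providers are missing from the session, not how many nodes each registered provider was assigned.

```python
def missing_providers(requested, active):
    """Return requested provider names that are absent from the active list."""
    active = set(active)
    names = [entry[0] if isinstance(entry, tuple) else entry for entry in requested]
    return [name for name in names if name not in active]


requested = [("CUDAExecutionProvider", {"device_id": 0}), "CPUExecutionProvider"]
# Simulated result of session.get_providers() after CUDA failed to load
print(missing_providers(requested, ["CPUExecutionProvider"]))
# ['CUDAExecutionProvider']
```

In an application, `active` would be `session.get_providers()`, and a non-empty result is a reasonable trigger for a warning or a hard failure, depending on how strict the deployment needs to be.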

Graph Partitioning

ONNX Runtime partitions the graph across execution providers:

Capability Query

Each EP implements GetCapability() to report which nodes it can execute:

// Simplified EP capability interface
class IExecutionProvider {
  virtual std::vector<std::unique_ptr<ComputeCapability>>
  GetCapability(
    const GraphViewer& graph_viewer,
    const IKernelLookup& kernel_lookup
  ) const;
};

Use verbose logging to see how nodes are partitioned:

sess_options = ort.SessionOptions()
sess_options.log_severity_level = 0  # Verbose
session = ort.InferenceSession("model.onnx", sess_options)
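
Verbose logs can be long. The profiler is another way to see where nodes ran: set `sess_options.enable_profiling = True`, run the model, and call `session.end_profiling()` to get the path of a JSON trace. The sketch below tallies kernel events per provider. It assumes the trace is a list of event dictionaries in which node kernel events have `cat` equal to `"Node"`, a name ending in `_kernel_time`, and an `args.provider` field; verify this layout against a trace from your ONNX Runtime version. The sample events are hand-written for illustration, not real output.

```python
from collections import Counter


def kernel_events_per_provider(events):
    """Count node kernel events per execution provider in a profiling trace."""
    counts = Counter()
    for event in events:
        is_kernel = event.get("cat") == "Node" and str(
            event.get("name", "")
        ).endswith("_kernel_time")
        if is_kernel:
            provider = event.get("args", {}).get("provider", "unknown")
            counts[provider] += 1
    return dict(counts)


# Hand-written sample in the assumed trace layout
sample_events = [
    {"cat": "Session", "name": "model_run", "args": {}},
    {"cat": "Node", "name": "conv1_kernel_time",
     "args": {"op_name": "Conv", "provider": "CUDAExecutionProvider"}},
    {"cat": "Node", "name": "relu1_kernel_time",
     "args": {"op_name": "Relu", "provider": "CUDAExecutionProvider"}},
    {"cat": "Node", "name": "nms_kernel_time",
     "args": {"op_name": "NonMaxSuppression", "provider": "CPUExecutionProvider"}},
]
print(kernel_events_per_provider(sample_events))
# {'CUDAExecutionProvider': 2, 'CPUExecutionProvider': 1}
```

For a real trace, load the file with `json.load` and pass the resulting list to the function. The counts are per kernel execution, so a trace covering several runs counts each node once per run.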

Data Transfer

Execution providers manage data transfer between memory spaces:

Memory Locations

CPU memory: Host memory accessible by CPU
GPU memory: Device memory on GPU
Shared memory: Accessible by both CPU and GPU

IOBinding for Efficient Transfer

Use IOBinding to avoid unnecessary data copies:

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model.onnx", providers=['CUDAExecutionProvider'])

# Create IOBinding
io_binding = session.io_binding()

# Bind a CPU-resident input; ONNX Runtime copies it to the GPU as needed
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
io_binding.bind_cpu_input('input', input_data)

# Let ONNX Runtime allocate the output on the GPU
io_binding.bind_output('output', 'cuda')

# Run on GPU
session.run_with_iobinding(io_binding)

# Copy the output from the GPU into a NumPy array on the CPU
output = io_binding.copy_outputs_to_cpu()[0]

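Outputs can also be written into a pre-allocated buffer:

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model.onnx")
io_binding = session.io_binding()

input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Pre-allocate output buffer. This requires a fully static output shape;
# dynamic dimensions are reported as strings or None and need concrete values.
output_shape = session.get_outputs()[0].shape
output_buffer = np.empty(output_shape, dtype=np.float32)

# Bind to pre-allocated memory by passing its element type, shape, and address
io_binding.bind_cpu_input('input', input_data)
io_binding.bind_output(
    'output',
    device_type='cpu',
    device_id=0,
    element_type=np.float32,
    shape=output_buffer.shape,
    buffer_ptr=output_buffer.ctypes.data,
)

session.run_with_iobinding(io_binding)
# Output is now in output_buffer, no copy needed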
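
Pre-allocating an output buffer only works with a fully concrete shape, while model metadata often reports dynamic dimensions as names (strings) or `None`. The helper below is a minimal sketch for filling those in; `concrete_shape` and the `dim_values` mapping are hypothetical, and the dimension name `batch` is only an example.

```python
def concrete_shape(shape, dim_values):
    """Replace dynamic dimensions (str or None) with caller-supplied sizes."""
    resolved = []
    for dim in shape:
        if isinstance(dim, int):
            resolved.append(dim)  # already static
        elif dim in dim_values:
            resolved.append(dim_values[dim])
        else:
            raise ValueError(f"no value given for dynamic dimension {dim!r}")
    return tuple(resolved)


# Example: metadata shape with a named batch dimension
print(concrete_shape(["batch", 1000], {"batch": 4}))
# (4, 1000)
```

The resulting tuple can be used as the shape for `np.empty` before binding the buffer.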

Custom Execution Providers

You can implement custom execution providers for specialized hardware:

// Simplified custom EP structure
class CustomExecutionProvider : public IExecutionProvider {
public:
  CustomExecutionProvider(const std::string& type, OrtDevice device)
      : IExecutionProvider(type, device) {}

  // Report which nodes this EP can execute
  std::vector<std::unique_ptr<ComputeCapability>>
  GetCapability(const GraphViewer& graph,
                const IKernelLookup& kernel_lookup) const override {
    std::vector<std::unique_ptr<ComputeCapability>> capabilities;
    // Inspect graph and add one ComputeCapability per supported subgraph
    return capabilities;
  }

  // Get kernel registry
  std::shared_ptr<KernelRegistry> GetKernelRegistry() const override {
    return kernel_registry_;
  }

  // Data transfer implementation
  // (CustomDataTransfer is a user-defined IDataTransfer subclass, not shown)
  std::unique_ptr<IDataTransfer> GetDataTransfer() const override {
    return std::make_unique<CustomDataTransfer>();
  }

private:
  // Holds the kernels this EP registers
  std::shared_ptr<KernelRegistry> kernel_registry_;
};

Building custom execution providers requires compiling ONNX Runtime from source. See the Custom Operators Guide for details.

Performance Considerations

Provider Selection

Choose the right provider for your hardware:

CPU: Good for small models, low latency, or no GPU available
CUDA: Best for NVIDIA GPUs, good operator coverage
TensorRT: Maximum performance on NVIDIA GPUs, longer warmup
DirectML: Cross-vendor on Windows, good for client applications

Graph Partitioning Overhead

Minimize data transfer between providers:

Prefer EPs that can execute entire subgraphs
CPU-GPU transfers are expensive
Use IOBinding to reduce copies
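
A rough way to reason about this overhead is to count how often execution switches providers along the graph. The sketch below does this for a simple linear chain of nodes; `count_provider_switches` is a hypothetical helper, and real graphs branch, so treat the number as an indicator rather than a measurement.

```python
def count_provider_switches(assignments):
    """Count adjacent pairs of nodes that run on different providers."""
    return sum(
        1 for previous, current in zip(assignments, assignments[1:])
        if previous != current
    )


# Node-by-node provider assignment for a linear chain of four nodes
chain = ["CUDA", "CUDA", "CPU", "CUDA"]
print(count_provider_switches(chain))
# 2
```

Each switch between a GPU provider and the CPU provider implies at least one copy across memory spaces, so a partition with fewer switches is usually preferable even if a few nodes run on a slower provider.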

Memory Management

Configure memory limits appropriately:

cuda_options = {
    'gpu_mem_limit': 4 * 1024 * 1024 * 1024,  # 4GB
    'arena_extend_strategy': 'kSameAsRequested',  # More predictable
}

Warmup Runs

First inference may be slower due to:

Kernel compilation
Memory allocation
Engine building (TensorRT)

Run warmup inferences before measuring performance:

# Warmup
for _ in range(10):
    session.run(None, {"input": dummy_input})

# Now measure performance
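
One way to structure the measurement is a small helper that separates warmup from timed runs. This is a sketch using only the standard library; `benchmark` and its defaults are hypothetical, and `run_once` stands for any zero-argument callable such as `lambda: session.run(None, {"input": dummy_input})`.

```python
import statistics
import time


def benchmark(run_once, warmup=10, iterations=50):
    """Run warmup calls, then time `iterations` calls and summarize in ms."""
    for _ in range(warmup):
        run_once()  # results discarded; absorbs one-time setup costs
    timings_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(timings_ms),
        "min_ms": min(timings_ms),
        "max_ms": max(timings_ms),
    }


# Stand-in workload so the example runs without a model
stats = benchmark(lambda: sum(range(1000)), warmup=3, iterations=20)
print(sorted(stats))
# ['max_ms', 'median_ms', 'min_ms']
```

The median is reported rather than the mean because a few slow outliers, for example from memory arena growth, can otherwise dominate the result.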

Troubleshooting

Provider Not Available

import onnxruntime as ort

if 'CUDAExecutionProvider' not in ort.get_available_providers():
    print("CUDA provider not available")
    print("Available:", ort.get_available_providers())
    # Fallback to CPU

Mixed Precision Issues

Some providers can run at reduced precision, which may change numerical results. For example, TensorRT can be configured for FP16:

# TensorRT with FP16
trt_options = {
    'trt_fp16_enable': True,
    'trt_strict_type_constraints': False,  # Allow mixed precision
}

FP16 may produce different results than FP32. Always validate accuracy when using reduced precision.
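
A validation step can be as simple as comparing the reduced-precision output against an FP32 reference on the same inputs. The sketch below works on flat sequences of floats, with a tolerance check modeled on the form used by `numpy.allclose`; `outputs_match` and the default tolerances are assumptions that should be tuned per model.

```python
def outputs_match(reference, candidate, rtol=1e-2, atol=1e-3):
    """Check |candidate - reference| <= atol + rtol * |reference| elementwise."""
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    return all(
        abs(got - ref) <= atol + rtol * abs(ref)
        for ref, got in zip(reference, candidate)
    )


fp32_output = [0.7012, 0.1988, 0.1000]
fp16_output = [0.7010, 0.1990, 0.1001]  # illustrative values, not real results
print(outputs_match(fp32_output, fp16_output))
# True
```

With NumPy arrays, `array.ravel().tolist()` produces a suitable flat sequence, or `np.allclose` can be used directly. Checking task-level metrics such as top-1 agreement is a useful complement to elementwise tolerances.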

Memory Errors

Reduce memory usage:

cuda_options = {
    'gpu_mem_limit': 1 * 1024 * 1024 * 1024,  # Reduce limit
    'arena_extend_strategy': 'kSameAsRequested',
}

Best Practices

Always Include CPU

Always include CPUExecutionProvider as fallback:

providers=['CUDAExecutionProvider', 'CPUExecutionProvider']

Test on Target Hardware

Performance varies significantly across hardware. Always profile on deployment targets.

Use IOBinding

Use IOBinding when running repeated inferences so that inputs and outputs can stay on the device and redundant copies are avoided.

Cache Engines

Enable engine caching for TensorRT:

{'trt_engine_cache_enable': True}

Next Steps

Sessions

Learn about InferenceSession configuration and management

Graph Optimizations

Understand how graph optimizations improve performance

Performance Tuning

Optimize inference performance for your use case

Quantization

Reduce model size and improve speed with quantization
