Execution Providers (EPs) are the interfaces that enable ONNX Runtime to execute models on different hardware platforms. They provide hardware-specific optimizations and acceleration capabilities.
What are Execution Providers?
Execution Providers abstract the hardware-specific implementation details, allowing ONNX Runtime to:
- Accelerate inference using specialized hardware (GPUs, NPUs, etc.)
- Optimize operators for specific hardware architectures
- Manage memory efficiently on target devices
- Handle data transfer between different memory spaces
Think of Execution Providers as “backends” or “device drivers” for ONNX Runtime, similar to how TensorFlow has device placements or PyTorch has device types.
Architecture Overview
How EPs Work
Graph Partitioning
ONNX Runtime partitions the graph across available EPs based on their capabilities
Available Execution Providers
CPUExecutionProvider
The default execution provider, always available.
Platforms
- Windows, Linux, macOS
- x86_64, ARM64, ARM32
- WebAssembly
Features
- Comprehensive operator coverage
- SIMD optimizations (SSE, AVX, NEON)
- Multi-threading support
- Reference implementation
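The multi-threading support listed above can be configured per session. The following is a minimal sketch, assuming the `onnxruntime` Python package; the helper names, the thread counts, and the model path passed in are illustrative rather than part of the original page.

```python
def cpu_thread_settings(intra_op: int = 4, inter_op: int = 1) -> dict:
    """Collect the threading values this sketch applies to SessionOptions."""
    if intra_op < 0 or inter_op < 0:
        raise ValueError("thread counts must be >= 0 (0 lets ONNX Runtime decide)")
    return {"intra_op_num_threads": intra_op, "inter_op_num_threads": inter_op}


def create_cpu_session(model_path: str, intra_op: int = 4):
    # Imported lazily so the helper above can be used without onnxruntime installed.
    import onnxruntime as ort

    options = ort.SessionOptions()
    for name, value in cpu_thread_settings(intra_op).items():
        setattr(options, name, value)
    return ort.InferenceSession(
        model_path, sess_options=options, providers=["CPUExecutionProvider"]
    )
```

A call such as `create_cpu_session("model.onnx", intra_op=2)` (hypothetical file name) would run every node on the CPU provider with two intra-op threads.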
CUDAExecutionProvider
NVIDIA GPU acceleration using CUDA.
CUDA Provider Options
| Option | Description | Default |
|---|---|---|
| device_id | GPU device ID | 0 |
| gpu_mem_limit | Maximum GPU memory usage (bytes) | SIZE_MAX |
| arena_extend_strategy | Memory arena growth strategy | kNextPowerOfTwo |
| cudnn_conv_algo_search | cuDNN convolution algorithm search | EXHAUSTIVE |
| do_copy_in_default_stream | Use default CUDA stream for copies | True |
| cudnn_conv_use_max_workspace | Use maximum workspace for cuDNN | True |
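In the Python API, options like those in the table are usually passed as a `(name, options)` tuple in the `providers` list. The sketch below assumes the `onnxruntime-gpu` package; the helper names and the model path are hypothetical, and the option values simply repeat the defaults from the table apart from `device_id`.

```python
def cuda_provider_entry(device_id: int = 0) -> tuple:
    """Build a (provider name, options) pair for the providers argument."""
    options = {
        "device_id": device_id,
        "arena_extend_strategy": "kNextPowerOfTwo",
        "cudnn_conv_algo_search": "EXHAUSTIVE",
        "do_copy_in_default_stream": True,
    }
    return ("CUDAExecutionProvider", options)


def create_cuda_session(model_path: str, device_id: int = 0):
    # Imported lazily so the helper above can be used without onnxruntime installed.
    import onnxruntime as ort

    # CPU is listed last so unsupported nodes still have somewhere to run.
    providers = [cuda_provider_entry(device_id), "CPUExecutionProvider"]
    return ort.InferenceSession(model_path, providers=providers)
```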
TensorRTExecutionProvider
Optimized inference using NVIDIA TensorRT.
DirectMLExecutionProvider
Hardware acceleration on Windows using DirectML. In the Python API this provider is registered under the name DmlExecutionProvider.
Advantages
- Works with any DirectX 12 GPU
- AMD, Intel, NVIDIA support
- Built into Windows
Use Cases
- Windows client applications
- Cross-vendor GPU support
- Integrated graphics
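A DirectML session could be set up roughly as follows. This is a sketch, assuming a DirectML-enabled build of ONNX Runtime (for example the `onnxruntime-directml` package); the provider is requested by its registered name `DmlExecutionProvider`, and the helper names and model path are hypothetical. The two session settings follow the DirectML EP guidance to disable memory-pattern optimization and use sequential execution.

```python
def dml_providers(device_id: int = 0) -> list:
    """Provider list that prefers DirectML on the given adapter, then CPU."""
    return [("DmlExecutionProvider", {"device_id": device_id}), "CPUExecutionProvider"]


def create_dml_session(model_path: str, device_id: int = 0):
    # Imported lazily so the helper above can be used without onnxruntime installed.
    import onnxruntime as ort

    options = ort.SessionOptions()
    options.enable_mem_pattern = False
    options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
    return ort.InferenceSession(
        model_path, sess_options=options, providers=dml_providers(device_id)
    )
```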
CoreMLExecutionProvider
Apple Silicon and iOS acceleration.
Additional Execution Providers
OpenVINO (Intel)
Intel CPU, GPU, VPU, and FPGA acceleration.
NNAPI (Android)
Android Neural Networks API.
ACL (ARM)
ARM Compute Library for ARM CPUs.
ROCm (AMD)
AMD GPU acceleration.
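Each of these providers is only present in builds compiled with it. The sketch below maps the four providers above to the names they are registered under and filters them against what a given build reports; the dictionary and function names are illustrative.

```python
# Registered provider names for the additional EPs listed above.
ADDITIONAL_PROVIDERS = {
    "OpenVINO": "OpenVINOExecutionProvider",
    "NNAPI": "NnapiExecutionProvider",
    "ACL": "ACLExecutionProvider",
    "ROCm": "ROCMExecutionProvider",
}


def usable_additional_providers(available) -> dict:
    """Keep only the additional providers that appear in `available`."""
    return {
        label: name
        for label, name in ADDITIONAL_PROVIDERS.items()
        if name in available
    }
```

In practice `available` would come from `onnxruntime.get_available_providers()`.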
EP Selection and Fallback
Provider Priority
Execution providers are tried in the order specified. If a provider cannot execute a node, ONNX Runtime assigns that node to the next provider in the list. CPU is typically the last fallback.
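The priority order described above is expressed through the order of the `providers` list. A minimal sketch, assuming the `onnxruntime` Python package: the preferred order, helper names, and model path are illustrative. Note that the TensorRT provider is registered as `TensorrtExecutionProvider`.

```python
PREFERRED_ORDER = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]


def select_providers(available, preferred=PREFERRED_ORDER) -> list:
    """Keep the preferred order, dropping providers this build does not offer."""
    return [name for name in preferred if name in available]


def create_session(model_path: str):
    # Imported lazily so the helper above can be used without onnxruntime installed.
    import onnxruntime as ort

    providers = select_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```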
Checking Active Providers
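Requesting a provider does not guarantee that the session uses it, so it helps to compare the request with what the session reports. The sketch below assumes the `onnxruntime` Python package and a list of provider names as strings; the helper names are hypothetical.

```python
def report_providers(requested, active) -> dict:
    """Split requested provider names into those registered and those dropped."""
    return {
        "active": [p for p in requested if p in active],
        "dropped": [p for p in requested if p not in active],
    }


def check_session(model_path: str, requested):
    # Imported lazily so the helper above can be used without onnxruntime installed.
    import onnxruntime as ort

    session = ort.InferenceSession(model_path, providers=requested)
    # get_providers() lists the providers registered on this session.
    return report_providers(requested, session.get_providers())
```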
Graph Partitioning
ONNX Runtime partitions the graph across execution providers.
Capability Query
Each EP implements GetCapability() to report which nodes it can execute.
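The capability query can be illustrated with a toy model. The code below is not ONNX Runtime's implementation: it treats the graph as a flat list of operator names, and `ToyProvider`, `partition`, and `count_transitions` are invented for this sketch. It only shows the idea that providers claim nodes in priority order and that each change of provider along the graph implies a hand-off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyProvider:
    name: str
    supported_ops: frozenset

    def get_capability(self, nodes):
        """Return the indices of the nodes this provider can execute."""
        return [i for i, op in enumerate(nodes) if op in self.supported_ops]


def partition(nodes, providers):
    """Assign each node to the first provider, in priority order, that claims it."""
    assignment = [None] * len(nodes)
    for provider in providers:
        for i in provider.get_capability(nodes):
            if assignment[i] is None:
                assignment[i] = provider.name
    return assignment


def count_transitions(assignment):
    """Count adjacent node pairs that run on different providers."""
    return sum(1 for a, b in zip(assignment, assignment[1:]) if a != b)


nodes = ["Conv", "Relu", "NonMaxSuppression", "Conv"]
gpu = ToyProvider("gpu", frozenset({"Conv", "Relu"}))
cpu = ToyProvider("cpu", frozenset({"Conv", "Relu", "NonMaxSuppression"}))
assignment = partition(nodes, [gpu, cpu])
```

Here `assignment` is `["gpu", "gpu", "cpu", "gpu"]`, which gives two provider transitions: one unsupported operator in the middle of the graph forces a round trip through the CPU.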
Data Transfer
Execution providers manage data transfer between memory spaces.
Memory Locations
- CPU memory: Host memory accessible by CPU
- GPU memory: Device memory on GPU
- Shared memory: Accessible by both CPU and GPU
IOBinding for Efficient Transfer
Use IOBinding to avoid unnecessary data copies.
Custom Execution Providers
You can implement custom execution providers for specialized hardware.
Building custom execution providers requires compiling ONNX Runtime from source. See the Custom Operators Guide for details.
Performance Considerations
Provider Selection
Choose the right provider for your hardware:
- CPU: Good for small models, low latency, or no GPU available
- CUDA: Best for NVIDIA GPUs, good operator coverage
- TensorRT: Maximum performance on NVIDIA GPUs, longer warmup
- DirectML: Cross-vendor on Windows, good for client applications
Graph Partitioning Overhead
Minimize data transfer between providers:
- Prefer EPs that can execute entire subgraphs
- CPU-GPU transfers are expensive
- Use IOBinding to reduce copies
Memory Management
Configure memory limits appropriately:
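One way to cap GPU memory is through the CUDA provider options. This is a sketch under assumptions: the 4 GiB figure and the helper name are illustrative, and `kSameAsRequested` is chosen here so the arena grows by the requested amount rather than in power-of-two steps.

```python
GIB = 1024 ** 3


def memory_limited_cuda_options(limit_bytes: int) -> dict:
    """CUDA provider options that cap the memory arena at `limit_bytes`."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return {
        "gpu_mem_limit": limit_bytes,
        "arena_extend_strategy": "kSameAsRequested",
    }


# Provider list that could be passed to onnxruntime.InferenceSession(..., providers=providers).
providers = [
    ("CUDAExecutionProvider", memory_limited_cuda_options(4 * GIB)),
    "CPUExecutionProvider",
]
```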
Warmup Runs
The first inference may be slower due to:
- Kernel compilation
- Memory allocation
- Engine building (TensorRT)
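Because of these one-time costs, timing measurements usually exclude the first few runs. The helper below is a generic sketch, not an ONNX Runtime API; `run_once` stands for any zero-argument callable, such as a lambda wrapping `session.run(None, inputs)` with a hypothetical `session` and `inputs`.

```python
import time


def benchmark(run_once, warmup: int = 3, iterations: int = 10) -> float:
    """Return the mean seconds per call, excluding the warmup calls."""
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    for _ in range(warmup):
        run_once()  # pays for kernel compilation, allocation, engine building
    start = time.perf_counter()
    for _ in range(iterations):
        run_once()
    return (time.perf_counter() - start) / iterations
```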
Troubleshooting
Provider Not Available
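A frequent cause is that the installed package was built without the provider, for example the CPU-only `onnxruntime` package instead of `onnxruntime-gpu`. A quick check is to fail fast when a required provider is missing. The helper names below are hypothetical; `get_available_providers()` is the call that lists what the installed build offers.

```python
def require_provider(name: str, available) -> str:
    """Return `name` if it is available, otherwise raise with a helpful message."""
    if name not in available:
        offered = ", ".join(available) or "none"
        raise RuntimeError(f"{name} is not available; this build offers: {offered}")
    return name


def require_installed_provider(name: str) -> str:
    # Imported lazily so the helper above can be used without onnxruntime installed.
    import onnxruntime as ort

    return require_provider(name, ort.get_available_providers())
```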
Mixed Precision Issues
Some providers support different precisions.
Memory Errors
Reduce memory usage.
Best Practices
Always Include CPU
Always include CPUExecutionProvider as fallback:
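A small helper can enforce this for any provider list. The function name is illustrative; it accepts both plain names and `(name, options)` tuples.

```python
def with_cpu_fallback(providers) -> list:
    """Return a copy of `providers` with CPUExecutionProvider appended if absent."""
    names = [p[0] if isinstance(p, tuple) else p for p in providers]
    result = list(providers)
    if "CPUExecutionProvider" not in names:
        result.append("CPUExecutionProvider")
    return result
```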
Test on Target Hardware
Performance varies significantly across hardware. Always profile on deployment targets.
Use IOBinding
Use IOBinding for better performance when doing multiple inferences.
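A minimal IOBinding sketch, assuming the `onnxruntime` Python package, a CUDA device, and a model with one input and one output. The function name and the tensor names passed in are hypothetical. The input is copied to the device once, and the output is left on the device instead of being copied back after every run.

```python
def run_on_device(session, input_name, output_name, array, device="cuda", device_id=0):
    """Run `session` with input and output bound to device memory."""
    # Imported lazily so this module can be imported without onnxruntime installed.
    import onnxruntime as ort

    # Copy the NumPy input to the device once.
    x = ort.OrtValue.ortvalue_from_numpy(array, device, device_id)

    binding = session.io_binding()
    binding.bind_ortvalue_input(input_name, x)
    # Let ONNX Runtime allocate the output on the same device.
    binding.bind_output(output_name, device, device_id)

    session.run_with_iobinding(binding)
    # The returned OrtValue stays on the device; call .numpy() to copy it to the host.
    return binding.get_outputs()[0]
```

When the same input is reused across many inferences, the binding can be created once and `run_with_iobinding` called repeatedly.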
Cache Engines
Enable engine caching for TensorRT:
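Engine caching is controlled through TensorRT provider options. The sketch below assumes a TensorRT-enabled build of ONNX Runtime; the cache directory, helper names, and model path are hypothetical. With caching enabled, later sessions can reuse the built engine from the cache directory instead of rebuilding it.

```python
def tensorrt_cache_options(cache_dir: str, fp16: bool = False) -> dict:
    """TensorRT provider options that persist built engines under `cache_dir`."""
    return {
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": cache_dir,
        "trt_fp16_enable": fp16,
    }


def create_tensorrt_session(model_path: str, cache_dir: str = "./trt_cache"):
    # Imported lazily so the helper above can be used without onnxruntime installed.
    import onnxruntime as ort

    providers = [
        ("TensorrtExecutionProvider", tensorrt_cache_options(cache_dir)),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ]
    return ort.InferenceSession(model_path, providers=providers)
```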
Next Steps
Sessions
Learn about InferenceSession configuration and management
Graph Optimizations
Understand how graph optimizations improve performance
Performance Tuning
Optimize inference performance for your use case
Quantization
Reduce model size and improve speed with quantization