CUDA Execution Provider
The CUDA Execution Provider enables GPU acceleration for ONNX Runtime on NVIDIA GPUs using the CUDA and cuDNN libraries.
When to Use CUDA EP
Use the CUDA Execution Provider when:
- You have NVIDIA GPUs (compute capability 6.0+)
- You need general-purpose GPU acceleration
- You want quick setup without TensorRT complexity
- You’re developing and testing before optimizing with TensorRT
- Your model has operators not supported by TensorRT
Prerequisites
Hardware Requirements
- NVIDIA GPU with compute capability 6.0 or higher
- Recommended: 4GB+ GPU memory
Software Requirements
- CUDA Toolkit: 11.8 or 12.x
- cuDNN: 8.x or 9.x, depending on your ONNX Runtime release and CUDA version (check the official compatibility matrix)
- ONNX Runtime GPU package
Installation
Python
# Install ONNX Runtime with GPU support
pip install onnxruntime-gpu
# Verify CUDA is available
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
# Should include 'CUDAExecutionProvider'
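The availability check above can also be done in code before a session is created. The following is a minimal sketch: `pick_providers` is a hypothetical helper (not part of ONNX Runtime) that takes the list returned by `ort.get_available_providers()` and puts CUDA first only when it is actually present.

```python
def pick_providers(available):
    """Return a provider priority list, preferring CUDA when it is available.

    `available` is expected to be the list returned by
    onnxruntime.get_available_providers(). This helper is illustrative
    and not part of the ONNX Runtime API.
    """
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    chosen = [p for p in preferred if p in available]
    # Always keep a CPU fallback so the resulting list is never empty.
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


# Example with hard-coded lists, so no GPU is needed to run this snippet.
print(pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
print(pick_providers(["CPUExecutionProvider"]))
```

In a real script, pass `ort.get_available_providers()` as the argument and hand the result to `ort.InferenceSession(..., providers=...)`.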
C++
Download the GPU build from the ONNX Runtime releases page:
# Linux
wget https://github.com/microsoft/onnxruntime/releases/download/v{version}/onnxruntime-linux-x64-gpu-{version}.tgz
tar -xzf onnxruntime-linux-x64-gpu-{version}.tgz
C#
# Install the NuGet package
dotnet add package Microsoft.ML.OnnxRuntime.Gpu
Basic Usage
Python
import onnxruntime as ort
import numpy as np
# Create session with CUDA provider
session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)
# Prepare input
input_name = session.get_inputs()[0].name
x = np.random.randn(1, 3, 224, 224).astype(np.float32)
# Run inference
results = session.run(None, {input_name: x})
C++
#include <onnxruntime_cxx_api.h>
Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "CUDAExample");
Ort::SessionOptions session_options;
// Configure CUDA provider
OrtCUDAProviderOptions cuda_options;
cuda_options.device_id = 0;
cuda_options.arena_extend_strategy = 0;  // 0 = kNextPowerOfTwo, 1 = kSameAsRequested
cuda_options.cudnn_conv_algo_search = OrtCudnnConvAlgoSearchExhaustive;
cuda_options.do_copy_in_default_stream = 1;
session_options.AppendExecutionProvider_CUDA(cuda_options);
// Create session
Ort::Session session(env, "model.onnx", session_options);
// Run inference (input_names, input_tensor, and output_names are assumed to be prepared beforehand)
auto output_tensors = session.Run(Ort::RunOptions{nullptr},
                                  input_names.data(),
                                  &input_tensor, 1,
                                  output_names.data(), 1);
C#
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
var sessionOptions = new SessionOptions();
sessionOptions.AppendExecutionProvider_CUDA(0);
using var session = new InferenceSession("model.onnx", sessionOptions);
var inputMeta = session.InputMetadata;
var name = inputMeta.Keys.First();
// Dynamic dimensions are reported as -1; replace them with concrete sizes before allocating
var shape = inputMeta[name].Dimensions;
var tensor = new DenseTensor<float>(shape);
var inputs = new List<NamedOnnxValue> { NamedOnnxValue.CreateFromTensor(name, tensor) };
using var results = session.Run(inputs);
Configuration Options
Python Provider Options
import onnxruntime as ort
session = ort.InferenceSession(
    "model.onnx",
    providers=[
        ('CUDAExecutionProvider', {
            # GPU device ID (0, 1, 2, etc.)
            'device_id': 0,
            # Memory arena configuration
            'arena_extend_strategy': 'kNextPowerOfTwo',  # or 'kSameAsRequested'
            'gpu_mem_limit': 2 * 1024 * 1024 * 1024,  # 2 GiB limit, in bytes
            # cuDNN convolution algorithm search
            'cudnn_conv_algo_search': 'EXHAUSTIVE',  # or 'HEURISTIC', 'DEFAULT'
            # Use default stream for memory copies
            'do_copy_in_default_stream': True,
            # Enable CUDA graph capture for better performance
            'enable_cuda_graph': False,
            # Use TF32 for matrix operations (Ampere and newer GPUs)
            'use_tf32': True,
            # Prefer NHWC layout for better performance
            'prefer_nhwc': False,
            # Enable tunable operators
            'tunable_op_enable': False,
            'tunable_op_tuning_enable': False,
        }),
        'CPUExecutionProvider'
    ]
)
Key Configuration Parameters
device_id
Specifies which GPU to use (0, 1, 2, etc.). Use when you have multiple GPUs.
# Use second GPU
providers=[('CUDAExecutionProvider', {'device_id': 1})]
gpu_mem_limit
Caps the size, in bytes, of the GPU memory arena used by the CUDA Execution Provider. Useful to prevent out-of-memory (OOM) errors or to let multiple processes share one GPU.
# Limit to 4GB
'gpu_mem_limit': 4 * 1024 * 1024 * 1024
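Because `gpu_mem_limit` is given in bytes, a small conversion helper can make the intent easier to read. This is a sketch only: `cuda_options_with_limit` and `GIB` are hypothetical names introduced here, not part of ONNX Runtime.

```python
GIB = 1024 ** 3  # number of bytes in one gibibyte


def cuda_options_with_limit(limit_gib, device_id=0):
    """Build a CUDA provider options dict with a memory limit given in GiB.

    Hypothetical helper for illustration; only the 'device_id' and
    'gpu_mem_limit' keys come from the documentation above.
    """
    if limit_gib <= 0:
        raise ValueError("limit_gib must be positive")
    return {
        "device_id": device_id,
        "gpu_mem_limit": int(limit_gib * GIB),  # the option expects bytes
    }


options = cuda_options_with_limit(4)
print(options["gpu_mem_limit"])  # 4294967296
```

The resulting dict can be passed as the second element of the `('CUDAExecutionProvider', {...})` tuple shown earlier.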
cudnn_conv_algo_search
Controls how cuDNN selects convolution algorithms:
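- EXHAUSTIVE: Benchmarks the available algorithms and picks the fastest; slowest first run, best steady-state performance (default)
- HEURISTIC: Lightweight heuristic selection with a fast first run; good for development
- DEFAULT: Skips the search and uses a fixed default algorithm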
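Since the search mode is passed as a plain string, a typo is easy to miss. The sketch below validates the value against the three modes listed above before it reaches the session. `conv_search_option` is a hypothetical helper, not an ONNX Runtime function.

```python
VALID_CONV_SEARCH_MODES = ("EXHAUSTIVE", "HEURISTIC", "DEFAULT")


def conv_search_option(mode):
    """Return a provider-options fragment for cudnn_conv_algo_search.

    Hypothetical helper: normalizes case and rejects unknown modes so that
    typos fail early instead of at session creation.
    """
    normalized = mode.strip().upper()
    if normalized not in VALID_CONV_SEARCH_MODES:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {VALID_CONV_SEARCH_MODES}"
        )
    return {"cudnn_conv_algo_search": normalized}


print(conv_search_option("heuristic"))
```

One possible pattern is to use `HEURISTIC` while iterating on a model and switch to `EXHAUSTIVE` for long-running deployments, where the slower first run is amortized.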
enable_cuda_graph
Captures CUDA operations into a graph and replays it on later runs, which reduces kernel launch overhead. Requires static input shapes.
'enable_cuda_graph': True
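Before enabling CUDA graphs, it can help to check that the model inputs really have static shapes. The sketch below assumes that the shapes come from `session.get_inputs()[i].shape`, and that dynamic dimensions appear there as strings (symbolic names) or `None` rather than integers. `has_static_shape` is a hypothetical helper.

```python
def has_static_shape(shape):
    """Return True if every dimension is a concrete positive integer.

    Assumption: dynamic dimensions are represented as strings or None,
    as in the shape lists reported for model inputs.
    """
    return all(
        isinstance(dim, int) and not isinstance(dim, bool) and dim > 0
        for dim in shape
    )


print(has_static_shape([1, 3, 224, 224]))        # True
print(has_static_shape(["batch", 3, 224, 224]))  # False
print(has_static_shape([None, 128]))             # False
```

A caller could set `'enable_cuda_graph'` to `True` only when this check passes for every input of the session.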
use_tf32
Uses TensorFloat-32 (TF32) on NVIDIA Ampere and newer GPUs (for example RTX 30/40 series, A100, H100) for faster matrix operations with minimal accuracy impact.
'use_tf32': True # Default on Ampere+ GPUs
Memory Management
Arena Allocation Strategy
# Allocate memory in power-of-two chunks (default)
'arena_extend_strategy': 'kNextPowerOfTwo'
# Allocate exact amount needed (may reduce waste)
'arena_extend_strategy': 'kSameAsRequested'
Set Memory Limit
# Prevent OOM, allow multi-process usage
'gpu_mem_limit': 2 * 1024 * 1024 * 1024 # 2GB
I/O Binding (Zero-Copy)
Avoid CPU-GPU data transfers by binding GPU memory directly:
import onnxruntime as ort
import numpy as np
session = ort.InferenceSession("model.onnx", providers=['CUDAExecutionProvider'])
# Create I/O binding
io_binding = session.io_binding()
# Bind input to GPU
input_name = session.get_inputs()[0].name
x = np.random.randn(1, 3, 224, 224).astype(np.float32)
x_ortvalue = ort.OrtValue.ortvalue_from_numpy(x, 'cuda', 0)
io_binding.bind_input(
    name=input_name,
    device_type='cuda',
    device_id=0,
    element_type=np.float32,
    shape=x.shape,
    buffer_ptr=x_ortvalue.data_ptr()
)
# Bind output to GPU
output_name = session.get_outputs()[0].name
io_binding.bind_output(output_name, 'cuda')
# Run inference
session.run_with_iobinding(io_binding)
outputs = io_binding.copy_outputs_to_cpu()
CUDA Streams
Use custom CUDA streams for advanced control:
import onnxruntime as ort
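import torch  # Used here only to create a CUDA stream
# Keep a reference to the stream object so it stays alive while the session uses it
stream = torch.cuda.Stream()
cuda_stream = stream.cuda_stream  # Raw stream handle as an integer
session = ort.InferenceSession(
    "model.onnx",
    providers=[(
        'CUDAExecutionProvider', {
            'has_user_compute_stream': 1,
            'user_compute_stream': cuda_stream
        }
    )]
)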
Multi-GPU
Run different sessions on different GPUs:
import onnxruntime as ort
from multiprocessing import Process
def run_on_gpu(gpu_id, model_path):
    session = ort.InferenceSession(
        model_path,
        providers=[('CUDAExecutionProvider', {'device_id': gpu_id})]
    )
    # Run inference...

# The main guard is required when processes are started with the "spawn" method (e.g. on Windows)
if __name__ == "__main__":
    # Launch one process per GPU
    processes = []
    for gpu_id in [0, 1, 2, 3]:
        p = Process(target=run_on_gpu, args=(gpu_id, "model.onnx"))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
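When each process handles part of a larger workload, the work items have to be split across GPUs. One simple option is round-robin assignment, sketched below. `shard_round_robin` is a hypothetical helper and the file names are placeholders, not anything ONNX Runtime provides.

```python
def shard_round_robin(items, gpu_ids):
    """Distribute items across GPU IDs in round-robin order.

    Returns a dict mapping each GPU ID to the list of items it should
    process. Purely illustrative; no GPU is touched here.
    """
    if not gpu_ids:
        raise ValueError("gpu_ids must not be empty")
    shards = {gpu_id: [] for gpu_id in gpu_ids}
    for index, item in enumerate(items):
        shards[gpu_ids[index % len(gpu_ids)]].append(item)
    return shards


# Placeholder file names standing in for real inputs.
files = ["a.npy", "b.npy", "c.npy", "d.npy", "e.npy"]
print(shard_round_robin(files, [0, 1]))
# {0: ['a.npy', 'c.npy', 'e.npy'], 1: ['b.npy', 'd.npy']}
```

Each shard can then be passed as an extra argument to the per-GPU worker function, so every process runs inference only on its own subset.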
Platform Support
| Platform | Support | Notes |
|---|---|---|
| Linux x64 | ✅ Full | Best performance |
| Windows x64 | ✅ Full | Full feature support |
| Linux ARM64 | ✅ Full | NVIDIA Jetson |
| Windows ARM64 | ⚠️ Limited | Experimental |
| macOS | ❌ No | Use CPU EP |
Supported GPUs
Desktop GPUs
- RTX 40 Series (Ada Lovelace)
- RTX 30 Series (Ampere)
- RTX 20 Series (Turing)
- GTX 16 Series (Turing)
- GTX 10 Series (Pascal)
Data Center GPUs
- H100, A100, A40, A30, A10 (Ampere/Hopper)
- V100, T4 (Volta/Turing)
- P100, P40 (Pascal)
Embedded/Edge
- Jetson AGX Orin
- Jetson Orin Nano/NX
- Jetson Xavier NX/AGX
- Jetson Nano (limited)
Troubleshooting
Provider Not Available
import onnxruntime as ort
print(ort.get_available_providers())
# If 'CUDAExecutionProvider' is missing:
# 1. Check CUDA/cuDNN installation
# 2. Verify onnxruntime-gpu is installed
# 3. Check CUDA version compatibility
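A session can also fall back to another provider even when CUDA is listed as available, so it is worth comparing what is available with what a session actually uses. The sketch below assumes `available` comes from `ort.get_available_providers()` and `active` from `session.get_providers()`. `describe_provider_state` is a hypothetical helper.

```python
def describe_provider_state(available, active):
    """Summarize whether the CUDA EP is installed and actually in use.

    `available`: providers compiled into the installed package.
    `active`: providers a specific session ended up with.
    """
    cuda = "CUDAExecutionProvider"
    if cuda not in available:
        return "CUDA EP not available: check onnxruntime-gpu and CUDA/cuDNN versions"
    if cuda not in active:
        return "CUDA EP available but not active: the session fell back to another provider"
    return "CUDA EP is active"


# Hard-coded lists so the snippet runs without a GPU.
print(describe_provider_state(
    ["CUDAExecutionProvider", "CPUExecutionProvider"],
    ["CPUExecutionProvider"],
))
```

The second message usually points at a runtime problem, such as missing CUDA or cuDNN libraries, rather than a missing package.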
Out of Memory Errors
# Set memory limit
session = ort.InferenceSession(
    "model.onnx",
    providers=[('CUDAExecutionProvider', {
        'gpu_mem_limit': 2 * 1024 * 1024 * 1024
    })]
)
# Or use smaller batch sizes
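Using smaller batch sizes usually means splitting one large input into chunks and running them one after another. A minimal sketch follows; `iter_batches` is a hypothetical helper that works on any sliceable sequence, including lists and NumPy arrays.

```python
def iter_batches(samples, batch_size):
    """Yield consecutive slices of `samples` with at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Ten dummy samples split into batches of four.
samples = list(range(10))
for batch in iter_batches(samples, 4):
    print(batch)
# [0, 1, 2, 3]
# [4, 5, 6, 7]
# [8, 9]
```

Each batch would then be passed to `session.run` separately, which keeps peak GPU memory roughly proportional to `batch_size` instead of the full input size.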
Slow Performance
- Enable EXHAUSTIVE conv search: 'cudnn_conv_algo_search': 'EXHAUSTIVE'
- Use I/O binding for repeated inference
- Enable CUDA graph if input shapes are static: 'enable_cuda_graph': True
- Check GPU utilization: use nvidia-smi to monitor GPU usage
Next Steps