DirectML Execution Provider
The DirectML Execution Provider enables GPU acceleration on Windows using DirectML, Microsoft’s hardware-accelerated DirectX 12 API for machine learning. DirectML supports any DirectX 12-capable GPU from NVIDIA, AMD, Intel, and Qualcomm.
When to Use DirectML EP
Use the DirectML Execution Provider when:
- You’re running on Windows 10 (1903+) or Windows 11
- You need cross-vendor GPU support (NVIDIA, AMD, Intel, Qualcomm)
- You’re developing Windows desktop applications
- You want to support a wide range of GPUs without driver-specific code
- You’re targeting Windows-on-ARM devices (Surface Pro X, etc.)
- You need NPU acceleration on compatible devices
Key Features
- Cross-Vendor: Works with NVIDIA, AMD, Intel, and Qualcomm GPUs
- Wide Hardware Support: Any DirectX 12-capable GPU
- NPU Support: Leverage Neural Processing Units on compatible hardware
- Windows Integration: Optimized for Windows platform
- Single API: No need for vendor-specific SDKs
Prerequisites
Hardware Requirements
- DirectX 12-capable GPU
- At least 2 GB of GPU memory recommended
Supported GPUs
- NVIDIA: Kepler (GTX 600 series) and newer
- AMD: GCN 1st Gen (Radeon HD 7000 series) and newer
- Intel: Haswell (4th-gen Core) HD integrated graphics and newer
- Qualcomm: Adreno 600 and newer
Software Requirements
- Windows 10 (1903+) or Windows 11
- ONNX Runtime DirectML package
- Up-to-date GPU drivers
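The Windows version requirement above can be checked at startup before a session is created. The following is a minimal sketch, assuming that Windows 10 version 1903 corresponds to build 18362; the helper names `meets_directml_os_requirement` and `current_os_is_supported` are hypothetical and not part of ONNX Runtime.

```python
import platform
import sys

# Assumption: Windows 10 version 1903 shipped as build 18362.
MIN_WINDOWS_BUILD = 18362


def meets_directml_os_requirement(system: str, build: int) -> bool:
    """Return True when the OS is Windows at build 18362 (version 1903) or later."""
    return system == "Windows" and build >= MIN_WINDOWS_BUILD


def current_os_is_supported() -> bool:
    """Apply the check to the machine this script is running on."""
    if platform.system() != "Windows":
        return False
    # sys.getwindowsversion() only exists on Windows builds of Python.
    return meets_directml_os_requirement("Windows", sys.getwindowsversion().build)
```

This covers only the operating system; it does not confirm that the GPU or its driver supports DirectX 12.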
Installation
Python
# Install ONNX Runtime with DirectML support
pip install onnxruntime-directml
# Verify DirectML is available
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
# Should include 'DmlExecutionProvider'
C++
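Download a DirectML-enabled build from the ONNX Runtime releases page. Replace {version} with the release you want, and confirm that the archive includes dml_provider_factory.h; the Microsoft.ML.OnnxRuntime.DirectML NuGet package also ships the native headers and binaries.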
# Download Windows DirectML package
Invoke-WebRequest -Uri "https://github.com/microsoft/onnxruntime/releases/download/v{version}/onnxruntime-win-x64-{version}.zip" -OutFile "onnxruntime.zip"
Expand-Archive onnxruntime.zip
C#/.NET
# Install NuGet package
dotnet add package Microsoft.ML.OnnxRuntime.DirectML
# For UWP applications
dotnet add package Microsoft.AI.MachineLearning
Basic Usage
Python
import onnxruntime as ort
import numpy as np
# Create session with DirectML provider
session = ort.InferenceSession(
"model.onnx",
providers=['DmlExecutionProvider']
)
# Prepare input
input_name = session.get_inputs()[0].name
x = np.random.randn(1, 3, 224, 224).astype(np.float32)
# Run inference
results = session.run(None, {input_name: x})
C++
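#include <onnxruntime_cxx_api.h>
#include <dml_provider_factory.h>
Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "DirectMLExample");
Ort::SessionOptions session_options;
// The DirectML EP does not support memory pattern optimization or parallel execution
session_options.DisableMemPattern();
session_options.SetExecutionMode(ExecutionMode::ORT_SEQUENTIAL);
// Add DirectML provider with device ID 0 (default GPU)
Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_DML(session_options, 0));
// Create session
const wchar_t* model_path = L"model.onnx";
Ort::Session session(env, model_path, session_options);
// Run inference. input_names, input_tensor, and output_names are assumed to be
// prepared by the caller (arrays of const char* names and an Ort::Value input).
auto output_tensors = session.Run(Ort::RunOptions{nullptr},
input_names.data(),
&input_tensor, 1,
output_names.data(), 1);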
C#/.NET
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
var sessionOptions = new SessionOptions();
sessionOptions.AppendExecutionProvider_DML(0); // Use default GPU
using var session = new InferenceSession("model.onnx", sessionOptions);
var inputMeta = session.InputMetadata;
var name = inputMeta.Keys.First();
var shape = inputMeta[name].Dimensions;
var tensor = new DenseTensor<float>(shape);
var inputs = new List<NamedOnnxValue> {
NamedOnnxValue.CreateFromTensor(name, tensor)
};
using var results = session.Run(inputs);
WinRT/UWP (C#)
using Microsoft.AI.MachineLearning;
// Load model
var modelFile = await StorageFile.GetFileFromApplicationUriAsync(
new Uri("ms-appx:///Assets/model.onnx")
);
var model = await LearningModel.LoadFromStorageFileAsync(modelFile);
// Option 1: create a session on the default device (the system chooses)
// var session = new LearningModelSession(model);
// Option 2: request a DirectX (GPU) device explicitly
var device = new LearningModelDevice(LearningModelDeviceKind.DirectX);
var session = new LearningModelSession(model, device);
// Run inference
var binding = new LearningModelBinding(session);
binding.Bind("input", inputTensor);
var results = await session.EvaluateAsync(binding, "");
Configuration Options
Device Selection
import onnxruntime as ort
# Use default GPU (adapter 0)
session = ort.InferenceSession(
"model.onnx",
providers=[('DmlExecutionProvider', {'device_id': 0})]
)
# Use specific GPU (for multi-GPU systems)
session = ort.InferenceSession(
"model.onnx",
providers=[('DmlExecutionProvider', {'device_id': 1})]
)
# High performance mode
session = ort.InferenceSession(
"model.onnx",
providers=[(
'DmlExecutionProvider', {
'device_id': 0,
'performance_preference': 'high_performance' # or 'default', 'minimum_power'
}
)]
)
Device Filtering
# Target specific device types
session = ort.InferenceSession(
"model.onnx",
providers=[(
'DmlExecutionProvider', {
'device_filter': 'gpu' # 'gpu', 'npu', or 'any'
}
)]
)
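The device selection and filtering options above are passed as plain strings, so a typo in a value is easy to miss. The following is a minimal sketch of a builder that validates them before a session is created. The accepted values are taken from the examples above; the helper name `dml_provider` is hypothetical and not part of ONNX Runtime.

```python
# Values copied from the examples in this section.
VALID_PREFERENCES = {"default", "high_performance", "minimum_power"}
VALID_FILTERS = {"gpu", "npu", "any"}


def dml_provider(device_id=None, performance_preference=None, device_filter=None):
    """Build a ('DmlExecutionProvider', options) tuple, rejecting unknown values early."""
    options = {}
    if device_id is not None:
        if device_id < 0:
            raise ValueError("device_id must be non-negative")
        options["device_id"] = device_id
    if performance_preference is not None:
        if performance_preference not in VALID_PREFERENCES:
            raise ValueError(f"unknown performance_preference: {performance_preference!r}")
        options["performance_preference"] = performance_preference
    if device_filter is not None:
        if device_filter not in VALID_FILTERS:
            raise ValueError(f"unknown device_filter: {device_filter!r}")
        options["device_filter"] = device_filter
    return ("DmlExecutionProvider", options)
```

The returned tuple can be placed directly in the `providers` list, for example `providers=[dml_provider(device_filter="gpu")]`.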
Advanced Configuration
C++ Advanced Options
#include <onnxruntime_cxx_api.h>
#include <dml_provider_factory.h>
Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "DirectMLExample");
Ort::SessionOptions session_options;
// Get DirectML API
const OrtDmlApi* dml_api = nullptr;
Ort::ThrowOnError(Ort::GetApi().GetExecutionProviderApi(
"DML", ORT_API_VERSION, reinterpret_cast<const void**>(&dml_api)
));
// Configure device options
OrtDmlDeviceOptions device_options;
device_options.Preference = OrtDmlPerformancePreference::HighPerformance;
device_options.Filter = OrtDmlDeviceFilter::Gpu;
// Append execution provider
Ort::ThrowOnError(dml_api->SessionOptionsAppendExecutionProvider_DML2(
session_options, &device_options
));
Ort::Session session(env, L"model.onnx", session_options);
Custom D3D12 Device
#include <d3d12.h>
#include <DirectML.h>
#include <wrl/client.h>
#include <onnxruntime_cxx_api.h>
#include <dml_provider_factory.h>
// Create custom D3D12 device and command queue (check each HRESULT in production code)
Microsoft::WRL::ComPtr<ID3D12Device> d3d12_device;
D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&d3d12_device));
D3D12_COMMAND_QUEUE_DESC queue_desc = {};
queue_desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
Microsoft::WRL::ComPtr<ID3D12CommandQueue> command_queue;
d3d12_device->CreateCommandQueue(&queue_desc, IID_PPV_ARGS(&command_queue));
// Create DML device
Microsoft::WRL::ComPtr<IDMLDevice> dml_device;
DMLCreateDevice(d3d12_device.Get(), DML_CREATE_DEVICE_FLAG_NONE, IID_PPV_ARGS(&dml_device));
// Use with ONNX Runtime
const OrtDmlApi* dml_api = nullptr;
Ort::ThrowOnError(Ort::GetApi().GetExecutionProviderApi("DML", ORT_API_VERSION,
reinterpret_cast<const void**>(&dml_api)));
Ort::SessionOptions session_options;
Ort::ThrowOnError(dml_api->SessionOptionsAppendExecutionProvider_DML1(
session_options, dml_device.Get(), command_queue.Get()
));
Multi-GPU Support
import onnxruntime as ort
from concurrent.futures import ThreadPoolExecutor
def run_on_gpu(gpu_id, model_path, input_data):
    session = ort.InferenceSession(
        model_path,
        providers=[('DmlExecutionProvider', {'device_id': gpu_id})]
    )
    return session.run(None, input_data)
# Run on multiple GPUs concurrently
with ThreadPoolExecutor(max_workers=2) as executor:
    future1 = executor.submit(run_on_gpu, 0, "model.onnx", input_data_1)
    future2 = executor.submit(run_on_gpu, 1, "model.onnx", input_data_2)
    result1 = future1.result()
    result2 = future2.result()
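The function above creates a new session on every call, so the model is loaded again each time. When the same GPUs serve many requests, one possible refinement is to cache a session per device. The following is a sketch under that assumption; `SessionPool` and `make_dml_session` are hypothetical names, and the factory is injected so the caching logic can be exercised without a GPU.

```python
import threading


class SessionPool:
    """Cache one session per GPU so repeated calls reuse it instead of reloading the model."""

    def __init__(self, model_path, make_session):
        self._model_path = model_path
        self._make_session = make_session  # callable(model_path, gpu_id) -> session
        self._sessions = {}
        self._lock = threading.Lock()

    def get(self, gpu_id):
        # The lock prevents two threads from creating the same session at once.
        with self._lock:
            if gpu_id not in self._sessions:
                self._sessions[gpu_id] = self._make_session(self._model_path, gpu_id)
            return self._sessions[gpu_id]


def make_dml_session(model_path, gpu_id):
    """Session factory that mirrors the multi-GPU example above."""
    import onnxruntime as ort  # imported lazily so SessionPool has no hard dependency

    return ort.InferenceSession(
        model_path,
        providers=[("DmlExecutionProvider", {"device_id": gpu_id})],
    )
```

A worker would then call `pool.get(gpu_id).run(None, input_data)`, where `pool = SessionPool("model.onnx", make_dml_session)`.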
NPU Acceleration
On devices with Neural Processing Units:
import onnxruntime as ort
# Target NPU if available
session = ort.InferenceSession(
"model.onnx",
providers=[(
'DmlExecutionProvider', {
'device_filter': 'npu',
'performance_preference': 'default'
}
)]
)
NPU-Compatible Devices:
- Intel Core Ultra (Meteor Lake) with Intel AI Boost
- AMD Ryzen AI processors
- Qualcomm Snapdragon X Elite/Plus
- Some Surface devices
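Because NPUs are present on only some devices, an application may want to try the NPU first and fall back to the GPU and then the CPU. The following is a minimal sketch of that idea. Whether a missing NPU raises an error or silently falls back to the CPU provider is an assumption here, so the sketch handles both by also checking `get_providers()` on the created session. The names `first_working_session`, `CANDIDATES`, and `make_ort_session` are hypothetical.

```python
def provider_name(entry):
    """Accept either 'Name' or ('Name', options) and return the provider name."""
    return entry if isinstance(entry, str) else entry[0]


def first_working_session(model_path, candidates, make_session):
    """Return (session, providers) for the first candidate that is actually active."""
    for providers in candidates:
        wanted = provider_name(providers[0])
        try:
            session = make_session(model_path, providers)
        except Exception:  # broad on purpose: the failure type is an assumption
            continue
        # Guard against a silent fallback to another provider.
        if wanted in session.get_providers():
            return session, providers
    raise RuntimeError("no candidate provider configuration could be activated")


# Candidate order: NPU first, then GPU, then plain CPU.
CANDIDATES = [
    [("DmlExecutionProvider", {"device_filter": "npu"})],
    [("DmlExecutionProvider", {"device_filter": "gpu"})],
    ["CPUExecutionProvider"],
]


def make_ort_session(model_path, providers):
    """Real factory; imported lazily so the selection logic is testable on its own."""
    import onnxruntime as ort

    return ort.InferenceSession(model_path, providers=providers)
```

Typical use would be `session, chosen = first_working_session("model.onnx", CANDIDATES, make_ort_session)`.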
Memory Management
# For low-memory devices, use smaller batch sizes
session = ort.InferenceSession(
"model.onnx",
providers=['DmlExecutionProvider']
)
input_name = session.get_inputs()[0].name
# Process in smaller batches and keep each batch's outputs
batch_size = 1 # Or 4, 8 depending on GPU memory
all_results = []
for i in range(0, len(inputs), batch_size):
    batch = inputs[i:i+batch_size]
    all_results.append(session.run(None, {input_name: batch}))
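The batching loop above can be factored into a small reusable helper. The following is a minimal sketch in plain Python; `batch_slices` and `run_in_batches` are hypothetical names, and `run_batch` stands in for a call such as `session.run`.

```python
def batch_slices(total, batch_size):
    """Yield slice objects that cover range(total) in chunks of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        yield slice(start, min(start + batch_size, total))


def run_in_batches(run_batch, inputs, batch_size):
    """Call run_batch on each chunk of inputs and return the per-batch results in order."""
    return [run_batch(inputs[s]) for s in batch_slices(len(inputs), batch_size)]
```

With a session this might be called as `run_in_batches(lambda b: session.run(None, {input_name: b}), inputs, 4)`. Slicing works the same way on a NumPy array whose first axis is the batch dimension.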
Session Options
import onnxruntime as ort
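sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# The DirectML EP does not support memory pattern optimization; disable it
sess_options.enable_mem_pattern = False
# The DirectML EP does not support parallel execution; run sequentially
sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
# Enable CPU memory arena
sess_options.enable_cpu_mem_arena = True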
session = ort.InferenceSession(
"model.onnx",
sess_options=sess_options,
providers=['DmlExecutionProvider']
)
Platform Support
| Platform | Architecture | Support |
|---|---|---|
| Windows 11 | x64 | ✅ Full |
| Windows 11 | ARM64 | ✅ Full |
| Windows 10 (1903+) | x64 | ✅ Full |
| Windows 10 (1903+) | ARM64 | ✅ Full |
| Windows Server 2019+ | x64 | ✅ Full |
| Linux | Any | ❌ No |
| macOS | Any | ❌ No |
NVIDIA GPUs
- Good performance for most models
- Consider CUDA/TensorRT for maximum performance
- DirectML useful for cross-vendor compatibility
AMD GPUs
- Excellent choice for AMD GPUs on Windows
- Often best or only option for AMD acceleration
- Good performance on RDNA architecture
Intel GPUs
- Great for Intel integrated and discrete GPUs
- Alternative to OpenVINO on Windows
- Good performance on Arc and Xe GPUs
Qualcomm (Windows on ARM)
- Primary option for GPU acceleration on ARM
- Optimized for Snapdragon processors
- Consider QNN EP for maximum Snapdragon performance
Troubleshooting
Provider Not Available
import onnxruntime as ort
print(ort.get_available_providers())
# If 'DmlExecutionProvider' is missing:
# 1. Check Windows version (need 1903+)
# 2. Verify onnxruntime-directml is installed
# 3. Update GPU drivers
# 4. Ensure DirectX 12 support
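Slow Performance
# List the execution providers active in this session to confirm DirectML is in use
# (this does not identify which GPU was selected)
import onnxruntime as ort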
session = ort.InferenceSession(
"model.onnx",
providers=['DmlExecutionProvider']
)
print(f"Using providers: {session.get_providers()}")
# Try different performance preferences
session = ort.InferenceSession(
"model.onnx",
providers=[(
'DmlExecutionProvider', {
'performance_preference': 'high_performance'
}
)]
)
Out of Memory
# Reduce batch size or model size
# Check GPU memory usage in Task Manager > Performance > GPU
# Use smaller input batches
batch_size = 1
results = session.run(None, {input_name: data[:batch_size]})
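Rather than fixing the batch size up front, one option is to start large and halve it whenever a run fails. The following is a sketch of that strategy. The exception type that ONNX Runtime raises on an out-of-memory condition is not stated here, so the `retry_on` parameter is an assumption the caller must fill in; `run_with_backoff` is a hypothetical helper name.

```python
def run_with_backoff(run_batch, data, batch_size, retry_on=(Exception,)):
    """Run data in batches, halving the batch size whenever run_batch raises retry_on."""
    results = []
    start = 0
    while start < len(data):
        chunk = data[start:start + batch_size]
        try:
            results.append(run_batch(chunk))
        except retry_on:
            if batch_size == 1:
                raise  # cannot shrink further, surface the error
            batch_size = max(1, batch_size // 2)
            continue  # retry the same position with a smaller batch
        start += len(chunk)
    return results, batch_size
```

The function returns the per-batch results together with the batch size that finally worked, which can be reused for later calls.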
Comparison with Other Providers
| Feature | DirectML | CUDA | TensorRT |
|---|---|---|---|
| Vendor Support | All | NVIDIA only | NVIDIA only |
| Setup Complexity | Easy | Moderate | Complex |
| Performance | Good | Better | Best |
| Windows Integration | Excellent | Good | Good |
| ARM Support | Yes | No | No |
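The Performance row of the table suggests a preference order when more than one provider is available. The following is a minimal sketch that ranks whatever `ort.get_available_providers()` reports; the ordering is an assumption derived from the table, and `rank_providers` is a hypothetical helper name. Which providers are reported depends on the ONNX Runtime package that is installed.

```python
# Order mirrors the Performance row: TensorRT, then CUDA, then DirectML, then CPU.
PREFERENCE = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "DmlExecutionProvider",
    "CPUExecutionProvider",
]


def rank_providers(available, preference=PREFERENCE):
    """Return the available providers ordered by the preference list."""
    return [name for name in preference if name in available]
```

The result can be passed straight to `ort.InferenceSession(model_path, providers=rank_providers(ort.get_available_providers()))`.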