QNN Execution Provider
The QNN (Qualcomm Neural Network) Execution Provider enables hardware-accelerated inference on Qualcomm platforms, including Snapdragon mobile processors, IoT devices, and edge compute platforms.
When to Use QNN EP
Use the QNN Execution Provider when:
- You’re deploying on Android devices with Qualcomm Snapdragon processors
- You need to leverage Qualcomm’s AI accelerators (Hexagon DSP, AI Engine)
- You’re building IoT or edge devices with Qualcomm chipsets
- You want optimized inference on Qualcomm compute platforms
- You need low-power, high-performance inference on mobile
Key Features
- Hexagon DSP: Leverage dedicated signal processing hardware
- AI Engine: Access specialized neural network accelerators
- Multi-Core Optimization: Utilize multiple compute units efficiently
- Low Power: Optimized for battery-powered devices
- Quantization Support: INT8 and FP16 precision modes
- Android Integration: Seamless deployment on Android devices
Prerequisites
Hardware Requirements
Supported Chipsets:
- Snapdragon 8 Gen 2/3 (flagship smartphones)
- Snapdragon 7 Series (upper mid-range)
- Snapdragon 6 Series (mid-range)
- Snapdragon 8cx Gen 3 (Windows on ARM)
- Qualcomm IoT and Edge platforms
Recommended:
- Snapdragon 888 or newer for best performance
- Devices with Hexagon 698 DSP or newer
Software Requirements
- Qualcomm AI Engine Direct SDK (QNN SDK)
- Android NDK (for Android deployment)
- ONNX Runtime with QNN support
- Android API Level 29+ (Android 10+)
Installation
Android (Java/Kotlin)
// app/build.gradle
dependencies {
implementation 'com.microsoft.onnxruntime:onnxruntime-android:1.17.0'
}
Android (Native C++)
# CMakeLists.txt
add_library(onnxruntime SHARED IMPORTED)
set_target_properties(onnxruntime PROPERTIES
IMPORTED_LOCATION ${ONNXRUNTIME_LIB_DIR}/libonnxruntime.so
)
target_link_libraries(your_app
onnxruntime
)
Python (Linux/Development)
# Install ONNX Runtime
# Note: the default onnxruntime wheel does not include the QNN provider;
# QNN support requires a QNN-enabled build (see Build from Source)
pip install onnxruntime
# Download Qualcomm QNN SDK
# https://developer.qualcomm.com/software/qualcomm-neural-processing-sdk
Build from Source
# Clone ONNX Runtime
git clone https://github.com/microsoft/onnxruntime.git
cd onnxruntime
# Set QNN SDK path
export QNN_SDK_ROOT=/path/to/qnn-sdk
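# Build with QNN support for Android
# Android builds also need the SDK and NDK locations; ANDROID_HOME and
# ANDROID_NDK_HOME are assumed here to point at your installations
./build.sh --config Release \
    --android \
    --android_sdk_path "$ANDROID_HOME" \
    --android_ndk_path "$ANDROID_NDK_HOME" \
    --android_abi arm64-v8a \
    --android_api 29 \
    --use_qnn \
    --qnn_home "$QNN_SDK_ROOT" \
    --build_shared_lib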
Basic Usage
Java/Kotlin (Android)
import ai.onnxruntime.*
import java.nio.FloatBuffer
// Create session options
val sessionOptions = OrtSession.SessionOptions()
// Add the QNN provider with its provider options
sessionOptions.addQnn(mapOf("backend_path" to "libQnnHtp.so"))
// Create environment and session
val env = OrtEnvironment.getEnvironment()
val session = env.createSession(
context.assets.open("model.onnx").readBytes(),
sessionOptions
)
// Prepare input
val inputName = session.inputNames.iterator().next()
val inputShape = longArrayOf(1, 3, 224, 224)
val inputBuffer = FloatArray(1 * 3 * 224 * 224) { /* fill with data */ }
val inputTensor = OnnxTensor.createTensor(
env,
FloatBuffer.wrap(inputBuffer),
inputShape
)
// Run inference
val inputs = mapOf(inputName to inputTensor)
val outputs = session.run(inputs)
// Get result: read the first output tensor as a FloatBuffer
val output = (outputs[0] as OnnxTensor).floatBuffer
// Clean up
inputTensor.close()
outputs.close()
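The Kotlin example above leaves filling the input buffer as a placeholder. The input shape (1, 3, 224, 224) suggests a channels-first (NCHW) layout, whereas decoded images are usually interleaved (HWC). The following is a minimal sketch of that layout conversion, written in Python for brevity. The helper name `hwc_to_chw` is hypothetical, and scaling to [0, 1] is an assumption; your model may expect a different normalization.

```python
from typing import List, Sequence


def hwc_to_chw(pixels: Sequence[int], height: int, width: int,
               channels: int = 3) -> List[float]:
    """Convert interleaved HWC bytes to a planar CHW float list in [0, 1]."""
    if len(pixels) != height * width * channels:
        raise ValueError("pixel count does not match height * width * channels")
    out = [0.0] * len(pixels)
    for y in range(height):
        for x in range(width):
            for c in range(channels):
                src = (y * width + x) * channels + c      # interleaved index
                dst = c * height * width + y * width + x  # planar index
                out[dst] = pixels[src] / 255.0            # assumed scaling
    return out


# 2x2 image: red, green, blue, white
image = [255, 0, 0,  0, 255, 0,  0, 0, 255,  255, 255, 255]
planar = hwc_to_chw(image, height=2, width=2)
print(planar)
```

The resulting flat list can be copied into the `FloatArray` that backs the input tensor.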
C++ (Android NDK)
#include <onnxruntime_cxx_api.h>
Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "QNNExample");
Ort::SessionOptions session_options;
// Configure QNN provider
std::unordered_map<std::string, std::string> qnn_options;
qnn_options["backend_path"] = "libQnnHtp.so"; // Hexagon backend
qnn_options["qnn_context_priority"] = "high";
session_options.AppendExecutionProvider("QNN", qnn_options);
// Create session
Ort::Session session(env, "model.onnx", session_options);
// Run inference (input_names, input_tensor, and output_names are assumed
// to have been prepared by the caller)
auto output_tensors = session.Run(Ort::RunOptions{nullptr},
input_names.data(),
&input_tensor, 1,
output_names.data(), 1);
Python (Linux)
import onnxruntime as ort
import numpy as np
# Create session with QNN provider
session = ort.InferenceSession(
"model.onnx",
providers=[
('QNNExecutionProvider', {
'backend_path': 'libQnnHtp.so',
'qnn_context_priority': 'high'
}),
'CPUExecutionProvider'
]
)
# Prepare input
input_name = session.get_inputs()[0].name
x = np.random.randn(1, 3, 224, 224).astype(np.float32)
# Run inference
results = session.run(None, {input_name: x})
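Because the QNN provider is only present in QNN-enabled builds, it can help to build the provider list from what the installed runtime reports. The sketch below shows one way to do that; `build_providers` is a hypothetical helper, and it takes the list of available providers as an argument so that it runs without ONNX Runtime installed.

```python
from typing import Dict, List, Tuple, Union

Provider = Union[str, Tuple[str, Dict[str, str]]]


def build_providers(available: List[str],
                    backend_path: str = "libQnnHtp.so") -> List[Provider]:
    """Return a provider list that prefers QNN and always ends with CPU."""
    providers: List[Provider] = []
    if "QNNExecutionProvider" in available:
        providers.append(("QNNExecutionProvider", {"backend_path": backend_path}))
    providers.append("CPUExecutionProvider")
    return providers


# With ONNX Runtime installed, `available` would come from
# onnxruntime.get_available_providers().
with_qnn = build_providers(["QNNExecutionProvider", "CPUExecutionProvider"])
cpu_only = build_providers(["CPUExecutionProvider"])
print(with_qnn)
print(cpu_only)
```

The returned list can be passed directly as the `providers` argument of `ort.InferenceSession`.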
Configuration Options
Backend Selection
QNN supports multiple hardware backends:
// Hexagon DSP (best performance)
qnn_options["backend_path"] = "libQnnHtp.so";
// CPU backend (fallback, debugging)
qnn_options["backend_path"] = "libQnnCpu.so";
// GPU backend
qnn_options["backend_path"] = "libQnnGpu.so";
// Kotlin
val options = mapOf(
"backend_path" to "libQnnHtp.so"
)
sessionOptions.addQnn(options)
Priority Settings
// High priority for latency-critical tasks
qnn_options["qnn_context_priority"] = "high";
// Normal priority (default)
qnn_options["qnn_context_priority"] = "normal";
// Low priority for background tasks
qnn_options["qnn_context_priority"] = "low";
Profiling
// Enable profiling for performance analysis
qnn_options["profiling_level"] = "basic"; // or "detailed"
qnn_options["enable_htp_fp16_precision"] = "1"; // Not a profiling option: enables FP16 mode (see FP16 Precision)
Advanced Options
std::unordered_map<std::string, std::string> qnn_options;
// Backend configuration
qnn_options["backend_path"] = "libQnnHtp.so";
qnn_options["qnn_context_priority"] = "high";
// Performance tuning
qnn_options["enable_htp_fp16_precision"] = "1";
qnn_options["htp_performance_mode"] = "burst"; // sustained_high_performance, burst, power_saver, balanced
// Weight sharing across contexts
qnn_options["enable_htp_weight_sharing"] = "1";
// Debugging
qnn_options["qnn_saver_path"] = "libQnnSaver.so"; // QNN Saver library; dumps QNN API calls for replay
qnn_options["profiling_level"] = "basic";
qnn_options["rpc_control_latency"] = "100"; // microseconds
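Provider options are passed as plain strings, so a typo in a value may only surface when the session is created. The following sketch checks an options dictionary before use. The allowed values are transcribed from the listings in this section only; the real provider may accept more, and `validate_qnn_options` is a hypothetical helper, not part of ONNX Runtime.

```python
from typing import Dict

# Values listed in this section; treat this table as an assumption, not as
# the provider's complete set of accepted values.
ALLOWED_VALUES = {
    "qnn_context_priority": {"low", "normal", "high"},
    "htp_performance_mode": {
        "burst", "sustained_high_performance", "balanced", "power_saver",
    },
    "profiling_level": {"basic", "detailed"},
    "enable_htp_fp16_precision": {"0", "1"},
}


def validate_qnn_options(options: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of options, raising on non-string or unlisted values."""
    for key, value in options.items():
        if not isinstance(value, str):
            raise TypeError(f"{key}: value must be a string, got {type(value).__name__}")
        allowed = ALLOWED_VALUES.get(key)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{key}: {value!r} is not one of {sorted(allowed)}")
    return dict(options)


checked = validate_qnn_options({
    "backend_path": "libQnnHtp.so",
    "htp_performance_mode": "burst",
    "rpc_control_latency": "100",
})
print(checked)
```

Keys that are not in the table, such as `backend_path`, pass through unchecked.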
Quantization
QNN performs best with quantized models:
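import onnxruntime as ort
from onnxruntime.quantization import QuantType, quantize_dynamic
# Quantize model weights to INT8
# Note: the HTP backend generally expects statically quantized (QDQ) models;
# verify that a dynamically quantized model is actually assigned to QNN
quantize_dynamic(
    "model.onnx",
    "model_int8.onnx",
    weight_type=QuantType.QInt8
)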
# Use quantized model with QNN
session = ort.InferenceSession(
"model_int8.onnx",
providers=[('QNNExecutionProvider', {
'backend_path': 'libQnnHtp.so'
})]
)
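To make concrete what INT8 quantization does to a tensor, here is a small worked sketch of symmetric per-tensor quantization. This is one common scheme, shown for illustration; it is not necessarily the exact scheme the ONNX Runtime quantizer or the QNN backend applies, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_symmetric(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


weights = [-1.27, -0.5, 0.0, 0.25, 1.27]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per value is bounded by half the scale, which is why tensors with a narrow value range lose less accuracy than tensors with large outliers.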
Performance Modes
// Maximum performance (high power)
qnn_options["htp_performance_mode"] = "burst";
// Sustained high performance
qnn_options["htp_performance_mode"] = "sustained_high_performance";
// Balanced (default)
qnn_options["htp_performance_mode"] = "balanced";
// Power saving
qnn_options["htp_performance_mode"] = "power_saver";
Context Caching
Save compiled contexts for faster initialization:
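// First run: compile the model and write a context cache model.
// Context caching is configured through session config entries
// (ONNX Runtime 1.17+), not through QNN provider options.
session_options.AddConfigEntry("ep.context_enable", "1");
session_options.AddConfigEntry("ep.context_file_path", "/data/local/tmp/model_ctx.onnx");
Ort::Session session(env, "model.onnx", session_options); // Context is generated during session creation
// Subsequent runs: load the cached context model directly (much faster)
Ort::SessionOptions cached_options;
cached_options.AppendExecutionProvider("QNN", qnn_options);
Ort::Session cached_session(env, "/data/local/tmp/model_ctx.onnx", cached_options);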
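The first-run versus cached-run decision can be sketched as a small helper that picks which file to load. This is a sketch under assumptions: it uses the session config keys `ep.context_enable` and `ep.context_file_path` from recent ONNX Runtime releases, the helper name `plan_context_cache` is hypothetical, and the file names are illustrative.

```python
import os
import tempfile
from typing import Dict, Tuple


def plan_context_cache(model_path: str,
                       context_path: str) -> Tuple[str, Dict[str, str]]:
    """Return the file to load and the session config entries to set."""
    if os.path.isfile(context_path):
        # A cached context model exists: load it, no generation needed.
        return context_path, {}
    # First run: load the original model and ask for a context model
    # to be written to context_path.
    return model_path, {
        "ep.context_enable": "1",
        "ep.context_file_path": context_path,
    }


with tempfile.TemporaryDirectory() as tmp:
    model = os.path.join(tmp, "model.onnx")
    ctx = os.path.join(tmp, "model_ctx.onnx")
    first = plan_context_cache(model, ctx)
    open(ctx, "wb").close()  # simulate the context file written on first run
    second = plan_context_cache(model, ctx)

print(first[1])
print(second[1])
```

In Python, each returned entry would be applied with `SessionOptions.add_session_config_entry(key, value)` before creating the session from the returned path.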
FP16 Precision
Run float32 models with FP16 precision on the HTP backend for better performance:
qnn_options["enable_htp_fp16_precision"] = "1";
Android Integration
Complete Android Example
import ai.onnxruntime.*
import android.content.Context
import kotlinx.coroutines.*
class ModelInference(private val context: Context) {
private lateinit var env: OrtEnvironment
private lateinit var session: OrtSession
suspend fun initialize() = withContext(Dispatchers.IO) {
env = OrtEnvironment.getEnvironment()
val sessionOptions = OrtSession.SessionOptions().apply {
// Configure QNN
val qnnOptions = mapOf(
"backend_path" to "libQnnHtp.so",
"qnn_context_priority" to "high",
"enable_htp_fp16_precision" to "1"
)
addQnn(qnnOptions)
// Additional optimizations
setIntraOpNumThreads(4)
setOptimizationLevel(OrtSession.SessionOptions.OptLevel.ALL_OPT)
}
// Load model from assets
val modelBytes = context.assets.open("model.onnx").readBytes()
session = env.createSession(modelBytes, sessionOptions)
}
suspend fun runInference(input: FloatArray): FloatArray = withContext(Dispatchers.Default) {
val inputName = session.inputNames.first()
val inputShape = longArrayOf(1, 3, 224, 224)
// Create input tensor
val inputTensor = OnnxTensor.createTensor(
env,
java.nio.FloatBuffer.wrap(input),
inputShape
)
// Run inference
val outputs = session.run(mapOf(inputName to inputTensor))
// Extract result: read the first output tensor as a FloatBuffer
val output = (outputs[0] as OnnxTensor).floatBuffer
val result = FloatArray(output.remaining())
output.get(result)
// Clean up
inputTensor.close()
outputs.close()
result
}
fun close() {
session.close()
env.close()
}
}
Permissions (AndroidManifest.xml)
<!-- No special permissions required for QNN -->
<!-- Optional: for loading models from external storage -->
<uses-permission android:name="android.permission.READ_EXTERNAL_STORAGE" />
Asset Packaging
// app/build.gradle
android {
// ... other config ...
aaptOptions {
noCompress "onnx"
}
}
Platform Support
| Platform | Architecture | Support | Notes |
|---|---|---|---|
| Android | ARM64 | ✅ Full | Primary platform |
| Android | ARMv7 | ⚠️ Limited | Older devices |
| Linux | ARM64 | ✅ Limited | Development/testing |
| Windows on ARM | ARM64 | ✅ Limited | Snapdragon PCs |
| Linux | x64 | ❌ No | Use CPU/CUDA instead |
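The support table can be turned into a quick pre-flight check. The sketch below transcribes the rows above into a lookup; `qnn_support_level` and the architecture aliases are hypothetical. On Android, Python's `platform.system()` may report "Linux", so the system name is passed in explicitly.

```python
import platform

# (system, architecture) -> support level, transcribed from the table above.
SUPPORT = {
    ("android", "arm64"): "full",
    ("android", "armv7"): "limited",
    ("linux", "arm64"): "limited",
    ("windows", "arm64"): "limited",
    ("linux", "x64"): "none",
}

# Common machine strings mapped to the table's architecture names (assumed).
ALIASES = {
    "aarch64": "arm64",
    "arm64": "arm64",
    "armv7l": "armv7",
    "x86_64": "x64",
    "amd64": "x64",
}


def qnn_support_level(system: str, machine: str) -> str:
    """Look up the documented support level, or 'unknown' if not listed."""
    arch = ALIASES.get(machine.lower(), machine.lower())
    return SUPPORT.get((system.lower(), arch), "unknown")


print(qnn_support_level(platform.system(), platform.machine()))
```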
Supported Chipsets
Flagship
- Snapdragon 8 Gen 3
- Snapdragon 8 Gen 2
- Snapdragon 888/888+
- Snapdragon 8+ Gen 1
Upper Mid-Range
- Snapdragon 7 Gen 1/2
- Snapdragon 778G/782G
- Snapdragon 870
Mid-Range
- Snapdragon 695/690
- Snapdragon 6 Gen 1
Edge/IoT
- Snapdragon 660/665
- Qualcomm IoT platforms
Troubleshooting
Provider Not Available
// Check if QNN is available
val providers = OrtEnvironment.getAvailableProviders()
if (OrtProvider.QNN !in providers) {
Log.w("QNN", "QNN provider not available")
// Fallback to CPU
}
Backend Loading Errors
try {
    val options = mapOf("backend_path" to "libQnnHtp.so")
    sessionOptions.addQnn(options)
} catch (e: Exception) {
    Log.e("QNN", "Failed to load QNN backend: ${e.message}")
    // Try CPU backend as fallback
    val cpuOptions = mapOf("backend_path" to "libQnnCpu.so")
    sessionOptions.addQnn(cpuOptions)
}
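The same fallback idea generalizes to an ordered list of backends. The sketch below tries each backend library in turn and returns the first session that can be created. `create_with_fallback` is a hypothetical helper, and the session factory is injected so the example runs without a device. Whether a failed backend raises or silently falls back to CPU depends on the ONNX Runtime version and settings, so treat this as a pattern and not as guaranteed behavior.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

S = TypeVar("S")

DEFAULT_BACKENDS = ("libQnnHtp.so", "libQnnGpu.so", "libQnnCpu.so")


def create_with_fallback(make_session: Callable[[str], S],
                         backends: Iterable[str] = DEFAULT_BACKENDS) -> Tuple[str, S]:
    """Return (backend, session) for the first backend that loads."""
    last_error: Optional[Exception] = None
    for backend in backends:
        try:
            return backend, make_session(backend)
        except Exception as exc:  # keep the error and try the next backend
            last_error = exc
    raise RuntimeError("no QNN backend could be loaded") from last_error


def fake_factory(backend: str) -> str:
    """Stand-in for session creation: pretends the HTP backend is missing."""
    if backend == "libQnnHtp.so":
        raise OSError("backend library not found")
    return f"session({backend})"


chosen, session = create_with_fallback(fake_factory)
print(chosen, session)
```

With ONNX Runtime installed, the factory could be a function that calls `ort.InferenceSession("model.onnx", providers=[("QNNExecutionProvider", {"backend_path": backend})])`.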
Performance Issues
// Enable profiling to identify bottlenecks
qnn_options["profiling_level"] = "detailed";
qnn_options["enable_htp_fp16_precision"] = "1";
qnn_options["htp_performance_mode"] = "burst";
// Check logs for performance hints
// adb logcat | grep QNN
Context Save/Load Errors
# Ensure directory has correct permissions
adb shell mkdir -p /data/local/tmp/qnn_context
adb shell chmod 777 /data/local/tmp/qnn_context
# Check available space
adb shell df /data/local/tmp
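The same checks can be scripted on a Linux development host. The sketch below reports problems with a context directory; `check_context_dir` is a hypothetical helper and the 50 MiB free-space threshold is an arbitrary assumption, not a QNN requirement.

```python
import os
import shutil
import tempfile
from typing import List


def check_context_dir(path: str,
                      min_free_bytes: int = 50 * 1024 * 1024) -> List[str]:
    """Return a list of problems found with the context directory."""
    if not os.path.isdir(path):
        return ["missing directory"]
    problems = []
    if not os.access(path, os.W_OK):
        problems.append("not writable")
    if shutil.disk_usage(path).free < min_free_bytes:
        problems.append("low disk space")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    existing = check_context_dir(tmp, min_free_bytes=0)
    missing = check_context_dir(os.path.join(tmp, "does_not_exist"))

print(existing)
print(missing)
```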
Typical performance on Snapdragon 888:
| Configuration | Latency | Power | Notes |
|---|---|---|---|
| CPU Only | 80ms | High | Baseline |
| QNN (FP32) | 15ms | Medium | Good |
| QNN (FP16) | 8ms | Low | Better |
| QNN (INT8) | 4ms | Very Low | Best |
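The latencies in the table translate into speedups over the CPU baseline as follows. This is plain arithmetic on the figures above; the throughput column assumes single-stream, back-to-back inference with no other overhead.

```python
# Latencies in milliseconds, copied from the table above.
latency_ms = {
    "CPU Only": 80.0,
    "QNN (FP32)": 15.0,
    "QNN (FP16)": 8.0,
    "QNN (INT8)": 4.0,
}

baseline = latency_ms["CPU Only"]
speedup = {name: baseline / ms for name, ms in latency_ms.items()}
throughput = {name: 1000.0 / ms for name, ms in latency_ms.items()}

for name in latency_ms:
    print(f"{name}: {speedup[name]:.1f}x speedup, "
          f"{throughput[name]:.1f} inferences/s")
```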
Best Practices
- Use Quantization: INT8 models run roughly 2-4x faster than FP16 or FP32 models on QNN (see the table above)
- Cache Contexts: Save compiled contexts to reduce init time
- Enable FP16: Usually minimal accuracy impact, significant speedup over FP32
- Profile First: Use profiling to identify bottlenecks
- Test on Device: Performance varies by chipset generation