Deep dive into InferenceSession configuration and lifecycle management
The InferenceSession is the primary interface for running ONNX models in ONNX Runtime. It manages model loading, optimization, initialization, and execution.
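Before tuning anything, here is a minimal load-and-run sketch. The input shape is illustrative, and ORT_ENABLE_ALL is already the default optimization level; query the model's real input and output names rather than hard-coding them:

```python
import onnxruntime as ort
import numpy as np

sess_options = ort.SessionOptions()
# Graph optimizations (constant folding, node fusion) are applied at load time;
# ORT_ENABLE_ALL is the default and usually what you want
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

session = ort.InferenceSession(
    "model.onnx", sess_options, providers=["CPUExecutionProvider"]
)

# Query the model's actual input/output names instead of hard-coding them
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)  # illustrative shape
outputs = session.run([output_name], {input_name: input_data})
```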
Two thread pools control CPU parallelism, configured through SessionOptions:

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()

# Intra-op threads: parallelize within operators
# Good for: matrix multiplication, convolutions
sess_options.intra_op_num_threads = 4

# Inter-op threads: execute independent operators in parallel
# Good for: models with many parallel branches
sess_options.inter_op_num_threads = 2
```
CPU-bound Models

```python
# Use more intra-op threads
sess_options.intra_op_num_threads = 8
sess_options.inter_op_num_threads = 1
```

Complex Graphs

```python
# Balance both thread pools
sess_options.intra_op_num_threads = 4
sess_options.inter_op_num_threads = 4
```

Single-threaded

```python
# Force single-threaded execution
sess_options.intra_op_num_threads = 1
sess_options.inter_op_num_threads = 1
sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
```
Setting too many threads can hurt performance due to context switching and cache contention. Start with the number of physical cores.
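As a concrete starting point, the intra-op pool can be sized from the machine. A sketch; note that os.cpu_count() reports logical cores, so halving it is only a rough stand-in for physical cores on SMT systems (psutil.cpu_count(logical=False) is more precise when psutil is available):

```python
import os

import onnxruntime as ort

sess_options = ort.SessionOptions()

# os.cpu_count() counts logical cores; assume 2-way SMT as a rough heuristic
physical_cores = max(1, (os.cpu_count() or 2) // 2)
sess_options.intra_op_num_threads = physical_cores
sess_options.inter_op_num_threads = 1
```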
With a CUDA session, IOBinding lets you control where inputs and outputs live, avoiding device-to-host copies between runs:

```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider']
)
io_binding = session.io_binding()

# Bind input from CPU memory (ORT copies it to the GPU at run time)
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
io_binding.bind_cpu_input('input', input_data)

# Keep the output on GPU
io_binding.bind_output('output', 'cuda')

# Run on GPU
session.run_with_iobinding(io_binding)

# Output stays on GPU - efficient for chaining further GPU operations
ort_value = io_binding.get_outputs()[0]

# Copy to CPU only when needed
output = ort_value.numpy()
```
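To genuinely keep the input resident on the GPU across runs, bind a device-side OrtValue instead of CPU memory. A sketch using OrtValue.ortvalue_from_numpy and bind_ortvalue_input; the one-time host-to-device copy then amortizes over repeated runs:

```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession(
    "model.onnx", providers=["CUDAExecutionProvider"]
)
io_binding = session.io_binding()

# Copy the array to GPU memory once; reuse the device tensor across runs
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
gpu_input = ort.OrtValue.ortvalue_from_numpy(input_data, "cuda", 0)

io_binding.bind_ortvalue_input("input", gpu_input)
io_binding.bind_output("output", "cuda")
session.run_with_iobinding(io_binding)

# Results can stay on GPU; copy back only at the end
output = io_binding.get_outputs()[0].numpy()
```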
IOBinding can also write results directly into a pre-allocated CPU buffer, avoiding a per-run output allocation and copy:

```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model.onnx")
io_binding = session.io_binding()

# Example input (shape depends on the model)
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Pre-allocate output buffer
output_shape = (1, 1000)  # known output shape
output_buffer = np.empty(output_shape, dtype=np.float32)

# Bind input and the pre-allocated output buffer
io_binding.bind_cpu_input('input', input_data)
io_binding.bind_output(
    'output',
    device_type='cpu',
    device_id=0,
    element_type=np.float32,
    shape=output_shape,
    buffer_ptr=output_buffer.ctypes.data,
)

session.run_with_iobinding(io_binding)
# Result is now in output_buffer, no extra copy needed
```
Enable profiling to see where execution time goes:

```python
import onnxruntime as ort
import numpy as np

sess_options = ort.SessionOptions()
sess_options.enable_profiling = True

session = ort.InferenceSession("model.onnx", sess_options)

# Example input (shape depends on the model)
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Run inference
for _ in range(100):
    outputs = session.run(["output"], {"input": input_data})

# Get profiling results
prof_file = session.end_profiling()
print(f"Profiling data saved to: {prof_file}")
```
View the profiling JSON file in Chrome’s tracing viewer (chrome://tracing) for detailed performance analysis.
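Alternatively, the file is plain Chrome-trace JSON, so a quick summary takes only a few lines. A sketch, assuming the event fields ('cat', 'name', 'dur') of ORT's Chrome-trace output; it reuses prof_file from the snippet above:

```python
import json
from collections import defaultdict

with open(prof_file) as f:
    events = json.load(f)

# Sum per-node durations (microseconds); 'Node' events cover operator execution
totals = defaultdict(int)
for ev in events:
    if ev.get("cat") == "Node":
        totals[ev["name"]] += ev.get("dur", 0)

# Print the ten most expensive nodes
for name, dur in sorted(totals.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{name}: {dur} us")
```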
Creating a session is expensive. Reuse the same session for multiple inferences:
```python
import onnxruntime as ort

# dataset: an iterable of input arrays

# Good: reuse one session
session = ort.InferenceSession("model.onnx")
for data in dataset:
    outputs = session.run(["output"], {"input": data})

# Bad: create a session in the loop
for data in dataset:
    session = ort.InferenceSession("model.onnx")  # DON'T DO THIS
    outputs = session.run(["output"], {"input": data})
```
Thread Safety
Sessions are thread-safe for inference:
```python
import concurrent.futures

import onnxruntime as ort

session = ort.InferenceSession("model.onnx")

def run_inference(data):
    return session.run(["output"], {"input": data})

# Safe to call run() on one session from multiple threads
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(run_inference, dataset))
```
Use IOBinding for Performance
When running many inferences in a loop, IOBinding avoids re-allocating input and output buffers on every call:
```python
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
io_binding = session.io_binding()

for data in dataset:
    io_binding.bind_cpu_input('input', data)
    io_binding.bind_output('output')
    session.run_with_iobinding(io_binding)
    output = io_binding.copy_outputs_to_cpu()[0]
    # ... process output ...
    io_binding.clear_binding_inputs()
    io_binding.clear_binding_outputs()
```
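Combining the two practices above: session.run_with_iobinding() can be called from multiple threads, but an IOBinding instance carries mutable bound-tensor state, so a conservative pattern (my assumption, not a documented guarantee) is one binding per thread, for example via threading.local:

```python
import concurrent.futures
import threading

import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
tls = threading.local()

def run_inference(data):
    # One IOBinding per thread: a binding holds mutable input/output state
    if not hasattr(tls, "binding"):
        tls.binding = session.io_binding()
    binding = tls.binding
    binding.bind_cpu_input("input", data)
    binding.bind_output("output")
    session.run_with_iobinding(binding)
    out = binding.copy_outputs_to_cpu()[0]
    binding.clear_binding_inputs()
    binding.clear_binding_outputs()
    return out

# dataset: an iterable of input arrays, as above
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(run_inference, dataset))
```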