03. ONNX Runtime
ONNX Runtime provides a hardware-agnostic inference engine that abstracts across CPU, GPU, and NPU backends. For edge deployment, the runtime's execution provider abstraction enables hardware acceleration without model modification: the same .onnx file runs on whichever backend the session is configured to use.
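Installation on a Raspberry Pi relies on a pre-built wheel that matches both the ARM64 (aarch64) architecture and the installed Python version. On a 64-bit OS, pip selects the matching wheel automatically: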
pip install onnxruntime
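Before installing, it can help to confirm which wheel pip will look for. The sketch below uses only the standard library to report the CPU architecture and Python version. The helper name `describe_wheel_target` is illustrative and not part of ONNX Runtime. On a 64-bit Raspberry Pi OS, `platform.machine()` is expected to return `aarch64`; a 32-bit OS typically reports `armv7l`, for which a matching pre-built wheel may not be available.

```python
import platform
import sys


def describe_wheel_target():
    """Return the facts pip uses to pick a wheel: CPU architecture and Python version."""
    machine = platform.machine()  # e.g. 'aarch64' on 64-bit Raspberry Pi OS
    return {
        "machine": machine,
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        # Assumption: 'aarch64' and 'arm64' are the names a 64-bit ARM system reports
        "is_arm64": machine.lower() in ("aarch64", "arm64"),
    }


if __name__ == "__main__":
    print(describe_wheel_target())
```

If `is_arm64` is false on the Pi, the likely cause is a 32-bit OS image rather than a problem with pip itself.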
Execution providers determine where the model actually runs. By default, ONNX Runtime uses the CPU execution provider; hardware acceleration requires selecting a provider explicitly, and that provider must be compiled into the installed build:
import onnxruntime as ort
# List available providers
print(ort.get_available_providers())
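# Illustrative output from a build with ARM NN and QNN enabled; the stock pip
# wheel will not list these accelerator providers:
# ['CPUExecutionProvider', 'ARMNNExecutionProvider', 'QNNExecutionProvider']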
# Create session with specific provider
options = ort.SessionOptions()
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# CPU with optimization
session_cpu = ort.InferenceSession(
    "model.onnx",
    sess_options=options,
    providers=[('CPUExecutionProvider', {'arena_extend_strategy': 'kSameAsRequested'})]
)
# ARM NN for hardware acceleration (requires libarmnn)
# session_armnn = ort.InferenceSession(
#     "model.onnx",
#     sess_options=options,
#     providers=[('ARMNNExecutionProvider', {'backends': ['CpuAcc', 'GpuAcc']})]
# )
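Requesting a provider that the installed build does not include is a common source of confusion, so it can be safer to filter a preference list against what is actually available. The following is a minimal sketch of that idea; `select_providers` is a hypothetical helper, not an ONNX Runtime function, and it takes plain lists so it can be tried without a model file.

```python
def select_providers(preferred, available):
    """Keep the preferred providers that are available, in preference order.

    CPUExecutionProvider is appended as a final fallback if not already chosen.
    """
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


# Hypothetical usage with a real session (requires onnxruntime and a model file):
#   import onnxruntime as ort
#   providers = select_providers(
#       ["ARMNNExecutionProvider", "CPUExecutionProvider"],
#       ort.get_available_providers(),
#   )
#   session = ort.InferenceSession("model.onnx", providers=providers)
#   print(session.get_providers())  # providers the session actually uses

# On a CPU-only build the accelerator entry is dropped:
print(select_providers(
    ["ARMNNExecutionProvider", "CPUExecutionProvider"],
    ["CPUExecutionProvider"],
))  # ['CPUExecutionProvider']
```

Checking `session.get_providers()` after construction confirms which providers ended up attached to the session.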
Inference execution follows a straightforward pattern:
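import time

import numpy as np

def run_inference(session, input_data):
    input_name = session.get_inputs()[0].name
    output_name = session.get_outputs()[0].name
    # Ensure correct dtype; the shape must already match the model's input
    input_tensor = np.asarray(input_data, dtype=np.float32)
    outputs = session.run([output_name], {input_name: input_tensor})
    return outputs[0]

# Timing measurement
# input_data is a placeholder: supply a sample matching the model's input shape
input_name = session_cpu.get_inputs()[0].name
input_tensor = np.asarray(input_data, dtype=np.float32)
session_cpu.run(None, {input_name: input_tensor})  # warm-up run, excluded from timing
start = time.perf_counter()
for _ in range(100):
    result = session_cpu.run(None, {input_name: input_tensor})
elapsed = (time.perf_counter() - start) / 100
print(f"Average inference time: {elapsed*1000:.2f}ms")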
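An average alone can hide occasional slow runs, which matter on a thermally constrained board. As a rough sketch, the helper below times any zero-argument callable and reports the median and a nearest-rank 95th percentile alongside the mean. The name `benchmark` and its parameters are assumptions for illustration, and it uses only the standard library.

```python
import statistics
import time


def benchmark(fn, warmup=5, runs=100):
    """Time fn() per call and return latency statistics in milliseconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    for _ in range(warmup):
        fn()  # warm-up calls are excluded from the statistics
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank percentile: index = ceil(0.95 * runs) - 1, in integer arithmetic
    p95_index = (95 * runs + 99) // 100 - 1
    return {
        "runs": runs,
        "mean_ms": statistics.fmean(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


# Hypothetical usage with the session from this chapter:
#   stats = benchmark(lambda: session_cpu.run(None, {input_name: input_tensor}))
#   print(stats)
```

Timing each call separately adds a small overhead per iteration, so for very fast models the mean may read slightly higher than a single timed loop.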
A frequent failure involves tensor shape mismatches. ONNX models store the expected input shapes in the model graph, so printing them before inference confirms compatibility. Dynamic dimensions appear as a symbolic name or None rather than an integer:
for input_meta in session_cpu.get_inputs():
    print(f"Input: {input_meta.name}, shape: {input_meta.shape}, dtype: {input_meta.type}")
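The printed shapes can also be checked programmatically before calling `run`. The sketch below compares a model's reported shape against a concrete array shape and treats any non-integer dimension as dynamic. `shape_mismatch` is a hypothetical helper, and the 224x224 image shape is only an example value.

```python
def shape_mismatch(expected, actual):
    """Return a description of the first mismatch, or None if the shapes are compatible.

    expected: shape as reported by session.get_inputs()[i].shape; fixed dimensions
              are ints, dynamic dimensions appear as strings (e.g. 'batch') or None.
    actual:   concrete shape of the array about to be fed, e.g. input_tensor.shape.
    """
    if len(expected) != len(actual):
        return f"rank mismatch: model expects {len(expected)} dims, got {len(actual)}"
    for axis, (want, got) in enumerate(zip(expected, actual)):
        # Only fixed (integer) dimensions are compared; dynamic ones accept any size
        if isinstance(want, int) and want != got:
            return f"axis {axis}: model expects {want}, got {got}"
    return None


print(shape_mismatch(["batch", 3, 224, 224], (1, 3, 224, 224)))  # None
print(shape_mismatch(["batch", 3, 224, 224], (1, 224, 224, 3)))
# axis 1: model expects 3, got 224
```

The second call shows a typical channels-last versus channels-first mix-up, which is reported by axis instead of surfacing as a runtime error.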
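Memory profiling requires manual tracking because ONNX Runtime's Python API doesn't report peak memory usage directly. Note that tracemalloc only sees allocations made through Python's allocators (including NumPy arrays), not the native buffers ONNX Runtime allocates in C++, so the figure below is a lower bound on the session's real footprint: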
import tracemalloc

# Reuses session_cpu, input_name, and input_tensor from the timing example
tracemalloc.start()
result = session_cpu.run(None, {input_name: input_tensor})
current, peak = tracemalloc.get_traced_memory()
print(f"Peak Python-side memory: {peak / 1024 / 1024:.2f} MB")
tracemalloc.stop()
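To capture memory allocated outside the Python heap as well, one option is to read the process-wide peak resident set size from the operating system. The sketch below uses the standard library `resource` module, which is available on Unix systems such as Raspberry Pi OS but not on Windows. `peak_rss_mb` is an illustrative helper name.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of the current process in MB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mb()
# ... create the session and run inference here ...
after = peak_rss_mb()
print(f"Peak RSS: {after:.2f} MB (grew {after - before:.2f} MB)")
```

The value is a high-water mark for the whole process, including the interpreter and imported libraries, so the growth between two readings is usually more informative than the absolute number.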
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Run a simple ONNX model through ONNX Runtime, measure inference latency with time.perf_counter(), and compare CPU versus available hardware execution providers.