03. ONNX Runtime

Chapter 3 of 18 · 20 min

ONNX Runtime provides a hardware-agnostic inference engine that abstracts over CPU, GPU, and NPU backends. For edge deployment, its execution provider pattern enables hardware acceleration without modifying the model.

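Installation on Raspberry Pi requires a pre-built wheel that matches both the ARM64 (aarch64) architecture and the installed Python version. On a 64-bit OS, pip selects a matching wheel when one is published: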

pip install onnxruntime
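Because the wheel must match both the CPU architecture and the Python version, it can help to print those facts before and after installing. The following is a minimal sketch using only the standard library; the helper name describe_environment is hypothetical, not part of ONNX Runtime.

```python
import platform
import sys
from importlib import metadata


def describe_environment(package="onnxruntime"):
    """Collect the facts that determine which wheel pip can install."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # the package is not installed in this interpreter
    return {
        "machine": platform.machine(),  # e.g. 'aarch64' on a 64-bit Raspberry Pi OS
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "package_version": version,
    }


info = describe_environment()
print(info)
```

If package_version is None after running pip, the install most likely went into a different interpreter or virtual environment than the one running the script.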

The choice of execution provider determines where the model actually runs. By default, ONNX Runtime uses the CPU execution provider. Using hardware acceleration requires explicit provider selection:

import onnxruntime as ort

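# List the providers compiled into this build
print(ort.get_available_providers())
# Example output from a build with extra providers enabled (varies by build):
# ['CPUExecutionProvider', 'ARMNNExecutionProvider', 'QNNExecutionProvider']
# The stock pip wheel typically reports only 'CPUExecutionProvider'.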

# Create session with specific provider
options = ort.SessionOptions()
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# CPU with optimization
# CPU memory-arena behaviour is configured on SessionOptions
# (options.enable_cpu_mem_arena), not through provider options.
session_cpu = ort.InferenceSession(
    "model.onnx",
    sess_options=options,
    providers=['CPUExecutionProvider']
)

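# Arm NN for hardware acceleration. This needs an ONNX Runtime build compiled
# with Arm NN support (plus libarmnn); the stock pip wheel does not include it.
# The option keys below are illustrative; check them against your build.
# session_armnn = ort.InferenceSession(
#     "model.onnx",
#     sess_options=options,
#     providers=[('ARMNNExecutionProvider', {'backends': ['CpuAcc', 'GpuAcc']})]
# )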
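Since the set of available providers differs between builds, a script that hard-codes an accelerator can fail on a device where only the CPU provider exists. One possible approach, sketched here, is to filter a preference list against what the build reports and always keep the CPU provider as a fallback. The function name choose_providers and the example lists are hypothetical.

```python
def choose_providers(preferred, available):
    """Return the preferred providers this build offers, with CPU as fallback."""
    chosen = [name for name in preferred if name in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


# Hypothetical lists for illustration; on a device, pass
# onnxruntime.get_available_providers() as `available`.
wanted = ["QNNExecutionProvider", "ARMNNExecutionProvider"]
stock_build = ["CPUExecutionProvider"]
custom_build = ["ARMNNExecutionProvider", "CPUExecutionProvider"]

print(choose_providers(wanted, stock_build))   # ['CPUExecutionProvider']
print(choose_providers(wanted, custom_build))  # ['ARMNNExecutionProvider', 'CPUExecutionProvider']
```

The returned list can be passed as the providers argument of ort.InferenceSession. Calling get_providers() on the resulting session then shows which providers were actually registered.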

Inference execution follows a straightforward pattern:

import numpy as np

def run_inference(session, input_data):
    input_name = session.get_inputs()[0].name
    output_name = session.get_outputs()[0].name
    
    # Cast to float32 (assumes a float32 model input); this does not check the shape
    input_tensor = np.array(input_data, dtype=np.float32)
    
    outputs = session.run([output_name], {input_name: input_tensor})
    return outputs[0]

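# Timing measurement
import time

input_name = session_cpu.get_inputs()[0].name
# Placeholder input: replace the shape with the one your model expects
input_tensor = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Warm-up run so one-time initialization is not counted
session_cpu.run(None, {input_name: input_tensor})

start = time.perf_counter()
for _ in range(100):
    result = session_cpu.run(None, {input_name: input_tensor})
elapsed = (time.perf_counter() - start) / 100

print(f"Average inference time: {elapsed*1000:.2f} ms")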
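An average hides outliers, which matter on edge devices where thermal throttling or background tasks can stretch individual runs. As a sketch, the loop above can be generalized into a helper that discards warm-up calls and reports the median and 95th percentile as well. The name benchmark and its parameters are hypothetical, and the workload here is a stand-in so the block runs without a model.

```python
import math
import statistics
import time


def benchmark(fn, warmup=10, runs=100):
    """Time `fn` and return latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = math.ceil(0.95 * runs) - 1  # nearest-rank percentile
    return {
        "runs": runs,
        "min_ms": samples[0],
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


# Stand-in workload; on a device, pass something like
# lambda: session_cpu.run(None, {input_name: input_tensor})
stats = benchmark(lambda: sum(range(10_000)), warmup=5, runs=50)
print({key: round(value, 4) for key, value in stats.items()})
```

A large gap between median_ms and p95_ms suggests the device is occasionally stalling, which a single average would not reveal.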

A frequent failure is a tensor shape mismatch. ONNX models store their expected input shapes in the model metadata, so printing them before inference confirms compatibility. Dynamic dimensions appear as a symbolic name (a string) or None rather than an integer:

# Uses the session_cpu object created earlier
for input_meta in session_cpu.get_inputs():
    print(f"Input: {input_meta.name}, shape: {input_meta.shape}, dtype: {input_meta.type}")
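Printing shapes relies on a human spotting the mismatch. A small check can do the comparison automatically before calling run. The sketch below assumes the expected shape is a list like the one reported by get_inputs(), where fixed dimensions are integers and dynamic ones are strings or None; the helper name shape_matches is hypothetical.

```python
def shape_matches(expected, actual):
    """Return True if `actual` is compatible with a model input shape.

    Fixed dimensions (ints) must match exactly; dynamic dimensions
    (a symbolic string or None) accept any size.
    """
    if len(expected) != len(actual):
        return False
    return all(
        not isinstance(exp, int) or exp == act
        for exp, act in zip(expected, actual)
    )


# Hypothetical image-model input with a dynamic batch dimension
model_shape = ["batch", 3, 224, 224]

print(shape_matches(model_shape, (1, 3, 224, 224)))  # True
print(shape_matches(model_shape, (1, 224, 224, 3)))  # False: channels-last layout
print(shape_matches(model_shape, (3, 224, 224)))     # False: missing batch dimension
```

With NumPy input, the second argument would be the array's shape attribute.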

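Memory profiling requires manual tracking because ONNX Runtime doesn't expose peak memory usage directly. Note that tracemalloc only sees allocations made through Python's allocator (NumPy arrays included), not the native buffers ONNX Runtime allocates internally, so treat the figure below as a lower bound: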

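import tracemalloc

# Reuses session_cpu, input_name, and input_tensor from the timing example
tracemalloc.start()
result = session_cpu.run(None, {input_name: input_tensor})
current, peak = tracemalloc.get_traced_memory()
# Python-level allocations only; native runtime memory is not included
print(f"Peak traced memory: {peak / 1024 / 1024:.2f} MB")
tracemalloc.stop()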
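To capture memory allocated by native code as well, one option on Linux (including Raspberry Pi OS) is to read the process's peak resident set size from the standard resource module. This is a sketch under that assumption: the module is not available on Windows, the helper name peak_rss_mb is hypothetical, and the bytearray is only a stand-in for session creation and inference.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports kilobytes; macOS reports bytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mb()
buffer = bytearray(50 * 1024 * 1024)  # stand-in for loading a model and running it
after = peak_rss_mb()

print(f"Peak RSS before: {before:.1f} MiB, after: {after:.1f} MiB")
```

The value is a high-water mark for the whole process, so it never decreases. Measuring one model per process gives the cleanest comparison between providers.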

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
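One way to keep such a record consistent across runs is to write it as JSON. The sketch below uses only the standard library; the function name make_run_record, the field names, and the latency value are hypothetical placeholders to be replaced with measured results.

```python
import json
import platform
from importlib import metadata


def make_run_record(model_path, provider, mean_latency_ms, notes=""):
    """Bundle the details needed to reproduce a measurement."""
    try:
        runtime_version = metadata.version("onnxruntime")
    except metadata.PackageNotFoundError:
        runtime_version = None  # onnxruntime is not installed here
    return {
        "onnxruntime": runtime_version,
        "python": platform.python_version(),
        "machine": platform.machine(),
        "model_path": model_path,
        "provider": provider,
        "mean_latency_ms": round(mean_latency_ms, 3),
        "notes": notes,
    }


# Placeholder values for illustration only, not a measured result
record = make_run_record(
    "model.onnx",
    "CPUExecutionProvider",
    12.3456,
    notes="placeholder latency; note model size and available memory here",
)
print(json.dumps(record, indent=2))
```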

EXERCISE

Run a simple ONNX model through ONNX Runtime, measure inference latency with time.perf_counter(), and compare CPU versus available hardware execution providers.