ONNX
ONNX (Open Neural Network Exchange) is an open-source format for representing machine learning models, designed to enable interoperability between frameworks. For local AI operators, models can be exported to ONNX from PyTorch, TensorFlow, or other frameworks and then run with ONNX Runtime, which optimizes inference across hardware (CPU, GPU, NPU). This matters because ONNX lets a model run on hardware that the original framework does not support well, e.g., through DirectML on Windows with AMD GPUs or CoreML on Apple Silicon. However, ONNX models are less common in the local AI community than GGUF or SafeTensors, and conversion can fail or need manual workarounds when a model uses operations that have no ONNX equivalent.
Deeper dive
ONNX defines a computation graph built from standardized operators (e.g., Conv, MatMul, Relu) and data types. Models are serialized as a protobuf file (.onnx). ONNX Runtime (ORT) is the inference engine that loads and executes the graph, applying optimizations such as operator fusion and constant folding; quantization is normally applied to the model ahead of time with ORT's tooling rather than at load time. For local AI, ONNX is often the route to DirectML on Windows or CoreML on Apple hardware. Windows ML consumes .onnx files directly, while DirectML and CoreML are reached through ONNX Runtime execution providers rather than by loading ONNX into those frameworks natively. ONNX has limitations: not all PyTorch operations map cleanly to ONNX ops (dynamic control flow and custom ops are the usual problems), and the ecosystem for large language models (LLMs) is less mature than GGUF/llama.cpp. Local AI operators may encounter ONNX when using Hugging Face Optimum to export models for Intel OpenVINO or for ONNX Runtime with GPU acceleration. ONNX Runtime's tooling supports INT8 quantization and FP16 conversion, and newer releases add 4-bit weight-only quantization, but quantized ONNX builds of LLMs remain less common than GGUF's k-quants.
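To make the INT8 quantization mentioned above concrete, here is a minimal pure-Python sketch of the affine (scale and zero-point) arithmetic that the ONNX QuantizeLinear and DequantizeLinear operators define for uint8. It is an illustration under stated assumptions, not ONNX Runtime's implementation: choose_params is a hypothetical min/max calibration helper, and real tooling works on tensors with per-tensor or per-channel parameters.

```python
def choose_params(values, qmin=0, qmax=255):
    """Hypothetical min/max calibration: pick scale and zero point for uint8."""
    lo = min(min(values), 0.0)  # the range must include 0 so that 0.0 is exact
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(qmin - lo / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize_linear(values, scale, zero_point, qmin=0, qmax=255):
    """q = saturate(round(x / scale) + zero_point), rounding half to even."""
    quantized = []
    for x in values:
        q = round(x / scale) + zero_point  # Python's round() is half-to-even
        quantized.append(max(qmin, min(qmax, q)))  # saturate to the uint8 range
    return quantized


def dequantize_linear(qvalues, scale, zero_point):
    """x' = (q - zero_point) * scale"""
    return [(q - zero_point) * scale for q in qvalues]


if __name__ == "__main__":
    weights = [-0.8, 0.0, 0.3, 1.2]  # made-up example values
    scale, zero_point = choose_params(weights)
    q = quantize_linear(weights, scale, zero_point)
    restored = dequantize_linear(q, scale, zero_point)
    print("scale:", scale, "zero_point:", zero_point)
    print("quantized:", q)
    print("restored:", restored)
```

Each restored value differs from the original by at most half a quantization step (scale / 2), which is the precision given up in exchange for storing one byte per weight.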
Practical example
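An operator with an AMD RX 7900 XTX on Windows wants to run a Whisper model for speech-to-text. PyTorch's ROCm builds have historically targeted Linux, so GPU-accelerated PyTorch is hard to set up on this machine, whereas ONNX Runtime's DirectML execution provider works with DirectX 12 GPUs, including AMD cards. Using Hugging Face Optimum, they export the model to ONNX: optimum-cli export onnx --model openai/whisper-small whisper_onnx/. They then install the onnxruntime-directml package and load whisper_onnx/ with Optimum's ORTModelForSpeechSeq2Seq class, requesting DmlExecutionProvider as the execution provider. Throughput depends on model size, audio length, and driver version; the figure quoted for this setup is roughly 50 tok/s on the RX 7900 XTX, and it should be benchmarked against a PyTorch-on-ROCm baseline where one is available rather than assumed to be faster.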
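The export-and-run steps above could be scripted roughly as follows. This is a sketch, not a verified recipe: it assumes the optimum, transformers, and onnxruntime-directml packages are installed, that ffmpeg is available for decoding audio files, and that Optimum accepts DmlExecutionProvider for this model on the operator's machine. The file name sample.wav and the helper names are hypothetical.

```python
import subprocess


def build_export_command(model_id, output_dir):
    """Return the optimum-cli invocation that exports a Hub model to ONNX."""
    return ["optimum-cli", "export", "onnx", "--model", model_id, output_dir]


def export_model(model_id, output_dir):
    """Run the export; raises CalledProcessError if optimum-cli fails."""
    subprocess.run(build_export_command(model_id, output_dir), check=True)


def transcribe(audio_path, model_dir, provider="DmlExecutionProvider"):
    """Transcribe one audio file with the exported Whisper model.

    Heavy imports are done lazily so the helpers above work without them.
    The provider name is an assumption; use "CPUExecutionProvider" to test
    on a machine without DirectML.
    """
    from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
    from transformers import AutoProcessor, pipeline

    # The export directory also contains the tokenizer and preprocessor config.
    processor = AutoProcessor.from_pretrained(model_dir)
    model = ORTModelForSpeechSeq2Seq.from_pretrained(model_dir, provider=provider)
    asr = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
    )
    return asr(audio_path)["text"]
```

A typical session would call export_model("openai/whisper-small", "whisper_onnx/") once, then transcribe("sample.wav", "whisper_onnx/") for each recording.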
Workflow example
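Desktop apps such as LM Studio are built around GGUF (llama.cpp) and, on Apple Silicon, MLX, and do not load .onnx files, so ONNX models are normally run through ONNX Runtime itself or through tools built on it. For example, Microsoft publishes quantized ONNX builds of Phi-3-mini on Hugging Face that are intended for the onnxruntime-genai package, which handles tokenization and the generation loop on top of ONNX Runtime. The execution provider determines which hardware ONNX Runtime uses: DirectML on Windows, CoreML on macOS, CUDA on NVIDIA GPUs. Operators may also use the onnxruntime Python package directly: import onnxruntime as ort; session = ort.InferenceSession('model.onnx', providers=['CUDAExecutionProvider', 'CPUExecutionProvider']) runs on NVIDIA GPUs and falls back to the CPU if CUDA is unavailable.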
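Because the right execution provider differs per machine, a script can ask ONNX Runtime what is installed and pick from a preference list instead of hard-coding one name. The sketch below assumes the onnxruntime package (or a variant such as onnxruntime-directml) is installed; the preference order and the helper names are illustrative choices, not an ONNX Runtime convention.

```python
# Preference order is an assumption: fastest accelerator first, CPU last.
PREFERRED_PROVIDERS = (
    "CUDAExecutionProvider",    # NVIDIA GPUs
    "DmlExecutionProvider",     # DirectML on Windows
    "CoreMLExecutionProvider",  # Apple hardware
    "CPUExecutionProvider",     # always-available fallback
)


def pick_providers(available, preferred=PREFERRED_PROVIDERS):
    """Keep the preferred providers that are installed, in preference order."""
    chosen = [name for name in preferred if name in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")  # guarantee a fallback
    return chosen


def open_session(model_path):
    """Create an InferenceSession using the best providers on this machine."""
    import onnxruntime as ort  # imported lazily; requires onnxruntime

    providers = pick_providers(ort.get_available_providers())
    session = ort.InferenceSession(model_path, providers=providers)
    # get_providers() reports what the session actually ended up using.
    print("Active providers:", session.get_providers())
    return session
```

Calling open_session('model.onnx') on an NVIDIA machine with onnxruntime-gpu installed should select CUDA first, while the same script on a Windows machine with onnxruntime-directml would select DirectML.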
Reviewed by Fredoline Eruo. See our editorial policy.