RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Hardware & infrastructure

ONNX

ONNX (Open Neural Network Exchange) is an open format for representing machine learning models, designed for interoperability between frameworks. For local AI operators, this means a model can be exported from PyTorch, TensorFlow, or another framework and then run with ONNX Runtime, which targets CPUs, GPUs, and NPUs through pluggable execution providers. This matters because ONNX lets a model run on hardware that the original framework supports poorly, e.g., through the DirectML execution provider on Windows with AMD GPUs or the CoreML execution provider on Apple Silicon. However, ONNX models are less common in the local AI community than GGUF or safetensors, and conversion can fail or drop functionality when a model uses operations with no ONNX equivalent, which may then require custom operator implementations.

Deeper dive

ONNX defines a computation graph built from standardized operators (e.g., Conv, MatMul, Relu) and data types. Models are serialized as a protobuf file (.onnx). ONNX Runtime (ORT) is the inference engine that loads and executes the graph, applying graph-level optimizations such as operator fusion and constant folding; quantization is a separate, offline step performed with ORT's quantization tooling. For local AI, ONNX is most often encountered on Windows, where Windows ML consumes .onnx files directly and ORT's DirectML execution provider handles GPU acceleration. On Apple hardware, Core ML uses its own model format, so ONNX models reach it through ORT's CoreML execution provider rather than being loaded natively. ONNX has limitations: not all PyTorch operations map cleanly to ONNX operators (dynamic control flow and custom ops are common trouble spots), and the ecosystem for large language models (LLMs) is less mature than GGUF/llama.cpp. Operators may encounter ONNX when using Hugging Face Optimum to export models for ONNX Runtime with GPU acceleration, or Optimum Intel for OpenVINO. ORT supports INT8 quantization (dynamic and static) and FP16 conversion, but quantized ONNX LLMs are less common than GGUF's k-quants.
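
To make the idea of constant folding concrete, here is a toy sketch in plain Python. It is not ONNX Runtime's implementation and does not use the onnx package; the Node class, the OPS table, and the scalar values are all hypothetical stand-ins for real graph nodes and tensors. It only shows the principle: nodes whose inputs are all known ahead of time are evaluated once, and only the remaining nodes run at inference time.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Node:
    op: str            # operator name, e.g. "Mul" or "Relu"
    inputs: List[str]  # names of the values this node reads
    output: str        # name of the value this node produces


# Toy scalar versions of a few operators (real ONNX operators act on tensors).
OPS = {
    "Add": lambda a, b: a + b,
    "Mul": lambda a, b: a * b,
    "Relu": lambda a: max(a, 0.0),
}


def fold_constants(nodes: List[Node], constants: Dict[str, float]) -> Tuple[List[Node], Dict[str, float]]:
    """Pre-compute every node whose inputs are all known constants.

    Assumes `nodes` is already in topological order.
    """
    known = dict(constants)
    remaining = []
    for node in nodes:
        if all(name in known for name in node.inputs):
            known[node.output] = OPS[node.op](*(known[n] for n in node.inputs))
        else:
            remaining.append(node)
    return remaining, known


def run(nodes: List[Node], values: Dict[str, float]) -> Dict[str, float]:
    """Execute nodes in order, given values for all graph inputs and constants."""
    env = dict(values)
    for node in nodes:
        env[node.output] = OPS[node.op](*(env[n] for n in node.inputs))
    return env


graph = [
    Node("Mul", ["w", "scale"], "w_scaled"),  # constants only, so foldable
    Node("Mul", ["x", "w_scaled"], "y"),      # depends on the runtime input x
    Node("Relu", ["y"], "out"),
]
constants = {"w": 2.0, "scale": 0.5}

folded, known = fold_constants(graph, constants)
result = run(folded, {**known, "x": 3.0})
```

After folding, the first Mul no longer runs per inference; its result is stored alongside the other constants, and the folded graph produces the same output as the original.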

Practical example

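An operator with an AMD RX 7900 XTX on Windows wants to run a Whisper model for speech-to-text. PyTorch's ROCm builds primarily target Linux, so GPU acceleration for this card under Windows is more readily available through ONNX Runtime with DirectML. Using Hugging Face Optimum, they export the model to ONNX: optimum-cli export onnx --model openai/whisper-small whisper_onnx/. They then load the exported directory with Optimum's ORTModelForSpeechSeq2Seq class, passing provider="DmlExecutionProvider" (this requires the onnxruntime-directml package), and run transcription through a Transformers automatic-speech-recognition pipeline. This example cites ~50 tok/s on the RX 7900 XTX; actual throughput depends on model size, precision, and driver version, and PyTorch with ROCm may be slower or unavailable on the same system.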
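
One way to run an exported Whisper directory from Python is through Optimum's ONNX Runtime model classes. The following is a minimal sketch, not a verified recipe for this exact card: it assumes the optimum, transformers, and onnxruntime-directml packages are installed, that whisper_onnx/ was produced by the optimum-cli export command, and that ffmpeg is available for decoding audio files. The transcribe helper and its parameter names are hypothetical.

```python
def transcribe(audio_path: str,
               model_dir: str = "whisper_onnx/",
               provider: str = "DmlExecutionProvider") -> str:
    """Transcribe one audio file with an ONNX-exported Whisper model.

    `provider` is passed to ONNX Runtime; "DmlExecutionProvider" assumes the
    onnxruntime-directml package on Windows. Use "CPUExecutionProvider" to
    check that the export works before involving the GPU.
    """
    # Heavy optional dependencies are imported lazily, so defining this
    # function does not require them to be installed.
    from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
    from transformers import AutoProcessor, pipeline

    # Load the exported encoder/decoder graphs from the export directory.
    model = ORTModelForSpeechSeq2Seq.from_pretrained(model_dir, provider=provider)
    # Assumption: the export step also saved the tokenizer and feature
    # extractor configuration into the same directory.
    processor = AutoProcessor.from_pretrained(model_dir)

    asr = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
    )
    return asr(audio_path)["text"]
```

Calling transcribe("sample.wav") would return the transcript as a string; sample.wav is a placeholder file name.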

Workflow example

LM Studio's bundled runtimes are llama.cpp (GGUF) and, on Apple Silicon, MLX, so .onnx files are generally run outside it through ONNX Runtime. For example, Microsoft publishes quantized ONNX builds of Phi-3-mini on Hugging Face, intended for use with the onnxruntime-genai package, which handles tokenization and the generation loop on top of ONNX Runtime. Operators may also use the onnxruntime Python package directly: import onnxruntime as ort; session = ort.InferenceSession('model.onnx', providers=['CUDAExecutionProvider', 'CPUExecutionProvider']) to run on NVIDIA GPUs with a CPU fallback. The DirectML and CoreML equivalents are 'DmlExecutionProvider' and 'CoreMLExecutionProvider'.
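
When the same script has to run on machines with different hardware, the provider list can be chosen at runtime instead of hard-coded. The sketch below is a small helper for that. The pick_providers function and its preference order are assumptions made for illustration, not an official ranking; the provider name strings are the ones ONNX Runtime uses.

```python
from typing import Iterable, List

# Illustrative preference order (an assumption; adjust for your hardware).
PREFERRED = [
    "CUDAExecutionProvider",    # NVIDIA GPUs
    "DmlExecutionProvider",     # DirectML on Windows (AMD, Intel, NVIDIA)
    "CoreMLExecutionProvider",  # Apple hardware
    "CPUExecutionProvider",     # always-available fallback
]


def pick_providers(available: Iterable[str],
                   preferred: List[str] = PREFERRED) -> List[str]:
    """Return the preferred providers that are actually available, in order.

    CPUExecutionProvider is always appended as the last resort so that a
    session can still be created if no accelerator is present.
    """
    avail = set(available)
    chosen = [p for p in preferred if p in avail]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


# Example with a hard-coded list, as might be reported on a Windows AMD system.
windows_amd = pick_providers(["DmlExecutionProvider", "CPUExecutionProvider"])
```

With onnxruntime installed, the available list would come from ort.get_available_providers(), and the result can be passed as the providers argument of ort.InferenceSession.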

Reviewed by Fredoline Eruo. See our editorial policy.
