
ONNX Runtime

Microsoft's cross-platform inference runtime for ONNX models. The reference path when you need a single runtime that targets CUDA + DirectML + CoreML + OpenVINO + ROCm from one binary. Stronger on classical models (vision, NLP, speech) than on LLMs, where vLLM and llama.cpp lead.

By Fredoline Eruo · Last verified May 7, 2026 · 16,000 GitHub stars

Overview

What ONNX Runtime actually is

ONNX Runtime is Microsoft's cross-platform inference runtime for the ONNX model format, and the only meaningful single-runtime path that targets CUDA, DirectML, CoreML, OpenVINO, and ROCm from one binary. It is not a new training framework, not a new model format outside ONNX, and not specifically an LLM engine — it is a graph-execution runtime that runs whatever ONNX model you hand it on whatever Execution Provider (EP) the host hardware supports.

That positioning means ONNX Runtime is strongest on classical ML workloads (vision, NLP encoders, speech, embeddings, classical transformers used in feature pipelines) and weakest, comparatively, on bleeding-edge LLM serving. For a 70B-class generative model in production, vLLM and TensorRT-LLM outclass it on throughput. For a vision model that has to run on a customer's Surface laptop and a Mac and a Linux box from one binary, ONNX Runtime is unrivaled.

Where it fits in the stack

ONNX Runtime lives at the runtime layer for cross-platform model deployment. The typical stack:

  • Source model: PyTorch / TensorFlow / scikit-learn → exported to ONNX
  • Optimization: onnxruntime.quantization for INT8 / W4A16, olive for graph-level fusion
  • Runtime: ONNX Runtime + the right Execution Provider for the target
  • Hardware: anything an EP exists for — NVIDIA, AMD, Intel CPU/GPU/NPU, Apple Silicon, Snapdragon NPUs

It is not the right runtime for a 70B FP8 chatbot on H100 (use TensorRT-LLM), and it is not the right runtime for a GGUF-only homelab (use llama.cpp). It is the runtime for "this model has to ship on five different OS / GPU combinations and I need one inference path."
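As a hedged sketch of the first step in the stack above (source model → ONNX export), the following assumes a small torchvision model; the model, file name, shapes, and opset are illustrative rather than a recommendation:

import torch
import torchvision

model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)            # example input the exporter traces with
torch.onnx.export(
    model,
    dummy,
    "resnet18.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},      # keep batch size flexible at runtime
    opset_version=17,
)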

Best use cases

  • Cross-platform desktop ML. Vision, OCR, speech, embeddings, classical transformers shipping inside a desktop app on Windows + macOS + Linux from one ONNX file.
  • Windows + AMD GPU inference via DirectML. The DirectML EP is the cleanest "AMD GPU on Windows" path that exists outside llama.cpp's HIPBLAS.
  • NPU-targeted inference. Snapdragon X / Lunar Lake NPUs both expose ONNX Runtime EPs. See /stacks/android-on-device-ai.
  • On-device embeddings for RAG. Sentence-transformers exported to ONNX run fast on CPU and fit cleanly into a desktop app (sketch after this list).
  • Mobile ML. ONNX Runtime Mobile (a separate but related build) is the default choice on Android for non-LLM ML.
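A hedged sketch of the on-device embeddings case, assuming a sentence-transformers model already exported to ONNX (the model ID, file name, and pooling are illustrative):

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
sess = ort.InferenceSession("all-MiniLM-L6-v2.onnx", providers=["CPUExecutionProvider"])

enc = tok(["local AI on a laptop"], padding=True, return_tensors="np")
wanted = {i.name for i in sess.get_inputs()}        # only feed inputs the graph declares
hidden = sess.run(None, {k: v for k, v in enc.items() if k in wanted})[0]
# Mean-pool token vectors into one embedding per sentence
mask = enc["attention_mask"][:, :, None]
embedding = (hidden * mask).sum(axis=1) / mask.sum(axis=1)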

OS support

OS · Quality · Notes
Windows 10 / 11 · excellent · reference platform; full DirectML EP coverage
Linux (x86_64) · excellent · full CUDA, ROCm, OpenVINO, CPU
macOS (Apple Silicon) · excellent · CoreML EP for ANE / GPU; CPU baseline
Linux (ARM64) · good · CPU + Vulkan-class fallbacks
Android · good · via ONNX Runtime Mobile
iOS · good · via CoreML EP + ONNX Runtime Mobile

Hardware / backend support

The EP catalog as of May 2026 (only the EPs relevant to AI workloads are listed):

  • CUDA EP (NVIDIA all generations)
  • TensorRT EP (NVIDIA; uses TensorRT under the hood for maximum throughput on supported ops)
  • DirectML EP (Windows; AMD + Intel + NVIDIA + Snapdragon NPU)
  • CoreML EP (Apple Silicon; targets ANE + GPU + CPU)
  • OpenVINO EP (Intel CPU / iGPU / NPU; see OpenVINO)
  • ROCm EP (AMD on Linux; see ROCm)
  • CPU EP (every platform; the always-available fallback)
  • QNN EP (Qualcomm Snapdragon NPU; the Snapdragon X Elite path)
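To check which of these EPs a given build can actually offer (a CUDA wheel will not list DirectML, and vice versa), the provider identifiers are the real strings onnxruntime uses:

import onnxruntime as ort

# EPs compiled into this wheel, in default priority order
print(ort.get_available_providers())
# e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider'] on the onnxruntime-gpu wheel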

Model / quant format support

  • FP32 / FP16 / BF16 — the baseline
  • INT8 — onnxruntime.quantization produces both static and dynamic INT8 models
  • W4A16 (INT4 weights, FP16 activations) — supported via the Olive toolkit; the LLM-relevant precision
  • NF4 / FP8 — partial; lags TensorRT-LLM
  • No GGUF, no AWQ-INT4 directly, no EXL2, no MLX — different ecosystem

For the cross-runtime quant ladder see /systems/quantization-formats.
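A minimal sketch of the dynamic-INT8 step with onnxruntime.quantization (file names are illustrative; static INT8 additionally needs a calibration data reader):

from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic INT8: weights quantized offline, activations quantized at runtime
quantize_dynamic(
    model_input="model.onnx",
    model_output="model.int8.onnx",
    weight_type=QuantType.QInt8,
)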

Setup path

The Python install:

pip install onnxruntime-gpu      # CUDA EP
# or:
pip install onnxruntime-directml # DirectML EP

A minimal inference call:

import numpy as np
import onnxruntime as ort
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
ids = np.array([[101, 2023, 102]], dtype=np.int64)  # example input token IDs
out = sess.run(None, {"input_ids": ids})

For an LLM-shaped workflow, export from Hugging Face with optimum-cli export onnx, then optimize with the Olive toolchain. The full pipeline is documented at onnxruntime.ai.
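A hedged Python equivalent of that export step via Optimum (the model ID is only an example; any exporter-supported architecture follows the same path):

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"          # illustrative model choice
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
model.save_pretrained("phi3-onnx")                     # writes the ONNX graph + config
AutoTokenizer.from_pretrained(model_id).save_pretrained("phi3-onnx")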

What breaks first

  1. EP fallback to CPU silently. If the GPU EP fails to initialize (driver, CUDA mismatch, missing library), ONNX Runtime falls back to the CPU EP without raising. Always log the active EP at startup; a minimal guard follows this list.
  2. HF -> ONNX conversion drift. Newer architectures (novel attention, MoE routers) sometimes need patched exporters; the conversion step is the most common source of "the ONNX model produces different outputs than the HF original."
  3. DirectML quirks on the AMD path. Some ops fall back to CPU; per-op profiling is the only way to find them.
  4. CUDA + cuDNN version pinning. The CUDA EP is built against specific cuDNN majors; mixing minors can produce silent corruption.
  5. Mobile build size. ONNX Runtime Mobile needs explicit op-set pruning to keep APK / IPA size sane.
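A minimal guard for failure mode 1, so a broken GPU EP fails fast instead of silently serving from CPU (the file name and error text are illustrative):

import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
active = sess.get_providers()
if active[0] != "CUDAExecutionProvider":
    raise RuntimeError(f"CUDA EP did not initialize; active providers: {active}")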

Alternatives by intent

If you want… → Reach for
LLM-tuned high-throughput serving → vLLM, TensorRT-LLM
GGUF-native local LLMs → llama.cpp or Ollama
Apple-native → MLX-LM (LLMs) or CoreML directly (classical)
Intel CPU / iGPU / NPU first-party → OpenVINO directly (no ONNX layer)
Mobile-only → ExecuTorch, MLC-LLM, or ONNX Runtime Mobile

Best pairings

  • OpenVINO EP for Intel hardware — the cleanest "Intel CPU + iGPU + NPU" path
  • DirectML EP + a Windows desktop app — the cleanest cross-vendor Windows GPU path
  • CoreML EP + a macOS app — the ANE-aware Apple path
  • Snapdragon X Elite + QNN EP — the laptop NPU path
  • Apple A18 Pro + CoreML EP via ONNX Runtime Mobile — the iOS NPU path

Who should avoid ONNX Runtime

  • Operators serving 70B+ generative LLMs in production. The throughput tier above ONNX Runtime exists; use vLLM or TensorRT-LLM.
  • Homelabs running GGUF-native models. No reason to go through an ONNX export step.
  • Workloads that need maximum AWQ / EXL2 / FP8 throughput. Wrong runtime; pick CUDA-server engines.
  • Single-platform deployments. If you're deploying only on Linux + NVIDIA, the cross-platform overhead is wasted; pick a CUDA-native runtime directly.

Related

  • Stacks: /stacks/android-on-device-ai, /stacks/private-rag-laptop
  • System guides: /systems/quantization-formats, /setup
  • Hardware: Snapdragon X Elite, Apple A18 Pro, RTX 4090
  • Errors: /errors/wsl2-gpu-not-detected

Pros

  • Cross-platform + cross-backend with one runtime — rare in this space
  • DirectML provider unlocks Windows + AMD + NPU paths most Linux-native runtimes can't reach
  • Microsoft-maintained — production-grade roadmap

Cons

  • LLM-specific optimizations behind vLLM and llama.cpp
  • Hugging Face → ONNX conversion is an extra step vs direct GGUF / safetensors
  • Quant ecosystem narrower than the Hugging Face mainline

Compatibility

Operating systems: Windows · macOS · Linux
GPU backends: NVIDIA CUDA · DirectML · CoreML · OpenVINO · ROCm
License: free + open-source

Runtime health

Operator-grade signals on how actively ONNX Runtime is being maintained, how fresh its measurements are, and what failure classes operators have flagged. Every label below is anchored to a real date or count — we never infer maintainer activity we can't show.

Release cadence

Derived from the most recent editorial signal for this runtime.

Active
Updated May 7, 2026

6 days since last refresh

Benchmark freshness

How recent the editorial measurements on this runtime are.

0 editorial benchmarks

No editorial benchmarks for this runtime yet.

Community reproduction

Submissions that match an editorial measurement on similar hardware.

0 reproduced reports

No community reproductions on file yet.

Get ONNX Runtime

Official site
https://onnxruntime.ai
GitHub
https://github.com/microsoft/onnxruntime

Frequently asked

Is ONNX Runtime free?

ONNX Runtime is free and open source (MIT license). There is no paid tier.

What operating systems does ONNX Runtime support?

ONNX Runtime supports Windows, macOS, Linux.

Which GPUs work with ONNX Runtime?

ONNX Runtime supports the NVIDIA CUDA, DirectML, CoreML, OpenVINO, and ROCm Execution Providers. CPU-only inference is also possible; it is fast enough for small classical models and embeddings, but slow for large generative models.

Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.

Related — keep moving

Compare hardware
  • RTX 3090 vs RTX 4090 →
  • Apple M4 Max vs RTX 4090 →
Buyer guides
  • Best GPU for local AI →
  • Best budget GPU →
When it doesn't work
  • llama.cpp too slow →
  • llama.cpp build failed →
  • llama.cpp Metal crash (Mac) →
  • GGUF tokenizer mismatch →
Recommended hardware
  • RTX 3090 (used) →
  • Apple M4 Max →
Alternatives
  • MLX-LM
  • ExLlamaV2
  • llama.cpp
  • Llamafile
  • Ollama
  • IPEX-LLM
  • CTranslate2
  • Intel OpenVINO
Before you buy

Verify ONNX Runtime runs on your specific hardware before committing money.

  • Will it run on my hardware? →
  • Custom hardware comparison →
  • GPU recommender (4 questions) →