Local AI runtime compatibility matrix
What actually runs on your hardware. The cross-OS, cross-backend matrix for 25 runtimes — vLLM, llama.cpp, Ollama, MLX, ONNX Runtime, IPEX-LLM, ExLlamaV2, and the rest. Every row carries an operator caveat, not just a checkmark.
Columns group into OS (Win / macOS / Linux), backend (CUDA / ROCm / Apple / Intel), Mobile, Docker, quant formats (GGUF / AWQ / GPTQ / FP8), and fit (Beginner / Production / Distributed). Legend: ✓ supported, ◐ partial, · no, — not applicable.

| Runtime | Win | macOS | Linux | CUDA | ROCm | Apple | Intel | Mobile | Docker | GGUF | AWQ | GPTQ | FP8 | Beginner | Production | Distributed | Caveat | Best for |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ollama | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ◐ | · | ✓ | ✓ | · | · | · | ✓ | · | · | Single-user, sequential decode. Concurrency tops out at 4-8 requests; no continuous batching. | first install, model swapping, hobby use |
| LM Studio | ✓ | ✓ | ✓ | ✓ | ◐ | ✓ | · | · | · | ✓ | · | · | · | ✓ | · | · | GUI-first with built-in Hugging Face browser. Not a server; the OpenAI-compatible endpoint binds to localhost only by default. Client sketch below the table. | non-CLI users, model exploration |
| vLLM | · | · | ✓ | ✓ | ◐ | · | · | · | ✓ | · | ✓ | ✓ | ✓ | · | ✓ | ✓ | Linux + NVIDIA-only in practice; the AMD ROCm path exists but lags. Continuous batching + PagedAttention is the production reference. Offline batching sketch below the table. | production serving, multi-tenant inference |
| SGLang | · | · | ✓ | ✓ | ◐ | · | · | · | ✓ | · | ✓ | ✓ | ✓ | · | ✓ | ✓ | RadixAttention prefix-cache compounds for agent loops with stable system prompts. Beats vLLM at high prefix-cache hit rates. | agent serving, multi-tenant with stable prompts |
| llama.cpp | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ◐ | ◐ | ✓ | ✓ | · | · | · | · | · | ◐ | The most portable runtime. CPU + every GPU backend. Layer-split for asymmetric multi-GPU. Lags vLLM at concurrency. | cross-platform deployment, asymmetric GPU pairs |
| ExLlamaV2 | ✓ | · | ✓ | ✓ | ✓ | · | · | · | ✓ | · | · | ✓ | · | · | · | · | EXL2 quants are sharper than GGUF at the same size. Single-stream throughput leader on consumer NVIDIA. NVIDIA + AMD; no Apple. | consumer NVIDIA, max single-stream tok/s |
| TabbyAPI | ✓ | · | ✓ | ✓ | ✓ | · | · | · | ✓ | · | · | ✓ | · | · | · | · | OpenAI-compatible HTTP server in front of ExLlamaV2. The production-style wrapper for EXL2 deployments. | EXL2-based serving with API |
| MLX-LM | · | ✓ | · | · | · | ✓ | · | · | · | · | · | · | · | · | · | · | Apple first-party. MLX-4bit / MLX-8bit quants only. Runs on M-series unified memory; no Intel Mac path. Sketch below the table. | Apple Silicon Macs, unified memory deployments |
| MLX Swift | · | ✓ | · | · | · | ✓ | · | ✓ | · | · | · | · | · | · | ✓ | · | iOS / iPadOS / macOS app-bundled inference. Production-grade for App Store deployments. Same checkpoints as desktop MLX. | iOS app-bundled local LLM inference |
| MLC LLM | ✓ | ✓ | ✓ | ✓ | ◐ | ✓ | ◐ | ✓ | ✓ | · | · | · | · | · | · | · | TVM-based; compiles models for any GPU with Vulkan/Metal/WebGPU. Cross-platform mobile reference. Compile-time overhead. | cross-platform mobile + WebGPU |
| TensorRT-LLM | · | · | ✓ | ✓ | · | · | · | · | ✓ | · | ✓ | · | ✓ | · | ✓ | ✓ | Highest peak throughput on H100 / H200 + the FP8 transformer engine. Recompile-per-config friction is real. | datacenter NVIDIA at peak throughput |
| Text Generation Inference (TGI) | · | · | ✓ | ✓ | ✓ | · | ◐ | · | ✓ | · | ✓ | ✓ | ✓ | · | ✓ | ✓ | Hugging Face's serving runtime. Inference Endpoints + TGI is the HF-native production path; community uptake trails vLLM. | HF Inference Endpoints, HF-native deployments |
| Ray Serve | ◐ | ✓ | ✓ | ✓ | ◐ | ✓ | ◐ | · | ✓ | — | — | — | — | · | ✓ | ✓ | Orchestration layer wrapping vLLM / SGLang / TGI replicas. Scales request throughput, not single-model size. | multi-replica serving + autoscaling |
| Exo | · | ✓ | ✓ | ◐ | ◐ | ✓ | · | · | · | ✓ | · | · | · | · | · | ✓ | Multi-Mac / mixed-device clustering over Thunderbolt + LAN. Layer-shards a single model across machines. | multi-Mac clusters, >192GB unified memory targets |
| Petals | · | ✓ | ✓ | ✓ | ◐ | ✓ | · | · | ✓ | · | · | · | · | · | · | ✓ | BitTorrent-style distributed inference. Public-swarm mode is research-tier; private-swarm production deployments exist but are rare. | research, private compute-pooling |
| ONNX Runtime | ✓ | ✓ | ✓ | ✓ | ◐ | ✓ | ✓ | ✓ | ✓ | · | · | · | ◐ | · | ✓ | · | Microsoft's cross-platform inference runtime. Strongest on classical models; LLM-specific optimizations trail vLLM / llama.cpp. | cross-OS / cross-backend production with one runtime |
| Intel OpenVINO | ✓ | ✓ | ✓ | · | · | · | ✓ | · | ✓ | · | · | · | · | · | ✓ | · | Intel-only — Arc GPU + Lunar Lake NPU + AVX-512 CPU paths. The reference runtime for Intel hardware. | Intel Arc + Lunar Lake / Meteor Lake NPU |
| IPEX-LLM | ✓ | · | ✓ | · | · | · | ✓ | ◐ | ✓ | ✓ | ✓ | ✓ | · | · | · | · | Intel's PyTorch-native LLM runtime. The first-class path for LLMs on Intel Arc A770/B580 and the Lunar Lake NPU; also powers the IPEX-accelerated Ollama build. | Intel Arc GPU LLM inference |
| CTranslate2 | ✓ | ✓ | ✓ | ✓ | · | ✓ | ◐ | · | ✓ | · | · | · | · | · | ✓ | · | Specialized transformer runtime. Whisper (faster-whisper) reference. Encoder-decoder optimization that LLM runtimes don't prioritize. | Whisper, NMT, encoder-decoder inference |
| DirectML | ✓ | · | · | ✓ | ✓ | · | ✓ | · | · | · | · | · | ◐ | · | ✓ | · | Windows DirectX 12 inference backend. Vendor-agnostic on Windows: AMD / Intel / Qualcomm GPU + NPU without ROCm or vendor SDKs. | Windows multi-vendor GPU + NPU inference |
| llama-cpp-python | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ◐ | · | ✓ | ✓ | · | · | · | ✓ | · | · | Python bindings + OpenAI-compatible server. The fastest path from `pip install` to a working endpoint. Backend pin via wheel choice. Sketch below the table. | Python-first integration, scripting, prototyping |
| Aphrodite Engine | ◐ | · | ✓ | ✓ | ✓ | · | · | · | ✓ | · | ✓ | ✓ | ✓ | · | · | ✓ | vLLM fork specialized for creative-writing / role-play. Adds DRY / XTC / dynatemp samplers vLLM doesn't ship. | SillyTavern + role-play workloads |
| ExecuTorch | · | ✓ | ✓ | · | · | ✓ | · | ✓ | · | · | · | · | · | · | · | · | PyTorch's mobile/edge inference runtime. Compiles PyTorch models for Android (NNAPI/GPU/NPU) and iOS (Metal/CoreML). | PyTorch-native mobile deployment |
| Qualcomm AI Hub | ✓ | · | · | · | · | · | · | ✓ | · | · | · | · | · | · | · | · | Qualcomm's NPU compiler + model zoo. Snapdragon-only. Pre-quantized variants for Llama / Phi / Gemma / Qwen on Hexagon NPU. | Snapdragon NPU production deployment |
| ONNX Runtime Mobile | ✓ | · | · | · | · | · | ✓ | ✓ | · | · | · | · | · | · | ✓ | · | Mobile/edge variant of ONNX Runtime. The reference path for Snapdragon X / Lunar Lake / Ryzen AI on Windows + Copilot+ PC NPU. | Copilot+ PC NPU, cross-vendor Windows |
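Most of the server-style runtimes above (vLLM, SGLang, TGI, TabbyAPI, LM Studio, llama-cpp-python, Ollama) expose an OpenAI-compatible HTTP endpoint, so one client works against any of them. A minimal sketch, assuming a server is already listening locally; the port and model name are placeholders, so swap in whatever your runtime's launch output reports.

```python
# Minimal OpenAI-compatible client against a local runtime.
# Assumes a server (vLLM, SGLang, TabbyAPI, LM Studio, llama-cpp-python, ...)
# is already listening; the base_url port and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # local endpoint, not api.openai.com
    api_key="not-needed-locally",          # most local servers ignore the key
)

resp = client.chat.completions.create(
    model="local-model",                   # placeholder: HF repo id or loaded model name
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```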
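For vLLM specifically, the same engine behind the OpenAI-compatible server can be driven in-process for offline batched generation. A minimal sketch, assuming a Linux + CUDA box with `pip install vllm`; the model id is a placeholder for any Hugging Face checkpoint that fits your VRAM.

```python
# Offline batched generation with vLLM -- a sketch, not a tuned config.
# Assumes Linux + NVIDIA CUDA; the model id below is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="your-org/your-model")             # placeholder HF repo id
params = SamplingParams(temperature=0.7, max_tokens=64)

# Continuous batching happens inside the engine; just pass prompts as a list.
outputs = llm.generate(["Prompt one.", "Prompt two."], params)
for out in outputs:
    print(out.outputs[0].text)
```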
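On Apple Silicon, MLX-LM is driven from Python in a few lines. A sketch, assuming `pip install mlx-lm` on an M-series Mac; the repo id is a placeholder for any MLX-format checkpoint, and exact keyword arguments have shifted slightly across mlx-lm releases.

```python
# MLX-LM on an M-series Mac -- a sketch, assuming an MLX-format checkpoint
# on the Hugging Face Hub. The repo id is a placeholder.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/your-model-4bit")   # placeholder repo id

text = generate(
    model,
    tokenizer,
    prompt="Explain unified memory in one sentence.",
    max_tokens=64,
)
print(text)
```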
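And llama-cpp-python's pip-to-endpoint claim in practice: the same package gives in-process bindings plus an OpenAI-compatible server via the `[server]` extra. A sketch of the in-process path, assuming a GGUF file already on disk; the path and generation settings are placeholders.

```python
# In-process inference with llama-cpp-python -- a sketch, assuming a GGUF
# file already downloaded. Install with `pip install llama-cpp-python`;
# `pip install 'llama-cpp-python[server]'` adds the OpenAI-compatible server
# (`python -m llama_cpp.server --model ./models/your-model.gguf`).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/your-model.gguf",  # placeholder path to your quant
    n_gpu_layers=-1,                         # offload all layers if a GPU backend was compiled in
    n_ctx=4096,                              # context window; keep within the RAM/VRAM budget
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "One sentence on GGUF quantization."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```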
Going deeper
- Setup path-finder — answer 4 questions, get a recommended runtime + first commands.
- Linux local AI system guide — operator-grade Linux deployment paths.
- Multi-GPU buying + deployment guide — the runtime/topology cross-cut.
- Full tool catalog — operational reviews per runtime.