Local AI runtime compatibility matrix
What actually runs on your hardware. The cross-OS, cross-backend matrix for 25 runtimes — vLLM, llama.cpp, Ollama, MLX, ONNX Runtime, IPEX-LLM, ExLlamaV2, and the rest. Every row carries an operator caveat, not just a checkmark.
Columns group into OS (Win / macOS / Linux), backend (CUDA / ROCm / Apple / Intel), Mobile, Docker, quant formats (GGUF / AWQ / GPTQ / FP8), and fit (Beginner / Production / Distributed). Legend: ✓ supported, ◐ partial, · no, — not applicable.

| Runtime | Win | macOS | Linux | CUDA | ROCm | Apple | Intel | Mobile | Docker | GGUF | AWQ | GPTQ | FP8 | Beginner | Production | Distributed | Caveat | Best for |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ollama | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ◐ | · | ✓ | ✓ | · | · | · | ✓ | · | · | Single-user, sequential decode. Concurrency tops out at 4-8 requests; no continuous batching. | first install, model swapping, hobby use |
| LM Studio | ✓ | ✓ | ✓ | ✓ | ◐ | ✓ | · | · | · | ✓ | · | · | · | ✓ | · | · | GUI-first with built-in Hugging Face browser. Not a server; the OpenAI-compatible endpoint binds to localhost only by default. Client sketch below the table. | non-CLI users, model exploration |
| vLLM | · | · | ✓ | ✓ | ◐ | · | · | · | ✓ | · | ✓ | ✓ | ✓ | · | ✓ | ✓ | Linux + NVIDIA-only in practice; the AMD ROCm path exists but lags. Continuous batching + PagedAttention is the production reference. Offline batching sketch below the table. | production serving, multi-tenant inference |
| SGLang | · | · | ✓ | ✓ | ◐ | · | · | · | ✓ | · | ✓ | ✓ | ✓ | · | ✓ | ✓ | RadixAttention prefix-cache compounds for agent loops with stable system prompts. Beats vLLM at high prefix-cache hit rates. | agent serving, multi-tenant with stable prompts |
| llama.cpp | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ◐ | ◐ | ✓ | ✓ | · | · | · | · | · | ◐ | The most portable runtime. CPU + every GPU backend. Layer-split for asymmetric multi-GPU. Lags vLLM at concurrency. | cross-platform deployment, asymmetric GPU pairs |
| ExLlamaV2 | ✓ | · | ✓ | ✓ | ✓ | · | · | · | ✓ | · | · | ✓ | · | · | · | · | EXL2 quants are sharper than GGUF at the same size. Single-stream throughput leader on consumer NVIDIA. NVIDIA + AMD; no Apple. | consumer NVIDIA, max single-stream tok/s |
| TabbyAPI | ✓ | · | ✓ | ✓ | ✓ | · | · | · | ✓ | · | · | ✓ | · | · | · | · | OpenAI-compatible HTTP server in front of ExLlamaV2. The production-style wrapper for EXL2 deployments. | EXL2-based serving with API |
| MLX-LM | · | ✓ | · | · | · | ✓ | · | · | · | · | · | · | · | · | · | · | Apple first-party. MLX-4bit / MLX-8bit quants only. Runs on M-series unified memory; no Intel Mac path. Sketch below the table. | Apple Silicon Macs, unified memory deployments |
| MLX Swift | · | ✓ | · | · | · | ✓ | · | ✓ | · | · | · | · | · | · | ✓ | · | iOS / iPadOS / macOS app-bundled inference. Production-grade for App Store deployments. Same checkpoints as desktop MLX. | iOS app-bundled local LLM inference |
| MLC LLM | ✓ | ✓ | ✓ | ✓ | ◐ | ✓ | ◐ | ✓ | ✓ | · | · | · | · | · | · | · | TVM-based; compiles models for any GPU with Vulkan/Metal/WebGPU. Cross-platform mobile reference. Compile-time overhead. | cross-platform mobile + WebGPU |
| TensorRT-LLM | · | · | ✓ | ✓ | · | · | · | · | ✓ | · | ✓ | · | ✓ | · | ✓ | ✓ | Highest peak throughput on H100 / H200 + the FP8 transformer engine. Recompile-per-config friction is real. | datacenter NVIDIA at peak throughput |
| Text Generation Inference (TGI) | · | · | ✓ | ✓ | ✓ | · | ◐ | · | ✓ | · | ✓ | ✓ | ✓ | · | ✓ | ✓ | Hugging Face's serving runtime. Inference Endpoints + TGI is the HF-native production path; community uptake trails vLLM. | HF Inference Endpoints, HF-native deployments |
| Ray Serve | ◐ | ✓ | ✓ | ✓ | ◐ | ✓ | ◐ | · | ✓ | — | — | — | — | · | ✓ | ✓ | Orchestration layer wrapping vLLM / SGLang / TGI replicas. Scales request throughput, not single-model size. | multi-replica serving + autoscaling |
| Exo | · | ✓ | ✓ | ◐ | ◐ | ✓ | · | · | · | ✓ | · | · | · | · | · | ✓ | Multi-Mac / mixed-device clustering over Thunderbolt + LAN. Layer-shards a single model across machines. | multi-Mac clusters, >192GB unified memory targets |
| Petals | · | ✓ | ✓ | ✓ | ◐ | ✓ | · | · | ✓ | · | · | · | · | · | · | ✓ | BitTorrent-style distributed inference. Public-swarm mode is research-tier; private-swarm production deployments exist but are rare. | research, private compute-pooling |
| ONNX Runtime | ✓ | ✓ | ✓ | ✓ | ◐ | ✓ | ✓ | ✓ | ✓ | · | · | · | ◐ | · | ✓ | · | Microsoft's cross-platform inference runtime. Strongest on classical models; LLM-specific optimizations trail vLLM / llama.cpp. | cross-OS / cross-backend production with one runtime |
| Intel OpenVINO | ✓ | ✓ | ✓ | · | · | · | ✓ | · | ✓ | · | · | · | · | · | ✓ | · | Intel-only — Arc GPU + Lunar Lake NPU + AVX-512 CPU paths. The reference runtime for Intel hardware. | Intel Arc + Lunar Lake / Meteor Lake NPU |
| IPEX-LLM | ✓ | · | ✓ | · | · | · | ✓ | ◐ | ✓ | ✓ | ✓ | ✓ | · | · | · | · | Intel's PyTorch-native LLM runtime. The first-class path for LLMs on Intel Arc A770/B580 and the Lunar Lake NPU; also powers the IPEX-accelerated Ollama build. | Intel Arc GPU LLM inference |
| CTranslate2 | ✓ | ✓ | ✓ | ✓ | · | ✓ | ◐ | · | ✓ | · | · | · | · | · | ✓ | · | Specialized transformer runtime. Whisper (faster-whisper) reference. Encoder-decoder optimization that LLM runtimes don't prioritize. | Whisper, NMT, encoder-decoder inference |
| DirectML | ✓ | · | · | ✓ | ✓ | · | ✓ | · | · | · | · | · | ◐ | · | ✓ | · | Windows DirectX 12 inference backend. Vendor-agnostic on Windows: AMD / Intel / Qualcomm GPU + NPU without ROCm or vendor SDKs. | Windows multi-vendor GPU + NPU inference |
| llama-cpp-python | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ◐ | · | ✓ | ✓ | · | · | · | ✓ | · | · | Python bindings + OpenAI-compatible server. The fastest path from `pip install` to a working endpoint. Backend pin via wheel choice. Sketch below the table. | Python-first integration, scripting, prototyping |
| Aphrodite Engine | ◐ | · | ✓ | ✓ | ✓ | · | · | · | ✓ | · | ✓ | ✓ | ✓ | · | · | ✓ | vLLM fork specialized for creative-writing / role-play. Adds DRY / XTC / dynatemp samplers vLLM doesn't ship. | SillyTavern + role-play workloads |
| ExecuTorch | · | ✓ | ✓ | · | · | ✓ | · | ✓ | · | · | · | · | · | · | · | · | PyTorch's mobile/edge inference runtime. Compiles PyTorch models for Android (NNAPI/GPU/NPU) and iOS (Metal/CoreML). | PyTorch-native mobile deployment |
| Qualcomm AI Hub | ✓ | · | · | · | · | · | · | ✓ | · | · | · | · | · | · | · | · | Qualcomm's NPU compiler + model zoo. Snapdragon-only. Pre-quantized variants for Llama / Phi / Gemma / Qwen on Hexagon NPU. | Snapdragon NPU production deployment |
| ONNX Runtime Mobile | ✓ | · | · | · | · | · | ✓ | ✓ | · | · | · | · | · | · | ✓ | · | Mobile/edge variant of ONNX Runtime. The reference path for Snapdragon X / Lunar Lake / Ryzen AI on Windows + Copilot+ PC NPU. | Copilot+ PC NPU, cross-vendor Windows |
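Most of the server-style runtimes above (vLLM, SGLang, TGI, TabbyAPI, LM Studio, llama-cpp-python, Ollama) expose an OpenAI-compatible HTTP endpoint, so one client works against any of them. A minimal sketch, assuming a server is already listening locally; the port and model name are placeholders, so swap in whatever your runtime's launch output reports.

```python
# Minimal OpenAI-compatible client against a local runtime.
# Assumes a server (vLLM, SGLang, TabbyAPI, LM Studio, llama-cpp-python, ...)
# is already listening; the base_url port and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # local endpoint, not api.openai.com
    api_key="not-needed-locally",          # most local servers ignore the key
)

resp = client.chat.completions.create(
    model="local-model",                   # placeholder: HF repo id or loaded model name
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```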
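For vLLM specifically, the same engine behind the OpenAI-compatible server can be driven in-process for offline batched generation. A minimal sketch, assuming a Linux + CUDA box with `pip install vllm`; the model id is a placeholder for any Hugging Face checkpoint that fits your VRAM.

```python
# Offline batched generation with vLLM -- a sketch, not a tuned config.
# Assumes Linux + NVIDIA CUDA; the model id below is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="your-org/your-model")             # placeholder HF repo id
params = SamplingParams(temperature=0.7, max_tokens=64)

# Continuous batching happens inside the engine; just pass prompts as a list.
outputs = llm.generate(["Prompt one.", "Prompt two."], params)
for out in outputs:
    print(out.outputs[0].text)
```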
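On Apple Silicon, MLX-LM is driven from Python in a few lines. A sketch, assuming `pip install mlx-lm` on an M-series Mac; the repo id is a placeholder for any MLX-format checkpoint, and exact keyword arguments have shifted slightly across mlx-lm releases.

```python
# MLX-LM on an M-series Mac -- a sketch, assuming an MLX-format checkpoint
# on the Hugging Face Hub. The repo id is a placeholder.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/your-model-4bit")   # placeholder repo id

text = generate(
    model,
    tokenizer,
    prompt="Explain unified memory in one sentence.",
    max_tokens=64,
)
print(text)
```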
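And llama-cpp-python's pip-to-endpoint claim in practice: the same package gives in-process bindings plus an OpenAI-compatible server via the `[server]` extra. A sketch of the in-process path, assuming a GGUF file already on disk; the path and generation settings are placeholders.

```python
# In-process inference with llama-cpp-python -- a sketch, assuming a GGUF
# file already downloaded. Install with `pip install llama-cpp-python`;
# `pip install 'llama-cpp-python[server]'` adds the OpenAI-compatible server
# (`python -m llama_cpp.server --model ./models/your-model.gguf`).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/your-model.gguf",  # placeholder path to your quant
    n_gpu_layers=-1,                         # offload all layers if a GPU backend was compiled in
    n_ctx=4096,                              # context window; keep within the RAM/VRAM budget
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "One sentence on GGUF quantization."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```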
Going deeper
- Setup path-finder — answer 4 questions, get a recommended runtime + first commands.
- Linux local AI system guide — operator-grade Linux deployment paths.
- Multi-GPU buying + deployment guide — the runtime/topology cross-cut.
- Full tool catalog — operational reviews per runtime.