The local AI inference runtime landscape
Six zones covering every runtime that hosts LLM weights and produces tokens — desktop locals, high-throughput servers, Apple Silicon, quantized engines, distributed systems, and the agent/runtime bridges that turn an engine into an application. Read /systems/distributed-inference for the architectural deep dive on the multi-machine end of the spectrum.
Desktop / single-user runtimes
The runtimes you install on a laptop and forget about. Ollama is the curated default; llama.cpp is the engine underneath; LM Studio is the GUI-first alternative; Llamafile is the zero-install single-binary path. The 90% case for individual developers — start here unless you have specific reasons to pick something else.
Ollama
The default first-pull tool for local AI. One-line model installs (`ollama run llama3.1`), an OpenAI-compatible HTTP API, good defaults out of the box. Built on llama.cpp.
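A minimal sketch of talking to that API from Python, assuming Ollama's default port (11434), a model you've already pulled, and the `openai` client package as the consumer; none of those choices are mandated by Ollama itself.

```python
# Sketch: chat with an Ollama-served model over its OpenAI-compatible endpoint.
# Assumes `ollama run llama3.1` (or `ollama pull llama3.1`) has already been done
# and the server is on its default port, 11434. The API key is ignored locally.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
)
print(resp.choices[0].message.content)
```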
llama.cpp
The bedrock of local LLM inference. Most other tools wrap or embed it. Maximum control, maximum platform support, sharpest learning curve.
LM Studio
Polished desktop GUI for local LLMs. Built-in HuggingFace search, OpenAI-compatible local server, side-by-side conversations.
Llamafile
Mozilla's single-binary llama.cpp distribution. Download one file, run on any OS without dependencies.
High-throughput serving runtimes
Production-scale GPU serving. vLLM is the ecosystem default; SGLang is the credible challenger on shared-prefix workloads; TensorRT-LLM is NVIDIA's first-party compiled engine for absolute lowest latency; TGI is the HuggingFace-tied option whose lunch vLLM ate through 2024-2025; LocalAI is the multi-modal multiplexer. Pick by traffic shape and hardware commitment.
vLLM
High-throughput inference engine with PagedAttention, continuous batching, and tensor + pipeline parallelism. The reference deployment runtime when you've outgrown llama.cpp / Ollama for production serving.
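A short sketch of the offline Python API under illustrative settings; the model name and tensor_parallel_size are placeholders, and the same PagedAttention / continuous-batching machinery runs when you launch the OpenAI-compatible server instead.

```python
# Sketch of vLLM's offline API. Continuous batching and PagedAttention are
# handled by the engine; tensor_parallel_size shards the model across GPUs.
# Model name and TP degree here are illustrative, not recommendations.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)
params = SamplingParams(temperature=0.7, max_tokens=128)

# A list of prompts is batched together automatically by the engine.
outputs = llm.generate(
    ["What is speculative decoding?", "Explain KV-cache paging."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```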
SGLang
Structured generation language + runtime for LLM programs. RadixAttention reuses KV cache across prompts with shared prefixes — significant throughput wins for agent workloads where many tool calls share the same prompt prefix.
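To make the shared-prefix point concrete, here is a sketch of the traffic pattern RadixAttention rewards: several requests repeating one long system prompt against a local SGLang server's OpenAI-compatible endpoint. The port (30000) and the model name are assumptions that depend on how the server was launched.

```python
# Sketch: repeated long system prompt + short user turns, the shape where
# RadixAttention's prefix reuse pays off. Endpoint port and model name are
# placeholders for however your SGLang server was started.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")
SYSTEM = ("You are a tool-calling agent. Tools: search(query), read(url), "
          "calc(expr). Think step by step and call one tool per turn. ") * 40

for user_turn in ["find the cheapest flight", "read the top result", "total the fares"]:
    resp = client.chat.completions.create(
        model="default",  # placeholder; use the model the server actually loaded
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": user_turn}],
    )
    print(resp.choices[0].message.content[:80])
```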
TensorRT-LLM
NVIDIA's first-party inference compiler. Generates optimized engines per model + GPU pair, with the lowest latency on NVIDIA hardware. The pick when you're committed to a single SKU and need the absolute lowest latency.
Text Generation Inference (TGI)
HuggingFace's production inference server. Slightly behind vLLM on raw throughput but tighter integration with the HF ecosystem.
LocalAI
OpenAI-API-compatible drop-in for self-hosted inference, with a multi-backend twist: the same endpoint can serve LLMs (llama.cpp / vLLM under the hood), embeddings, image gen (stable-diffusion.cpp), and more.
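A sketch of what the multiplexing looks like from the client side: chat, embeddings, and image generation all hang off the same base URL. The port (8080) and the model names are assumptions; they depend entirely on which backends your LocalAI config loads.

```python
# Sketch: one OpenAI-compatible base URL, three modalities.
# Port and model names are placeholders tied to your LocalAI configuration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

chat = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "One sentence on quantization."}],
)
emb = client.embeddings.create(model="text-embeddings", input="quantization tradeoffs")
img = client.images.generate(model="stablediffusion", prompt="a GPU rack, watercolor", n=1)

print(chat.choices[0].message.content)
print(len(emb.data[0].embedding), "dims;", img.data[0].url)
```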
Apple Silicon runtimes
The Apple-Silicon-native path. MLX-LM is now competitive with llama.cpp Metal on M-series hardware, with stronger long-context performance. The 2026 unlock here was Thunderbolt 5 + macOS 26.2 RDMA, which reshapes what's possible for multi-Mac clusters.
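For orientation, a minimal MLX-LM sketch; the model id is illustrative (any MLX-format quant works) and exact argument names can shift between mlx-lm releases.

```python
# Sketch: load an MLX-format quantized model and generate on-device.
# Model id is illustrative; argument names may differ slightly by mlx-lm version.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
text = generate(
    model,
    tokenizer,
    prompt="Why does unified memory help local inference?",
    max_tokens=200,
)
print(text)
```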
Quantized inference runtimes
The runtimes optimized around specific quantization formats. ExLlamaV2 dominates EXL2-format inference on consumer NVIDIA cards (single-card 4090 / 5090 throughput champion). TabbyAPI is its OpenAI-compatible HTTP wrapper. The pick when you've committed to EXL2 quants and want maximum tokens-per-second on one card.
ExLlamaV2
Hand-optimized inference for EXL2-quantized models. Fastest single-GPU runtime for the EXL2 quant format on Ada/Hopper hardware. Lower-level than llama.cpp; pairs with text-generation-webui + TabbyAPI for a usable front end.
TabbyAPI
OpenAI-API frontend for ExLlamaV2. Wraps the EXL2 inference engine in a clean HTTP API, adds streaming, batching, and OAI-compatible chat templates. The default front-of-house when you've already committed to EXL2 quants.
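From the client side it looks like any other OpenAI-compatible server; a streaming sketch follows, where the port (5000) and the API key are assumptions to replace with whatever your TabbyAPI config generated.

```python
# Sketch: stream tokens from an EXL2 model served by TabbyAPI over ExLlamaV2.
# Port, API key, and the model name are placeholders for your own setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="your-tabby-api-key")

stream = client.chat.completions.create(
    model="Llama-3.1-8B-exl2-4.0bpw",  # whichever EXL2 quant you loaded
    messages=[{"role": "user", "content": "Stream a haiku about VRAM."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```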
Distributed serving systems
The category nobody covers well online. Exo is the credible Apple-Silicon-cluster path (Thunderbolt 5 RDMA changed the game). Petals is the BitTorrent-style internet swarm (slow but possible for models you can't fit anywhere). Ray Serve is the K8s-grade orchestrator for multi-node vLLM/SGLang. Hyperspace is the consumer P2P entrant. See /systems/distributed-inference for the architecture deep dive.
Exo
Personal AI cluster software. Auto-discovers Apple Silicon devices on a LAN and shards a model across them via pipeline + tensor parallelism on top of MLX. The 2026 unlock: Thunderbolt 5 + macOS 26.2 RDMA, which reshapes what's possible for multi-Mac clusters.
Petals
BitTorrent-style decentralized LLM inference. Splits a model into transformer-block shards distributed across volunteer hosts on the public internet — one client runs the input/output layers locally and streams activations through the remote blocks.
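The client-side shape, sketched below: the tokenizer, embeddings, and LM head stay local while the transformer blocks come from the swarm. The model id is illustrative and has to be one the public swarm actually hosts.

```python
# Sketch of a Petals client: local input/output layers, remote transformer blocks.
# The model id is a placeholder; it must be hosted by the swarm you join.
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoDistributedModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Distributed inference works by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0]))
```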
Ray Serve
Distributed model serving on top of Ray. Lets you stitch vLLM / SGLang / custom runtimes into a multi-replica, multi-model deployment with autoscaling, traffic splitting, and pipeline composition. The pick for K8s-grade orchestration once you're past a single node.
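One way the composition shows up in practice, sketched under assumptions: a small gateway deployment that proxies to a backing OpenAI-compatible engine (vLLM, SGLang, whatever you run). The replica count, the upstream URL, and the httpx dependency are illustrative choices, not Ray Serve requirements.

```python
# Sketch: a replicated Ray Serve gateway in front of an OpenAI-compatible engine.
# Upstream URL, replica count, and httpx are assumptions for illustration only.
import httpx
from ray import serve


@serve.deployment(num_replicas=2)
class ChatGateway:
    def __init__(self, upstream: str = "http://localhost:8000/v1"):
        self.upstream = upstream

    async def __call__(self, request):
        # Forward the incoming JSON body to the backing engine and relay its reply.
        body = await request.json()
        async with httpx.AsyncClient(timeout=120) as client:
            r = await client.post(f"{self.upstream}/chat/completions", json=body)
        return r.json()


serve.run(ChatGateway.bind(), route_prefix="/chat")
```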
Hyperspace (P2P inference network)
Decentralized peer-to-peer AI inference network. 2.7M+ CLI downloads, 2M+ active nodes globally as of April 2026. Three-tier model routing (local registry → DHT → gossip broadcast) supports any GGUF model.
Agent / runtime bridges
Not runtimes themselves — but the layer that turns a runtime into an actually-usable application. Open WebUI is the production-grade chat frontend; AnythingLLM is the RAG-workspace front door. Both speak any OpenAI-compatible runtime; both extend the runtime's reach into real workflows.
Ecosystem winners
The runtimes that compounded their ecosystem leads through the 2025-2026 cycle:
- Ollama is now the default first-pull tool for every newcomer to local AI. The curated model library and zero-config setup beat every alternative on time-to-first-token.
- vLLM remains the production-default GPU serving engine. PagedAttention turned KV-cache efficiency from a research footnote into a 5-24x throughput delta, and the project's discipline through the cycle (continuous batching, prefix caching, chunked prefill, multi-LoRA, speculative decoding) widens the moat.
- llama.cpp is the bedrock most other runtimes sit on. Ollama wraps it; LM Studio bundles it; Llamafile ships it as one binary. Every quant kernel improvement propagates to all of them.
- MLX-LM caught up to llama.cpp on Apple Silicon and surpassed it on long-context. The 2026 Thunderbolt 5 unlock makes MLX the natural winner of the Apple-cluster category as Exo expands.
Declining runtimes
Runtimes whose ecosystem position eroded through the cycle:
- Text Generation Inference (TGI) was the production default in 2023-2024; vLLM ate that lunch through 2024-2025. New deployments default to vLLM unless HuggingFace Hub integration matters specifically.
- FasterTransformer (NVIDIA's pre-TensorRT-LLM kernel library) faded through 2024 as TensorRT-LLM took its place. Largely a historical reference now.
Benchmark opportunities
The measurements that would let readers actually pick a runtime for their workload — and where our benchmark dataset plans to expand (a minimal latency-probe sketch follows the list):
- vLLM vs SGLang on identical hardware across 4 traffic shapes (diverse / shared-prefix / structured / agent-loop)
- Single-node TP=4 throughput across vLLM / SGLang / TensorRT-LLM on 4x H100 with the same Llama-3.1-70B AWQ checkpoint
- Apple Silicon: MLX-LM single-Mac vs Exo two-Mac cluster on DeepSeek V3 with Thunderbolt 5 RDMA
- Petals public swarm Llama-3.1-70B latency distribution (TTFT and per-token) over a week of hourly probes
- Ollama vs LM Studio vs llama.cpp server on identical consumer hardware (M3 Max 64GB, RTX 4090) — apples-to-apples for the desktop tier
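As a sense of what those runs look like mechanically, a minimal probe sketch: stream one completion from any OpenAI-compatible runtime and split the timing into first chunk versus the rest. The endpoint and model name are placeholders, and chunk counts only approximate token counts.

```python
# Minimal latency probe: time-to-first-token and post-TTFT chunk rate against
# any OpenAI-compatible endpoint. Base URL and model are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

start = time.perf_counter()
first, chunks = None, 0
stream = client.chat.completions.create(
    model="placeholder-model",
    messages=[{"role": "user", "content": "Write 200 words about KV caches."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1
        if first is None:
            first = time.perf_counter() - start  # time to first content chunk
total = time.perf_counter() - start
first = first if first is not None else total  # guard: empty stream
rate = chunks / max(total - first, 1e-9)
print(f"TTFT {first:.2f}s, {rate:.1f} chunks/s after the first token")
```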
How this map updates
This page reads its zones live from the catalog. New runtimes land in scripts/seed/tools.ts and show up here on the next deploy when their slug is referenced in a zone above. Editorial framing — zone titles, blurbs, “what changed this month,” ecosystem winners / declining runtimes — is hand-written and refreshed on the first business day of each month. Inclusion bar: a runtime has to be one we've actually used and can write operator notes about; we don't list everything that lands on GitHub.
Going deeper
- What distributed inference actually is — architectural depth on the multi-machine end of the spectrum.
- vLLM operational review and SGLang operational review — operator-grade detail on the two production defaults.
- Local AI agent ecosystem — where these runtimes plug into the broader agent map.
- MCP ecosystem — the protocol layer most production runtimes expose to clients.