Ecosystem map · Updated May 6, 2026

The local AI inference runtime landscape

Six zones covering the runtimes we track that host LLM weights and produce tokens: desktop locals, high-throughput servers, Apple Silicon, quantized engines, distributed systems, and the agent/runtime bridges that turn an engine into an application. Read /systems/distributed-inference for the architectural deep dive on the multi-machine end of the spectrum.

By Fredoline Eruo · Reviewed monthly

Desktop / single-user runtimes

The runtimes you install on a laptop and forget about. Ollama is the curated default; llama.cpp is the engine underneath; LM Studio is the GUI-first alternative; Llamafile is the zero-install single-binary path. The 90% case for individual developers — start here unless you have specific reasons to pick something else.
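
What zero-config means in practice: once Ollama is running, its local HTTP API answers on port 11434. A minimal sketch in Python, standard library only; the model tag is a placeholder for whatever you've pulled.

    # Probe a local Ollama install via its native HTTP API.
    # Assumes Ollama is running on its default port (11434) and a model
    # tag like "llama3.1" has already been pulled; adjust to taste.
    import json
    import urllib.request

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({
            "model": "llama3.1",   # placeholder tag; use any pulled model
            "prompt": "Why is the sky blue?",
            "stream": False,       # one JSON response instead of a stream
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])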

High-throughput serving runtimes

Production-scale GPU serving. vLLM is the ecosystem default; SGLang is the credible challenger on shared-prefix workloads; TensorRT-LLM is NVIDIA's first-party compiled engine for absolute lowest latency; TGI is the HuggingFace-tied option that lost the production-default spot to vLLM through 2024-2025; LocalAI is the multi-modal multiplexer. Pick by traffic shape and hardware commitment.
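
For a feel of the shape of vLLM, here is a minimal sketch of its offline batch API, the same continuous-batching engine behind its OpenAI-compatible server. The checkpoint name is an example; any HF-format model that fits your GPUs works.

    # Sketch of vLLM's offline batch API.
    from vllm import LLM, SamplingParams

    # Example checkpoint; add tensor_parallel_size=4 for multi-GPU.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    params = SamplingParams(temperature=0.7, max_tokens=128)

    # vLLM batches these prompts internally; throughput scales with batch size.
    outputs = llm.generate(
        ["Explain KV-cache paging.", "What is continuous batching?"],
        params,
    )
    for out in outputs:
        print(out.outputs[0].text)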

Apple Silicon runtimes

The Apple-Silicon-native path. MLX-LM is now competitive with llama.cpp Metal on M-series hardware, with stronger long-context performance. The 2026 unlock here was Thunderbolt 5 + macOS 26.2 RDMA, which reshapes what's possible for multi-Mac clusters.
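
A minimal sketch of MLX-LM's Python API on Apple Silicon. The mlx-community model path is an assumption (any MLX-converted checkpoint works), and keyword names reflect recent mlx-lm releases; check yours.

    # Load an MLX-converted checkpoint and generate on-device.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
    text = generate(model, tokenizer,
                    prompt="Summarize PagedAttention in one line.",
                    max_tokens=64)
    print(text)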

Quantized inference runtimes

The runtimes optimized around specific quantization formats. ExLlamaV2 dominates EXL2-format inference on consumer NVIDIA cards (single-card 4090 / 5090 throughput champion). TabbyAPI is its OpenAI-compatible HTTP wrapper. The pick when you've committed to EXL2 quants and want maximum tokens-per-second on one card.

Distributed serving systems

The category nobody covers well online. Exo is the credible Apple-Silicon-cluster path (Thunderbolt 5 RDMA changed the game). Petals is the BitTorrent-style internet swarm (slow but possible for models you can't fit anywhere). Ray Serve is the K8s-grade orchestrator for multi-node vLLM/SGLang. Hyperspace is the consumer P2P entrant. See /systems/distributed-inference for the architecture deep dive.
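
For a sense of the orchestration layer, here is the minimal Ray Serve shape: a deployment that Ray replicates across the cluster behind one HTTP endpoint. The engine wiring is deliberately elided; this is a skeleton under stated assumptions, not a production config.

    # Minimal Ray Serve deployment skeleton.
    from starlette.requests import Request
    from ray import serve

    @serve.deployment(num_replicas=2)  # Ray schedules replicas across nodes
    class Generate:
        def __init__(self):
            # In a real setup, construct the vLLM/SGLang engine here.
            pass

        async def __call__(self, request: Request) -> dict:
            payload = await request.json()
            # Stand-in for engine output.
            return {"echo": payload.get("prompt", "")}

    app = Generate.bind()
    serve.run(app)  # starts HTTP serving on the local Ray cluster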

Agent / runtime bridges

Not runtimes themselves, but the layer that turns a runtime into a usable application. Open WebUI is the production-grade chat frontend; AnythingLLM is the RAG-workspace front door. Both speak the OpenAI-compatible API, so either can sit in front of any runtime above and extend its reach into real workflows.
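
What "speaks the OpenAI-compatible API" buys in concrete terms: the same client code fronts Ollama, vLLM, SGLang, TabbyAPI, or LocalAI by swapping base_url. A hedged sketch, assuming a local Ollama; the URL and model name are placeholders.

    # Same client code, any OpenAI-compatible local runtime.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1",  # vLLM default: :8000/v1
                    api_key="not-needed-locally")          # local servers ignore the key
    reply = client.chat.completions.create(
        model="llama3.1",  # placeholder; use whatever the runtime serves
        messages=[{"role": "user", "content": "Hello from a bridge frontend."}],
    )
    print(reply.choices[0].message.content)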

Ecosystem winners

The runtimes that compounded their ecosystem leads through the 2025-2026 cycle:

  • Ollama is now the default first-pull tool for every newcomer to local AI. The curated model library and zero-config setup beat every alternative on time-to-first-token.
  • vLLM remains the production-default GPU serving engine. PagedAttention turned KV-cache efficiency from a research footnote into a 5-24x throughput delta, and the project's discipline through the cycle (continuous batching, prefix caching, chunked prefill, multi-LoRA, speculative decoding) widens the moat.
  • llama.cpp is the bedrock most other runtimes sit on. Ollama wraps it; LM Studio bundles it; Llamafile ships it as one binary. Every quant kernel improvement propagates to all of them.
  • MLX-LM caught up to llama.cpp on Apple Silicon and surpassed it on long-context. The 2026 Thunderbolt 5 unlock makes MLX the natural winner of the Apple-cluster category as Exo expands.

Declining runtimes

Runtimes whose ecosystem position eroded through the cycle:

  • Text Generation Inference (TGI) was the production default in 2023-2024; vLLM ate that lunch through 2024-2025. New deployments default to vLLM unless HuggingFace Hub integration matters specifically.
  • FasterTransformer (NVIDIA's pre-TensorRT-LLM kernel library) faded through 2024 as TensorRT-LLM took its place. Largely a historical reference now.

Benchmark opportunities

The measurements that would let readers actually pick a runtime for their workload — and where our benchmark dataset plans to expand:

  • vLLM vs SGLang on identical hardware across 4 traffic shapes (diverse / shared-prefix / structured / agent-loop)
  • Single-node TP=4 throughput across vLLM / SGLang / TensorRT-LLM on 4x H100 with the same Llama-3.1-70B AWQ checkpoint
  • Apple Silicon: MLX-LM single-Mac vs Exo two-Mac cluster on DeepSeek V3 with Thunderbolt 5 RDMA
  • Petals public swarm Llama-3.1-70B latency distribution (TTFT and per-token) over a week of hourly probes (see the probe sketch after this list)
  • Ollama vs LM Studio vs llama.cpp server on identical consumer hardware (M3 Max 64GB, RTX 4090) — apples-to-apples for the desktop tier
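
As referenced above, a minimal sketch of the TTFT / per-token probe against any OpenAI-compatible streaming endpoint. The base_url and model name are placeholders for the runtime under test.

    # Time-to-first-token and mean inter-token latency for one request.
    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    start = time.perf_counter()
    ttft, token_times = None, []
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder
        messages=[{"role": "user", "content": "Count to twenty."}],
        stream=True,
    )
    for chunk in stream:
        now = time.perf_counter()
        if chunk.choices and chunk.choices[0].delta.content:
            if ttft is None:
                ttft = now - start  # time to first generated token
            token_times.append(now)

    per_token = ((token_times[-1] - token_times[0]) / (len(token_times) - 1)
                 if len(token_times) > 1 else float("nan"))
    print(f"TTFT: {ttft:.3f}s, mean inter-token: {per_token * 1000:.1f}ms")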

How this map updates

This page reads its zones live from the catalog. New runtimes land in scripts/seed/tools.ts and show up here on the next deploy when their slug is referenced in a zone above. Editorial framing — zone titles, blurbs, “what changed this month,” ecosystem winners / declining runtimes — is hand-written and refreshed on the first business day of each month. Inclusion bar: a runtime has to be one we've actually used and can write operator notes about; we don't list everything that lands on GitHub.

Going deeper