RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Maps
  4. /Inference runtimes (May 2026)
Ecosystem map · Updated May 6, 2026

The local AI inference runtime landscape

Six zones covering every runtime that hosts LLM weights and produces tokens — desktop locals, high-throughput servers, Apple Silicon, quantized engines, distributed systems, and the agent/runtime bridges that turn an engine into an application. Read /systems/distributed-inference for the architectural deep dive on the multi-machine end of the spectrum.

By Fredoline Eruo · Reviewed monthly
ℹWhat changed this month
  • Exo + Thunderbolt 5 RDMA went mainstream. macOS 26.2 RDMA over Thunderbolt 5 cut inter-Mac latency by ~99% on M4 Pro+ hardware. DeepSeek V3 671B running at 5.37 tok/s on 8x M4 Pro Mac Minis is now a credible personal-cluster benchmark, not a tech demo.
  • vLLM 0.17.1 (March 2026) shipped Model Runner V2 with up to 56% higher throughput on GB200. The production default keeps widening its lead on Hopper / Blackwell.
  • SGLang aggregate-cluster wins shipped. Cross-replica RadixAttention sync makes SGLang's architectural advantage compound at multi-node scale, not just per-replica. The vLLM↔SGLang choice is now meaningfully workload-shaped at every cluster size.
  • TGI continues declining. The 2023-2024 production default; in 2026, new deployments default to vLLM unless HuggingFace Hub integration matters specifically.
Zones
  1. Desktop / single-user runtimes
  2. High-throughput serving runtimes
  3. Apple Silicon runtimes
  4. Quantized inference runtimes
  5. Distributed serving systems
  6. Agent / runtime bridges

Desktop / single-user runtimes

The runtimes you install on a laptop and forget about. Ollama is the curated default; llama.cpp is the engine underneath; LM Studio is the GUI-first alternative; Llamafile is the zero-install single-binary path. The 90% case for individual developers — start here unless you have specific reasons to pick something else.

runnerOSS4.7/5

Ollama

★ 130k

The default first-pull tool for local AI. One-line model installs (`ollama run llama3.1`), an OpenAI-compatible HTTP API, good defaults out of the box. Built on llama.cpp.

runnerOSS4.6/5

llama.cpp

★ 90k

The bedrock of local LLM inference. Most other tools wrap or embed it. Maximum control, maximum platform support, sharpest learning curve.

gui4.5/5

LM Studio

Polished desktop GUI for local LLMs. Built-in HuggingFace search, OpenAI-compatible local server, side-by-side conversations.

runnerOSS4.4/5

Llamafile

★ 22k

Mozilla's single-binary llama.cpp distribution. Download one file, run on any OS without dependencies.

High-throughput serving runtimes

Production-scale GPU serving. vLLM is the ecosystem default; SGLang is the credible challenger on shared-prefix workloads; TensorRT-LLM is NVIDIA's first-party compiled engine for absolute lowest latency; TGI is the HuggingFace-tied option that vLLM ate the lunch of through 2024-2025; LocalAI is the multi-modal multiplexer. Pick by traffic shape and hardware commitment.

serverOSS4.8/5

vLLM

★ 50k

High-throughput inference engine with PagedAttention, continuous batching, and tensor + pipeline parallelism. The reference deployment runtime when you've outgrown llama.cpp / Ollama for production se

serverOSS

SGLang

★ 13k

Structured generation language + runtime for LLM programs. RadixAttention reuses KV cache across prompts with shared prefixes — significant throughput wins for agent workloads where many tool calls sh

serverOSS4.3/5

TensorRT-LLM

★ 12k

NVIDIA's first-party inference compiler. Generates optimized engines per model + GPU pair, with the lowest latency on NVIDIA hardware. The pick when you're committed to a single SKU and need the absol

serverOSS4.2/5

Text Generation Inference (TGI)

★ 10k

HuggingFace's production inference server. Slightly behind vLLM on raw throughput but tighter integration with the HF ecosystem.

serverOSS

LocalAI

★ 35k

OpenAI-API-compatible drop-in for self-hosted inference, with a multi-backend twist: the same endpoint can serve LLMs (llama.cpp / vLLM under the hood), embeddings, image gen (stable-diffusion.cpp), a

Apple Silicon runtimes

The Apple-Silicon-native path. MLX-LM is now competitive with llama.cpp Metal on M-series hardware, with stronger long-context performance. The 2026 unlock here was Thunderbolt 5 + macOS 26.2 RDMA, which reshapes what's possible for multi-Mac clusters.

runnerOSS4.5/5

MLX-LM

★ 4k

Apple's Metal-native ML framework's LLM runner. Now competitive with llama.cpp Metal on M-series silicon, with better long-context performance.

Quantized inference runtimes

The runtimes optimized around specific quantization formats. ExLlamaV2 dominates EXL2-format inference on consumer NVIDIA cards (single-card 4090 / 5090 throughput champion). TabbyAPI is its OpenAI-compatible HTTP wrapper. The pick when you've committed to EXL2 quants and want maximum tokens-per-second on one card.

runnerOSS4.4/5

ExLlamaV2

★ 5k

Hand-optimized inference for EXL2-quantized models. Fastest single-GPU runtime for the EXL2 quant format on Ada/Hopper hardware. Lower-level than llama.cpp; pairs with text-generation-webui + TabbyAPI

serverOSS

TabbyAPI

★ 2k

OpenAI-API frontend for ExLlamaV2. Wraps the EXL2 inference engine in a clean HTTP API, adds streaming, batching, and OAI-compatible chat templates. The default front-of-house when you've already comm

Distributed serving systems

The category nobody covers well online. Exo is the credible Apple-Silicon-cluster path (Thunderbolt 5 RDMA changed the game). Petals is the BitTorrent-style internet swarm (slow but possible for models you can't fit anywhere). Ray Serve is the K8s-grade orchestrator for multi-node vLLM/SGLang. Hyperspace is the consumer P2P entrant.

/systems/distributed-inference
serverOSS

Exo

★ 28k

Personal AI cluster software. Auto-discovers Apple Silicon devices on a LAN and shards a model across them via pipeline + tensor parallelism on top of MLX. The 2026 unlock: Thunderbolt 5 + macOS 26.2

serverOSS

Petals

★ 10k

BitTorrent-style decentralized LLM inference. Splits a model into transformer-block shards distributed across volunteer hosts on the public internet — one client runs the input/output layers locally a

orchestratorOSS

Ray Serve

★ 33k

Distributed model serving on top of Ray. Lets you stitch vLLM / SGLang / custom runtimes into a multi-replica, multi-model deployment with autoscaling, traffic splitting, and pipeline composition. The

serverOSS3.9/5

Hyperspace (P2P inference network)

★ 12k

Decentralized peer-to-peer AI inference network. 2.7M+ CLI downloads, 2M+ active nodes globally as of April 2026. Three-tier model routing (local registry → DHT → gossip broadcast) supports any GGUF m

Agent / runtime bridges

Not runtimes themselves — but the layer that turns a runtime into an actually-usable application. Open WebUI is the production-grade chat frontend; AnythingLLM is the RAG-workspace front door. Both speak any OpenAI-compatible runtime; both extend the runtime's reach into real workflows.

guiOSS4.4/5

AnythingLLM

★ 32k

Document-oriented LLM frontend with workspaces. Connects to Ollama, LM Studio, OpenAI, Anthropic, etc. Strong document RAG.

Ecosystem winners

The runtimes that compounded their ecosystem leads through the 2025-2026 cycle:

  • Ollama is now the default first-pull tool for every newcomer to local AI. The curated model library and zero-config setup beat every alternative on time-to-first-token.
  • vLLM remains the production-default GPU serving engine. PagedAttention turned KV-cache efficiency from a research footnote into a 5-24x throughput delta, and the project's discipline through the cycle (continuous batching, prefix caching, chunked prefill, multi-LoRA, speculative decoding) widens the moat.
  • llama.cpp is the bedrock most other runtimes sit on. Ollama wraps it; LM Studio bundles it; Llamafile ships it as one binary. Every quant kernel improvement propagates to all of them.
  • MLX-LM caught up to llama.cpp on Apple Silicon and surpassed it on long-context. The 2026 Thunderbolt 5 unlock makes MLX the natural winner of the Apple-cluster category as Exo expands.

Declining runtimes

Runtimes whose ecosystem position eroded through the cycle:

  • Text Generation Inference (TGI) was the production default in 2023-2024; vLLM ate that lunch through 2024-2025. New deployments default to vLLM unless HuggingFace Hub integration matters specifically.
  • FasterTransformer (NVIDIA's pre-TensorRT-LLM kernel library) faded through 2024 as TensorRT-LLM took its place. Largely a historical reference now.

Benchmark opportunities

The measurements that would let readers actually pick a runtime for their workload — and where our benchmark dataset plans to expand:

  • vLLM vs SGLang on identical hardware across 4 traffic shapes (diverse / shared-prefix / structured / agent-loop)
  • Single-node TP=4 throughput across vLLM / SGLang / TensorRT-LLM on 4x H100 with the same Llama-3.1-70B AWQ checkpoint
  • Apple Silicon: MLX-LM single-Mac vs Exo two-Mac cluster on DeepSeek V3 with Thunderbolt 5 RDMA
  • Petals public swarm Llama-3.1-70B latency distribution (TTFT and per-token) over a week of hourly probes
  • Ollama vs LM Studio vs llama.cpp server on identical consumer hardware (M3 Max 64GB, RTX 4090) — apples-to-apples for the desktop tier

How this map updates

This page reads its zones live from the catalog. New runtimes land in scripts/seed/tools.ts and show up here on the next deploy when their slug is referenced in a zone above. Editorial framing — zone titles, blurbs, “what changed this month,” ecosystem winners / declining runtimes — is hand-written and refreshed on the first business day of each month. Inclusion bar: a runtime has to be one we've actually used and can write operator notes about; we don't list everything that lands on GitHub.

Going deeper

  • What distributed inference actually is — architectural depth on the multi-machine end of the spectrum.
  • vLLM operational review and SGLang operational review — operator-grade detail on the two production defaults.
  • Local AI agent ecosystem — where these runtimes plug into the broader agent map.
  • MCP ecosystem — the protocol layer most production runtimes expose to clients.