95 runtimes reviewed. Runners, GUIs, and servers for every workflow.
TurboVec is an open-source, **local-first vector index** (Rust core + Python bindings) by Ryan Codrai, MIT-licensed, built on Google Research's **TurboQuant** quantizer (presented at ICLR 2026). Its pitch for local AI: f
Open-source ChatGPT clone with multi-provider support (OpenAI, Anthropic, local LLMs via OpenAI-compatible APIs). The most popular self-hosted ChatGPT-shaped frontend. Strong multi-user + RAG + plugin support; pairs well
Document-oriented LLM frontend with workspaces. Connects to Ollama, LM Studio, OpenAI, Anthropic, etc. Strong document RAG.
Structured generation language + runtime for LLM programs. RadixAttention reuses KV cache across prompts with shared prefixes — significant throughput wins for agent workloads where many tool calls share system prompts.
High-throughput inference engine with PagedAttention, continuous batching, and tensor + pipeline parallelism. The reference deployment runtime when you've outgrown llama.cpp / Ollama for production serving. Backed by Any
Intel's inference toolkit. The first-class path for Intel Arc GPUs, Intel NPUs (Lunar Lake / Meteor Lake), and CPU-optimized inference on x86. Ships pre-quantized model variants tuned for Intel hardware via the OpenVINO
Microsoft's cross-platform inference runtime for ONNX models. The reference path when you need a single runtime that targets CUDA + DirectML + CoreML + OpenVINO + ROCm from one binary. Stronger on classical models (visio
Self-hosted ChatGPT-style web frontend. Pairs with Ollama or any OpenAI-compatible backend. Multi-user, RAG built in, fast.
NVIDIA's first-party inference compiler. Generates optimized engines per model + GPU pair, with the lowest latency on NVIDIA hardware. The pick when you're committed to a single SKU and need the absolute fastest tokens-p
Polished desktop GUI for local LLMs. Built-in HuggingFace search, OpenAI-compatible local server, side-by-side conversations.
Hand-optimized inference for EXL2-quantized models. Fastest single-GPU runtime for the EXL2 quant format on Ada/Hopper hardware. Lower-level than llama.cpp; pairs with text-generation-webui + TabbyAPI as front-ends.
AMD's open-source equivalent of NVIDIA CUDA. Required for any meaningful AMD GPU inference on Linux (vLLM, llama.cpp ROCm build, ExLlamaV2). Windows ROCm is improving as of 2026 but still trails Linux. Strix Halo APU + R
Apple's Metal-native ML framework's LLM runner. Now competitive with llama.cpp Metal on M-series silicon, with better long-context performance.
The bedrock of local LLM inference. Most other tools wrap or embed it. Maximum control, maximum platform support, sharpest learning curve.
The default first-pull tool for local AI. One-line model installs (`ollama run llama3.1`), an OpenAI-compatible HTTP API, good defaults out of the box. Built on llama.cpp.
Personal AI agent with a local-first gateway architecture. Connects your local LLMs (Ollama, llama.cpp) to the messaging surfaces you already use — WhatsApp, Telegram, Slack, Discord, iMessage, and 20+ more. The runaway
AI-driven development agent that completes engineering tasks end-to-end — branches, code, PRs. v1.6 added a Planning Mode that drafts a plan before executing. Local-LLM-friendly via Ollama, vLLM, and SGLang. The stronges
Decentralized peer-to-peer AI inference network. 2.7M+ CLI downloads, 2M+ active nodes globally as of April 2026. Three-tier model routing (local registry → DHT → gossip broadcast) supports any GGUF model. The April 2026
Drop-in memory layer for LLM agents. Vector + graph memory variants (Mem0g) — the graph variant builds a directed labeled knowledge graph alongside the vector store, with conflict detection on contradictory facts. Leads
Agent memory framework that models memory like an operating system. Main context = RAM, archival storage = disk; the agent itself decides when to page. Originally MemGPT, now Letta. Model-agnostic (Anthropic, OpenAI, Oll
Open protocol for LLM clients to talk to external tools and data sources. The 'USB-C for AI' that became the default in 2026 — supported by Anthropic, OpenAI, and Google DeepMind, with 500+ public MCP servers covering Gi
Open-source extensible AI agent now governed by the Agentic AI Foundation (AAIF) at the Linux Foundation. Started inside Block (formerly Square). 25+ provider support including Ollama, Ramalama, Docker Model Runner. Best
Open-source AI dev-team extension for VS Code (1.55M installs, 23.8k GitHub stars). **Discontinued: all Roo Code products — Extension, Cloud, and Router — shut down on May 15, 2026** with refunds for unused balances. The
Anthropic's official desktop app for Claude. Native MCP server support means you can plug in local file access, GitHub, and custom tools. Distinct from the Claude Code CLI.
Inflection AI's consumer assistant — voice-first, conversational, designed for personal use rather than coding. Powered by Inflection-2.5.
High-performance native editor from the Atom team, with built-in AI panel and inline assistant. BYO API key for any provider.
Sourcegraph's AI assistant. Strong at large-codebase context retrieval thanks to the underlying Sourcegraph index.
JetBrains' first-party AI for IntelliJ, PyCharm, WebStorm, etc. Multi-LLM backend (OpenAI, Anthropic, Gemini, local).
Replit's full-stack scaffolder agent. Goes from prompt to deployed app on Replit's hosted runtime.
Cognition Labs' fully autonomous SWE agent. Cloud-only, browser interface, longest task horizons. Premium pricing.
Factory's autonomous SWE agent. Operates over GitHub PRs, Slack, Linear. Targets the long-running multi-file change workflow.
Codeium's AI-native IDE (formerly known as Codeium). Cascade agent, supercomplete, and a generous free tier.
Open-source VS Code and JetBrains assistant. Configurable autocomplete + chat + agent modes. Strong with local Ollama backends.
VS Code agent — 1.5M users in 2026, supports 500+ models, charges zero markup over upstream API costs. Cline lineage with Roo Code's diff approach.
VS Code extension agent — ~4M installs in 2026. Plan/Act mode, autonomous file edits with diff approval, terminal access. The leading open-source IDE agent.
Open-source terminal coding agent built by the SST team. TUI-first, BYO LLM, MCP-compatible. A Claude-Code-style workflow without the Anthropic lock-in.
Terminal-based AI pair programmer. Run in your project directory, describe a change, it edits files and creates meaningful git commits. Works with any LLM — local Ollama, Anthropic, OpenAI, etc.
Open-source CLI client for the new Codex agent. Local CLI that orchestrates cloud Codex models against your file tree.
OpenAI's 2025 coding agent (the new Codex, distinct from the deprecated 2021 model). Cloud task-runner pattern: hand it a multi-step task, it works in a sandbox and returns a PR.
GitHub's incumbent AI assistant. VS Code, JetBrains, Neovim integrations. Lost some inline-completion mindshare to Cursor and agentic mindshare to Claude Code, but still the easiest enterprise rollout via GitHub.
Anysphere's AI-native IDE. Forks VS Code with Cursor Tab inline completion, agentic chat, and background agents. Best 'flow' for inline completion in 2026.
Anthropic's terminal-native coding agent. Tops SWE-bench Verified at 87.6% and SWE-bench Pro at 64.3% in 2026. Deep MCP integration, agentic file editing, and a $20/mo Pro tier are the standout signals.
Character-driven LLM frontend originally for role-play; widely used for any persona-driven workflow. Supports OpenAI, KoboldAI, llama.cpp, Ollama, Aphrodite, oobabooga endpoints. Rich sampling controls, character cards,
Microsoft's DirectX 12 inference backend. The Windows-native path for AMD / Intel / Qualcomm GPU + NPU acceleration without ROCm or vendor-specific SDKs. Used through ONNX Runtime as the DML execution provider.
vLLM fork specialized for creative writing / role-play workloads. Adds samplers (smoothing factor, dynatemp, mirostat, DRY, XTC) that mainline vLLM doesn't ship. Same continuous-batching architecture; trades some through
Intel's PyTorch extension for low-bit LLM inference on Intel GPUs / CPUs / NPUs. Strongest community-supported path for running LLMs on Intel Arc A770 / B580 and on Lunar Lake NPUs. Compatible with Hugging Face Transform
Python bindings for llama.cpp with an OpenAI-compatible HTTP server. The fastest path from `pip install` to a working local-LLM endpoint. Ships pre-built wheels with optional CUDA / Metal / ROCm / Vulkan support.
Specialized transformer inference engine. The reference runtime for Whisper (faster-whisper), NLLB translation, and other encoder-decoder models. Out-of-the-box INT8 quantization with strong CPU performance.
Qualcomm's official on-device-AI compiler + model zoo for Snapdragon NPU targets. Pre-quantized model variants for Llama, Phi, Gemma, Qwen running on Hexagon NPU. The reference path for Android NPU acceleration in 2025-2
Apple's Swift bindings for MLX. The native iOS / iPadOS path for on-device LLM inference. Apple-published example apps demonstrate Llama 3.2, Phi-3.5, Qwen 2.5 running on iPhone 15 Pro+ at usable rates.
Microsoft's mobile/edge variant of ONNX Runtime. The reference path for Snapdragon X / Lunar Lake / Ryzen AI on Windows + Copilot+ PC NPU acceleration. Mobile builds drop ops not used in inference to keep binary size sma
TVM-based LLM compilation framework. Compiles models for any GPU with a Vulkan / Metal / WebGPU / CUDA backend. The most-deployed cross-platform on-device LLM runtime — runs Llama, Phi, Gemma, Qwen on phones, browsers, a
PyTorch's official mobile / edge inference runtime. Compiles PyTorch models to a mobile-optimized format for Android (NNAPI / GPU / NPU) and iOS (Metal / CoreML). The successor to the deprecated PyTorch Mobile path.
MCP server wrapping Firecrawl — a managed crawler that handles JavaScript rendering, anti-bot evasion, and large-site map+scrape jobs at scale. The pragmatic upgrade from mcp-server-fetch when an agent needs to crawl tho
Reference MCP server that gives an agent a structured scratchpad for multi-step reasoning. Each call records a numbered thought with revision and branching support — the agent can backtrack, fork, and consolidate plans w
Reference MCP server for local Git repository operations. Status, diff, log, blame, branch listing — read-side operations against a checked-out repo without round-tripping to GitHub. Pairs with mcp-server-filesystem to g
Reference MCP server for fetching and converting web content. Pulls a URL, runs HTML through a readability extractor, returns markdown the model can chunk and reason over. The lightweight web-reader pair to Brave Search
Reference MCP server that gives an agent a persistent knowledge graph — entities, relations, observations stored to disk and surfaced back across sessions. The simplest path to making an agent remember context between co
Reference MCP server wrapping the Brave Search API. Privacy-respecting alternative to Google/Bing endpoints — Brave does not maintain a personal-history-linked index. The default web-search MCP in the Anthropic reference
Microsoft's MCP server that drives a real browser via Playwright — Chromium, Firefox, and WebKit. Ships ~22 tools that operate against the page's accessibility tree rather than pixel coordinates, which is dramatically mo
Reference MCP server that exposes a Postgres database as a query surface. Read-only by default — but worth flagging that early versions had a SQL-injection class issue where the read-only wrapper could be bypassed by sta
GitHub's first-party MCP server. Surfaces issues, pull requests, code search, file contents, repo metadata, Actions runs, and discussions through the protocol. Now maintained by GitHub itself rather than the original Ant
Anthropic's reference MCP server for filesystem access. Read, write, search, move, and list files inside a configured allowlist of directories. The canonical example for understanding how MCP tool exposure works in pract
Open-source LLM tracing + evaluation. OpenInference standard for traces; runs locally with one pip install. The OSS-first pick for teams that want LangSmith-shaped functionality without vendor lock-in.
LangChain's observability + evaluation platform. Trace agent runs, run evaluators against benchmark suites, version prompts. The dominant trace+eval tool for the LangChain/LangGraph ecosystem.
OpenAI-API frontend for ExLlamaV2. Wraps the EXL2 inference engine in a clean HTTP API, adds streaming, batching, and OAI-compatible chat templates. The default front-of-house when you've already committed to the EXL2 qu
OpenAI-API-compatible drop-in for self-hosted inference, with a multi-backend twist: the same endpoint can serve LLMs (llama.cpp / vLLM under the hood), embeddings, image gen (stable-diffusion.cpp), audio (whisper.cpp),
Personal AI cluster software. Auto-discovers Apple Silicon devices on a LAN and shards a model across them via pipeline + tensor parallelism on top of MLX. The 2026 unlock: Thunderbolt 5 + macOS 26.2 RDMA dropped inter-d
BitTorrent-style decentralized LLM inference. Splits a model into transformer-block shards distributed across volunteer hosts on the public internet — one client runs the input/output layers locally and streams activatio
Distributed model serving on top of Ray. Lets you stitch vLLM / SGLang / custom runtimes into a multi-replica, multi-model deployment with autoscaling, traffic splitting, and pipeline composition. The orchestration layer
Long-term memory platform for AI agents. Sits above Graphiti as the application layer — sessions, facts, summaries, vector + graph hybrid retrieval. The 'memory backend you don't have to build' choice.
Temporal graph memory framework. Builds a bi-temporal knowledge graph from agent conversations, tracking when each fact was learned and when it was true. Powers Zep's hosted offering.
Neo4j's official GraphRAG toolkit — Python library + reference patterns for building retrieval-augmented generation against a knowledge graph. The mature pick for enterprises already running Neo4j.
Vector search inside the same Redis you already run. HNSW + flat indices, hybrid filtering with FT.SEARCH. The pragmatic pick when you don't want to add another service to ops.
Distributed vector database designed for billion-scale workloads. Compute-storage separation, GPU-accelerated index builds, multi-tenant from the ground up. The pick when you've outgrown Qdrant single-node.
Embedded vector + columnar database. Lance file format reads serverless from S3/local disk; no separate process to run. The pick for embedded apps and notebook workflows.
Vector database with built-in modules for embedding, generative search, and reranking. Schema-first design appeals to teams used to traditional databases. Generative-search module pairs with local Ollama models out of th
Vector database written in Rust. Strong filtering (payload-based pre-filter), HNSW index with quantization variants, gRPC + REST APIs. The performance pick when you cross 10M vectors.
Open-source embedding database for LLM applications. The default 'just install pip and start' vector store for prototypes, with first-party clients in Python and JS. SQLite-backed locally, distributed mode in cloud.
2x faster QLoRA fine-tuning with hand-tuned Triton kernels. Free OSS for single-GPU; commercial Pro for multi-GPU.
YAML-config fine-tuning framework. Reference toolkit for the open fine-tuning community (Hermes, Dolphin, etc. all use it).
The CLI for the world's model hub. `hf download`, `hf upload`, model card editing.
The original Stable Diffusion frontend. Less actively developed in 2026 than ComfyUI but still has the cleanest UX for simple gen.
Node-graph image-generation UI. Standard for Stable Diffusion and Flux workflows. Endlessly customizable.
Browser-style app launcher for AI tools. One-click installs of ComfyUI, oobabooga, RVC, and many other AI apps.
Lets LLMs execute code locally — Python, shell, AppleScript. The original 'Code Interpreter on your machine'. Useful for automation tasks.
Python/JS framework focused on RAG and document indexing. Cleaner than LangChain for retrieval-heavy use cases.
Python/JS framework for chains, agents, and RAG. Batteries-included but heavyweight; many graduate to LangGraph or DIY.
One of the original local-LLM apps from Nomic. Privacy-focused, runs on CPU, decent model library. Pace of development has slowed compared to Jan/Msty.
Cross-platform desktop client supporting local and cloud models in one window. Strong on knowledge-stack RAG.
Open-source desktop ChatGPT alternative. Privacy-first, runs offline, supports Hugging Face import.
The 'AUTOMATIC1111 of LLMs'. Kitchen-sink Gradio UI with multi-backend support and a big extension ecosystem.
Single-file llama.cpp distribution focused on roleplay and creative writing. Bundles a web UI, image gen, and the Kobold API.
Mozilla's single-binary llama.cpp distribution. Download one file, run on any OS without dependencies.
HuggingFace's production inference server. Slightly behind vLLM on raw throughput but tighter integration with the HF ecosystem.