llama.cpp
The bedrock of local LLM inference. Most other tools wrap or embed it. Maximum control, maximum platform support, sharpest learning curve.
llama.cpp is the engine that turned local LLM inference from a research curiosity into something a hobbyist could run on a laptop. Georgi Gerganov's C++ implementation of LLaMA inference — originally a weekend port of Meta's PyTorch model to plain C — became the dominant local-AI execution layer almost by accident. Today it's the inference layer underneath Ollama, LM Studio, and most of the consumer-grade local-AI ecosystem. The operator-grade question isn't "is llama.cpp good?" — yes, it's the foundation — but "when do you use llama.cpp directly versus via a wrapper?"
Architecture and what llama.cpp actually is
llama.cpp is a C++ library + a set of CLI tools (llama-cli, llama-server, llama-bench, llama-quantize) plus the GGUF model file format that's become the de-facto standard for portable quantized weights. The open-source repo carries 90k+ stars and accepts contributions at high velocity — daily merges to master are routine.
Architecturally, llama.cpp is what other runtimes wrap. Ollama vendors a llama.cpp build and exposes a friendly daemon. LM Studio bundles llama.cpp under a desktop GUI. KoboldCpp is a llama.cpp fork with chat extensions. So when you choose llama.cpp directly, you're choosing maximum control + minimum abstraction — and accepting the operator burden of the build flags, runtime flags, and per-model tuning that the wrappers normally hide.
The execution model: load a GGUF file into RAM (or VRAM if you have a GPU backend compiled), then run autoregressive decoding via vectorized matmul kernels tuned per backend. The library supports CPU (with AVX2 / AVX512 / NEON / AMX), NVIDIA CUDA, AMD ROCm (via HIP), Apple Metal, Vulkan, and SYCL (Intel). Different backends ship at different maturity levels; see the compatibility matrix below.
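To make "autoregressive decoding" concrete, here is a minimal sketch of the loop in Python. It is not llama.cpp's implementation: `greedy_decode` and `toy_model` are hypothetical names, and the toy model stands in for the real forward pass over GGUF weights.

```python
from typing import Callable, List


def greedy_decode(
    next_logits: Callable[[List[int]], List[float]],
    prompt_tokens: List[int],
    eos_token: int,
    max_new_tokens: int,
) -> List[int]:
    """Generate tokens one at a time, feeding each choice back in as input."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = next_logits(tokens)  # one forward pass over the current context
        next_token = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if next_token == eos_token:
            break
        tokens.append(next_token)
    return tokens[len(prompt_tokens):]


def toy_model(tokens: List[int]) -> List[float]:
    """Hypothetical stand-in for a real model: prefers (last token + 1) mod 5."""
    vocab_size = 5
    target = (tokens[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


generated = greedy_decode(toy_model, prompt_tokens=[0], eos_token=4, max_new_tokens=10)
print(generated)
```

Each new token requires another pass over the model, which is why single-stream speed is reported in tokens per second and why samplers (temperature, top-k, top-p) slot in where the argmax sits here.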
Local stack compatibility
llama.cpp's backend story is the broadest of any local-AI runtime: it runs on more hardware than any competitor, and its CPU fallback is the gold standard for "I just want this to work on whatever I have." But "supports a backend" and "the backend is well-tuned" are different statements. Apple Metal + NVIDIA CUDA are the production paths. ROCm has matured but lags CUDA in flash-attention coverage. Vulkan is the universal fallback for GPUs without a first-class path (Intel Arc, older AMD, NVIDIA on systems where CUDA-build is impractical). For runtime-runtime tradeoffs see /compare/engines/ollama-vs-llama-cpp, /compare/engines/vllm-vs-llama-cpp, and /compare/engines/mlx-vs-llama-cpp.
The compatibility matrix below ranks each backend's operator readiness in 2026.
Setup + day-1 reality
Three install paths, ranked by friction:
- Pre-built binary (brew install llama.cpp on macOS, package manager on Linux distros). Lowest friction; you get the default backend (Metal on Mac, CPU + AVX on Linux). Works for 90% of getting-started use cases.
- Build from source with backend flag (cmake -B build -DGGML_CUDA=ON && cmake --build build). Required when you want CUDA / ROCm / Vulkan / Sycl. The build needs the matching toolchain (CUDA Toolkit / ROCm / Vulkan SDK) installed and visible to CMake. This is where most operator pain happens: cmake flags drift across releases, and a recipe that worked 6 months ago may not work on master today.
- Pre-built CUDA binary (released on GitHub for major versions). Acceptable for stable production-ish use but lags master by days-to-weeks.
Once you have a binary, three CLI tools matter: llama-cli for one-shot chat / completion, llama-server for an HTTP API on localhost:8080 (with OpenAI-compat endpoints under /v1/...), and llama-bench for reproducible per-prompt token-throughput measurement. The benchmark tool is meaningfully better than what most other runtimes ship — see the benchmark methodology checklist for how to use it correctly.
GGUF files: download from Hugging Face (the bartowski account is the de-facto canonical quantizer for new models in 2026; older quants live under TheBloke). Place anywhere accessible; pass -m path/to/model.gguf to any CLI tool.
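A quick sanity check on a downloaded file is to read its fixed-size header. The sketch below assumes the GGUF v2+ little-endian layout (4-byte magic, 32-bit version, 64-bit tensor count, 64-bit metadata key/value count); `parse_gguf_header` is a hypothetical helper, not part of llama.cpp, and the sample bytes are synthetic.

```python
import struct
from typing import NamedTuple


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def parse_gguf_header(data: bytes) -> GGUFHeader:
    """Parse the first 24 bytes of a GGUF file (assumes v2+ little-endian layout)."""
    if len(data) < 24:
        raise ValueError("too short to contain a GGUF header")
    if data[:4] != b"GGUF":
        raise ValueError("missing GGUF magic; not a GGUF file")
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return GGUFHeader(version, tensor_count, kv_count)


# Synthetic header for illustration only; not taken from a real model file.
sample = b"GGUF" + struct.pack("<IQQ", 3, 291, 24)
header = parse_gguf_header(sample)
print(header)
```

For a real file, pass the result of `open(path, "rb").read(24)`. A header that parses cleanly does not prove the tensors are intact, but a failure here explains a load error immediately.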
Operational concerns
- Build flag drift. A working CMake recipe from 6 months ago may fail on current master. The repo's docs/build.md is authoritative; community blog posts go stale fast.
- Master-vs-release versioning. llama.cpp's release cadence is high: build-numbered release tags (bNNNN) are cut from master very frequently, often several times a day. Daily master commits are usually safe, but for production pin to a release tag.
- Sampler defaults. llama.cpp's defaults differ from upstream model card recommendations. For accurate inference, pass -p, --temp, --top-p, --top-k, --repeat-penalty explicitly per model card.
- No native daemon lifecycle. llama-server is a foreground process; you wrap it in systemd / launchd / Docker yourself. This is a feature, not a gap, but operators new to llama.cpp expect Ollama-style daemon behavior and are surprised.
- GGUF version migrations. When the GGUF spec adds fields (it does, periodically), older quants stop loading on newer llama.cpp. Re-download from bartowski.
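One way to keep sampler settings explicit and reviewable is to generate the llama-server command line from a small per-model table. The sketch below is an assumption-laden illustration: `build_server_argv` is a hypothetical helper, and the sampler values and model path are placeholders, not recommendations from any model card.

```python
import shlex
from typing import Dict, List

# Maps our own setting names to llama-server sampler flags.
SAMPLER_FLAGS = {
    "temp": "--temp",
    "top_p": "--top-p",
    "top_k": "--top-k",
    "repeat_penalty": "--repeat-penalty",
}


def build_server_argv(
    model_path: str,
    sampler: Dict[str, float],
    ctx_size: int = 8192,
    gpu_layers: int = 999,
    port: int = 8080,
) -> List[str]:
    """Build an explicit llama-server argv so no sampler default is left implicit."""
    argv = [
        "llama-server", "-m", model_path, "--port", str(port),
        "-c", str(ctx_size), "-ngl", str(gpu_layers),
    ]
    for name, value in sampler.items():
        if name not in SAMPLER_FLAGS:
            raise KeyError(f"unknown sampler setting: {name}")
        argv += [SAMPLER_FLAGS[name], str(value)]
    return argv


# Placeholder values; copy the real ones from the model card you are using.
example_sampler = {"temp": 0.6, "top_p": 0.9, "top_k": 40, "repeat_penalty": 1.1}
argv = build_server_argv("models/example.gguf", example_sampler)
print(shlex.join(argv))
```

The printed line can go into a systemd `ExecStart=` or a Docker `CMD`, which also addresses the daemon-lifecycle point: the wrapper you write becomes the single place where flags live.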
Performance reality
llama.cpp's tuning sets the ceiling for what any GGUF-based runtime can achieve, and wrappers (Ollama, LM Studio) come close to it. Direct-use llama.cpp can outperform wrappers by 5-10% via flag tuning (-fa for flash-attention on supported backends, -ngl 999 to push all layers to GPU, --no-mmap on systems where mmap behaves badly under memory pressure). Not life-changing, but real. Single-stream tok/s is comparable to Ollama's because Ollama is llama.cpp underneath.
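A 5-10% difference is easy to confuse with run-to-run noise, so it helps to compare repeated measurements, not single runs. The sketch below uses made-up tok/s figures (they are not measurements) and a crude noise heuristic that is an assumption of this example, not a llama-bench feature.

```python
from statistics import mean, stdev
from typing import Sequence


def relative_gain_percent(baseline: Sequence[float], tuned: Sequence[float]) -> float:
    """Percent change in mean throughput from baseline to tuned."""
    return (mean(tuned) / mean(baseline) - 1.0) * 100.0


def exceeds_noise(baseline: Sequence[float], tuned: Sequence[float], k: float = 2.0) -> bool:
    """Crude check: is the difference in means larger than k standard deviations?"""
    noise = max(stdev(baseline), stdev(tuned))
    return abs(mean(tuned) - mean(baseline)) > k * noise


# Hypothetical tok/s from three repetitions each; replace with your own runs.
baseline_runs = [50.0, 51.0, 49.0]
tuned_runs = [54.0, 55.0, 53.0]

gain = relative_gain_percent(baseline_runs, tuned_runs)
print(f"gain: {gain:.1f}%  outside noise: {exceeds_noise(baseline_runs, tuned_runs)}")
```

With only a handful of repetitions this is a sanity check, not a statistical test; more repetitions and a fixed prompt length make the comparison more trustworthy.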
For multi-user concurrent serving, llama.cpp is the wrong tool: its design is optimized for single-stream generation, and concurrent requests end up queuing. For more than one concurrent user, see vLLM or SGLang.
Failure modes (what breaks)
The operator-grade list, ranked by community-benchmark error frequency:
- CUDA OOM at long context. Setting -c 8192 or higher (-c sets the context size; -n only caps how many tokens are generated) on a 24GB card with a 70B Q4 model exhausts VRAM. Pre-compute KV memory: 70B Q4 + 8K context ≈ 4 GB cache + 40 GB weights + overhead. You need 48GB+ for 70B at 8K.
- Wrong CMake backend for your hardware. Operators end up running on CPU when they expected GPU because -DGGML_CUDA=ON wasn't passed. Always check llama-cli's startup banner for the active backend.
- flash-attention mismatch. -fa works on CUDA + Metal but not all ROCm versions. If you see "flash attention not supported," drop the flag.
- Tokenizer drift on third-party GGUF quants. Some uploaders (rare, but real) ship GGUFs with subtly wrong tokenizer config. Output looks plausible but diverges from upstream model. Use bartowski / official-org quants when possible.
- Memory mapping vs RAM tradeoffs. Default mmap behavior loads weights lazily: first inference is slow, subsequent ones fast. With --no-mmap you load eagerly (slow startup, fast first inference). Operators occasionally pick wrong for their use case.
- Multi-GPU layer-split miscount. -ngl N offloads N layers to GPU; the rest stay on CPU. Setting -ngl 999 (all layers) is usually right; setting it to a small number when you have GPU headroom is silent under-utilization.
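The memory budget behind the long-context OOM can be estimated before loading anything. The sketch below assumes a 70B-class model with grouped-query attention (80 layers, 8 KV heads, head dimension 128), an f16 cache, and roughly 4.5 bits per weight for a Q4-style quant; all of these are assumptions to replace with values from the model's own metadata. With them, the cache comes out nearer 2.7 GB than the rounded 4 GB figure, and models without grouped-query attention need several times more.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_element: int = 2) -> int:
    """Size of the K and V caches: 2 tensors per layer, one vector per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_element


def weight_bytes(n_params: float, bits_per_weight: float) -> int:
    """Approximate size of quantized weights."""
    return int(n_params * bits_per_weight / 8)


# Assumed 70B-class dimensions; check the model's metadata for real values.
cache = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, n_ctx=8192)
weights = weight_bytes(n_params=70e9, bits_per_weight=4.5)

print(f"KV cache: {cache / 1e9:.2f} GB")
print(f"weights:  {weights / 1e9:.2f} GB")
print(f"total before compute buffers: {(cache + weights) / 1e9:.2f} GB")
```

The total is well past a 24GB card before compute buffers are counted, so the conclusion does not depend on the exact cache figure. The cache term scales linearly with context length, so doubling the context doubles it.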
How llama.cpp compares
Compared to Ollama: same engine. Ollama wraps llama.cpp with daemon + curated model library + sane defaults. Use llama.cpp directly when you need build-flag control, custom sampling, or grammars (GBNF). Use Ollama when you want it to just work.
Compared to vLLM: vLLM is a different architecture entirely — paged attention, continuous batching, tensor-parallel multi-GPU. vLLM dominates production multi-user serving. llama.cpp dominates single-user laptop / homelab inference. They're not really competitors; they live at different points in the operator's career.
Compared to MLX-LM: MLX is Apple Silicon native, no GGUF, faster on-device on M-class hardware for many workloads. llama.cpp's Metal backend is competitive but MLX often edges it on M1-M3 by 5-15%. On M4 the gap narrows. If you're Apple-only and chasing every tok/s, try MLX. If you want one binary that runs everywhere, llama.cpp is right.
Compared to ExLlamaV2: ExLlamaV2 is NVIDIA-only and chases maximum-throughput single-GPU inference of EXL2-quantized models. It outpaces llama.cpp on a 24GB consumer card for the specific scenario it targets (4-5 bpw EXL2, single-user, NVIDIA). If your use case fits, it's faster. Otherwise llama.cpp's portability wins.
Deployment paths
Three operator-grade deployment shapes are documented in the structured deployment-paths section below: build-from-source homelab path (max control, accept the build burden), llama-server with reverse proxy (HTTP API serving a small team), and pre-built binary daily-driver (CLI use for a single operator). Each card under this review shows hardware + complexity + when it fits.
Editorial verdict
llama.cpp is the foundation. It's not always the right user-facing surface — Ollama is for newcomers, vLLM is for production serving, MLX-LM is for Apple-only — but underneath those wrappers, the engine that delivers the tokens is llama.cpp. Use it directly when you've outgrown the wrappers' opinions or need control they don't expose. Don't use it directly when the wrapper would suffice and operator-time matters more than the last 5-10% of throughput.
The release cadence + community velocity means llama.cpp keeps improving faster than any competitor. The team has shipped meaningful performance wins (KV-cache reuse, flash-attention, speculative decoding, quant precision) on a near-monthly basis since 2023. That's compounding, and it's why even production-grade alternatives ship llama.cpp under the hood.
Last reviewed 2026-05-08 by RunLocalAI editorial. Reproduce or correct: /submit/feedback.
| Status | Runtime / Stack | Notes |
|---|---|---|
| Excellent | Apple Silicon (M1-M4, Metal) | First-class Metal backend, native autovectorization on AMX. The reference implementation for Apple-side tuning. Sweet spot for 7B-13B at conversational latencies; M-Max + 64GB unified memory comfortably runs 70B Q4. flash-attention available. |
| Excellent | NVIDIA CUDA (RTX 30/40/50) | Production CUDA path with flash-attention + speculative decoding + multi-GPU layer split. Build with -DGGML_CUDA=ON. The fastest path on consumer NVIDIA except where ExLlamaV2 wins for EXL2-specific scenarios. |
| Good | AMD ROCm (RX 7000 / 9000 / Instinct) | ROCm 6+ supported via -DGGML_HIP=ON. Per-feature gaps: flash-attention available on most consumer GPUs, less consistent on Instinct. Worth verifying current state per release. |
| Excellent | CPU-only (x86_64, AVX2 / AVX512 / AMX) | Reference CPU implementation with hand-tuned SIMD per ISA. The gold standard for laptop / homelab CPU inference. Usable for 7B Q4 on 16GB RAM, 13B Q4 on 32GB. Single-digit tok/s but always works. |
| Good | Intel Vulkan / Sycl | Vulkan compute path is the universal fallback for GPUs without a first-class backend. Works on Intel Arc, older AMD, and even some NVIDIA via SPIR-V. Performance trails CUDA but is usable. |
| Excellent | ARM64 (Apple Silicon CPU, Snapdragon, Graviton) | ARM NEON tuned. Runs well on cloud ARM (Graviton, Ampere Altra) for CPU-only inference. Mobile path (Snapdragon X Elite, etc.) lacks NPU acceleration but CPU performance is honest. |
| Good | Datacenter (H100 / A100 / MI300X) | Runs but underutilizes. Tensor-parallel + paged attention belong to vLLM/SGLang for these cards. Use llama.cpp on datacenter GPUs for ad-hoc dev, single-user inference, or as a portable fallback. |
| Good | Multi-GPU layer-split | -ngl + tensor-split + main-gpu flags work for spilling weights across multiple NVIDIA cards (single-stream serial). Not tensor-parallel — concurrent throughput doesn't scale. Best for fitting one large model across two cards. |
Pre-built binary daily driver
Complexity: trivial. Brew install / pacman install / GitHub release download. Run llama-cli or llama-server directly, point at a GGUF from bartowski, ship. Lowest friction; covers 80% of single-operator use cases. Choose this when you don't need GPU acceleration on a non-Mac (the pre-built binaries default to CPU on Linux / Windows).
Build from source with GPU backend
Complexity: involved. When you need CUDA / ROCm / Vulkan / Sycl acceleration on Linux or Windows. Clone master, install backend toolchain, cmake -B build -DGGML_CUDA=ON (or -DGGML_HIP=ON / -DGGML_VULKAN=ON), cmake --build build -j. Pin to a release tag for production. The reference workflow for max control + max performance + the cost of build-flag drift across releases.
llama-server behind reverse proxy
Complexity: moderate. When you want a small-team OpenAI-compatible API but don't yet need vLLM. llama-server runs on :8080 with /v1/chat/completions and /v1/embeddings endpoints. Front it with caddy (auto-TLS) or nginx (manual cert) for HTTPS + basic auth. Wrap in systemd for restart-on-crash. The cliff: real concurrent users (>1) queue and feel slow; when that hurts, migrate to vLLM.
Setup guidance
- Install: via package manager (macOS: brew install llama.cpp), or build from source: git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && cmake -B build && cmake --build build --config Release. The server binary is at ./build/bin/llama-server.
- Get a model: download a GGUF model from HuggingFace (e.g. hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF).
- Start: ./build/bin/llama-server -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf --port 8080.
- Verify: curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'.
The server exposes OpenAI-compatible /v1/chat/completions and /v1/completions endpoints plus a web UI at http://localhost:8080. Most of the first-run time is the model download; once the model has loaded (~5–30 seconds depending on model size), inference starts right away. llama.cpp runs on CPU, CUDA, Metal (Apple Silicon), Vulkan, SYCL (Intel), and ROCm (AMD) backends. The backend is chosen at build time; on a GPU build, pass -ngl 99 to offload all layers to GPU. Time-to-first-response from zero: ~1 minute including model download for a 3B GGUF.
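The same verification can be done from Python with only the standard library. This is a minimal sketch assuming the server is reachable at http://localhost:8080 and returns the usual OpenAI-style response shape; `build_chat_request` and `chat` are hypothetical helper names, and the `temperature` and `max_tokens` values are arbitrary.

```python
import json
import urllib.request


def build_chat_request(base_url: str, user_message: str,
                       temperature: float = 0.7,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, user_message: str, timeout: float = 120.0) -> str:
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(base_url, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("http://localhost:8080", "Hello")` should return the model's reply. Because the endpoint is OpenAI-compatible, most OpenAI client libraries can also be pointed at the same base URL.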
Workload fit
Best for: single-user local chat and inference across every hardware target (CPU, GPU, Apple Silicon), embedding generation with broad model support, GGUF-first model ecosystems, CPU-only server deployments where NVIDIA GPUs aren't available, speculative-drafting pipelines with small/large model pairs, offline/air-gapped inference where Docker or complex Python environments are unavailable.
Not suited for: high-concurrency production API serving (more than 10 concurrent requests; use vLLM), latency-competitive deployments where every millisecond counts (use TensorRT-LLM), researcher workflows that need to load models from HuggingFace safetensors directly without GGUF conversion, Windows-first deployments (WSL2 is the supported path).
Alternatives
Use llama.cpp for maximum hardware coverage — it is the only production-quality inference engine that runs on CPU, Apple Silicon, AMD Vulkan, Intel SYCL, and NVIDIA CUDA with a single codebase. The GGUF format ecosystem is the widest: tens of thousands of pre-quantized models on HuggingFace at every quant level from Q2 to Q8. Switch to vLLM when you need multi-tenant production serving on NVIDIA datacenter GPUs — llama.cpp's throughput per-GPU is 3–10× lower than vLLM. Use MLX-LM when on Apple Silicon and you want Apple-optimized inference with Swift-friendly tooling; llama.cpp on Metal is competitive but MLX-LM often edges it on memory bandwidth utilization. Use Ollama when you want a polished CLI and model management layer — Ollama wraps llama.cpp as its backend but adds download/versioning/conversation management that raw llama.cpp lacks. Stick with llama.cpp for embedding generation, speculative decoding, and any CPU-only deployment.
Troubleshooting + when to switch
- Problem: llama_model_load: error loading model: invalid model file. Fix: The GGUF file is corrupted or incompatible with your llama.cpp version. Compare the file's SHA-256 against the checksum shown on the model's HuggingFace file page; if it does not match, re-download from the bartowski or lmstudio-community quantization on HuggingFace. If it matches, update llama.cpp.
- Problem: GPU offloading slower than CPU-only on Apple Silicon. Fix: Metal backend may be shader-compiling on first inference; run a warmup request first. Ensure -ngl 99 (not -ngl 0) and check that Metal is the active backend with the -v flag. Some models have architecture variants (e.g. DeepSeek MoE) that require specific llama.cpp build flags; rebuild with -DGGML_CUDA=ON, -DGGML_METAL=ON, or equivalent.
- Problem: Server crashes on >32K context. Fix: llama.cpp defaults to model-native context window. Override with -c 8192 if your hardware can't handle full context. Flash attention (-fa) reduces KV-cache memory by ~30%, so enable it for long-context workloads.
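One way to rule out a corrupted download before blaming the llama.cpp version is to hash the file and compare it with a checksum published alongside the model. The helpers below are a hypothetical sketch using only the standard library; the expected checksum is something you copy from the model page yourself.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so multi-gigabyte GGUFs never sit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path: str, expected_hex: str) -> bool:
    """Compare against a published checksum, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

If the checksum matches, the file is intact and a version mismatch between the quant and your llama.cpp build becomes the more likely cause.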
Stack & relationships
How llama.cpp relates to other entries in the catalog — recommended pairings, alternatives, dependencies, and edges to avoid. Each edge carries a one-line operator note from our editorial team.
Works with
- Works with AnythingLLM
Use llama.cpp's OpenAI-compatible /v1 endpoint. Streaming + tool calls work; some quants need explicit chat-template config.
Alternatives
- Alternative to MLX-LM
On Apple Silicon, MLX-LM is now competitive with llama.cpp Metal — especially on long-context workloads. Pick MLX if you want Apple-native; pick llama.cpp if you want cross-platform GGUF compatibility.
Depends on
- Depends on Ollama
Ollama is a llama.cpp wrapper at the inference layer. Improvements to llama.cpp's quant kernels flow through to Ollama on next release.
- Depends on LM Studio
LM Studio bundles a llama.cpp build under the hood. The desktop UI is the differentiator; the engine is shared, so improvements in llama.cpp's kernel performance flow through to LM Studio on next release.
- Depends on LocalAI
LocalAI uses llama.cpp as one of several backends for LLM inference. Architecture coverage tracks llama.cpp upstream for the LLM path; image/audio backends are separate.
- Requires vLLM
Not a runtime dependency, but vLLM does NOT replace llama.cpp for CPU / Apple Silicon / edge. Different categories; if your hardware is outside vLLM's wheelhouse use llama.cpp.
- Depends on Petals
Not a runtime dependency, but Petals leans on the broader llama.cpp / HuggingFace ecosystem for tokenizers and model weights. Architecture support tracks what those upstreams ship.
Lifecycle
- Succeeded by Ollama
Ollama wraps llama.cpp with curated model pulls and an OpenAI-compatible API. For most users, Ollama is the front of house and llama.cpp is the engine room.
- Forked from Llamafile
Mozilla's single-binary distribution of llama.cpp + the Cosmopolitan libc trick. Same engine, zero-install delivery.
Featured in this stack
The L3 execution stacks that pick this tool as a recommended component, with the one-line note explaining the role it plays in each.
- Mixed RTX 4090 + 3090 workstation: the asymmetric upgrade path (Stack · L3 · Homelab tier · Role: Inference engine, asymmetric layer-split)
llama.cpp is the only practical runtime for asymmetric pairs. Its --tensor-split argument accepts unequal ratios; vLLM and SGLang assume symmetric cards and underperform by 2-3× on mixed setups.
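The value passed to --tensor-split is a comma-separated list of proportions, one per GPU. A minimal sketch of deriving it from per-card weights follows; `tensor_split_arg` is a hypothetical helper, and weighting by free VRAM is an assumption of this example (you might weight by memory bandwidth instead). The VRAM figures are illustrative, not measured.

```python
from typing import Sequence


def tensor_split_arg(weights: Sequence[float]) -> str:
    """Turn per-GPU weights (e.g. free VRAM in GB) into a --tensor-split value."""
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative and sum to more than zero")
    return ",".join(f"{w / total:.2f}" for w in weights)


# Illustrative: the first card also drives the display, so less of it is free.
split = tensor_split_arg([22.5, 24.0])
print("--tensor-split", split)
```

Starting from proportional memory and then nudging the ratio while watching per-card utilization is usually quicker than guessing ratios by hand.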
Pros
- Runs everywhere — including phones
- Authoritative GGUF tooling
- Performance-tuned per-architecture
Cons
- Build-from-source culture
- CLI-only by default
- Flag soup
Compatibility
| Property | Support |
|---|---|
| Operating systems | macOS, Linux, Windows, BSD, Android |
| GPU backends | NVIDIA CUDA, AMD ROCm, Apple Metal, Vulkan, CPU |
| License | Open source · free |
Runtime health
Operator-grade signals on how actively llama.cpp is being maintained, how fresh its measurements are, and what failure classes operators have flagged. Every label below is anchored to a real date or count — we never infer maintainer activity we can't show.
Release cadence
Derived from the most recent editorial signal on this row.
8 days since last refresh · source: lastUpdated
Benchmark freshness
How recent the editorial measurements on this runtime are.
No editorial benchmarks for this runtime yet.
Community reproduction
Submissions that match an editorial measurement on similar hardware.
No community reproductions on file yet.
Ecosystem stability
Editorial rating from RunLocalAI — qualitative, not measured.
Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.