Runner · Open source · Free · 4.6/5 · Operational review

llama.cpp

The bedrock of local LLM inference. Most other tools wrap or embed it. Maximum control, maximum platform support, sharpest learning curve.

By Fredoline Eruo·Reviewed May 8, 2026·90,000 GitHub stars

llama.cpp is the engine that turned local LLM inference from a research curiosity into something a hobbyist could run on a laptop. Georgi Gerganov's C++ implementation of LLaMA inference, originally a weekend port of Meta's PyTorch model to plain C/C++, became the dominant local-AI execution layer almost by accident. Today it's the inference layer underneath Ollama, LM Studio, and most of the consumer-grade local-AI ecosystem. The operator-grade question isn't "is llama.cpp good?" (yes, it's the foundation) but "when do you use llama.cpp directly versus via a wrapper?"

Architecture and what llama.cpp actually is

llama.cpp is a C++ library + a set of CLI tools (llama-cli, llama-server, llama-bench, llama-quantize) plus the GGUF model file format that's become the de-facto standard for portable quantized weights. The open-source repo carries 90k+ stars and accepts contributions at high velocity — daily merges to master are routine.

Architecturally, llama.cpp is what other runtimes wrap. Ollama vendors a llama.cpp build and exposes a friendly daemon. LM Studio bundles llama.cpp under a desktop GUI. KoboldCpp is a llama.cpp fork with chat extensions. So when you choose llama.cpp directly, you're choosing maximum control + minimum abstraction — and accepting the operator burden of the build flags, runtime flags, and per-model tuning that the wrappers normally hide.

The execution model: load a GGUF file into RAM (or VRAM if you have a GPU backend compiled), then run autoregressive decoding via vectorized matmul kernels tuned per backend. The library supports CPU (with AVX2 / AVX512 / NEON / AMX), NVIDIA CUDA, AMD ROCm (via HIP), Apple Metal, Vulkan, and Intel SYCL. Different backends ship at different maturity levels; see the compatibility matrix below.

Local stack compatibility

llama.cpp's backend story is the broadest of any local-AI runtime: it runs on more hardware than any competitor, and its CPU fallback is the gold standard for "I just want this to work on whatever I have." But "supports a backend" and "the backend is well-tuned" are different statements. Apple Metal + NVIDIA CUDA are the production paths. ROCm has matured but lags CUDA in flash-attention coverage. Vulkan is the universal fallback for GPUs without a first-class path (Intel Arc, older AMD, NVIDIA on systems where CUDA-build is impractical). For runtime-runtime tradeoffs see /compare/engines/ollama-vs-llama-cpp, /compare/engines/vllm-vs-llama-cpp, and /compare/engines/mlx-vs-llama-cpp.

The compatibility matrix below ranks each backend's operator readiness in 2026.

Setup + day-1 reality

Three install paths, ranked by friction:

  1. Pre-built binary (brew install llama.cpp on macOS, package manager on Linux distros). Lowest friction; you get the default backend (Metal on Mac, CPU + AVX on Linux). Works for 90% of getting-started use cases.
  2. Build from source with backend flag (cmake -B build -DGGML_CUDA=ON && cmake --build build). Required when you want CUDA / ROCm / Vulkan / Sycl. The build needs the matching toolchain (CUDA Toolkit / ROCm / Vulkan SDK) installed and visible to CMake. This is where most operator pain happens: cmake flags drift across releases, and a recipe that worked 6 months ago may not work on master today.
  3. Pre-built CUDA binary (released on GitHub for major versions). Acceptable for stable production-ish use but lags master by days-to-weeks.
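The source-build path above boils down to choosing one CMake option per backend. A minimal sketch of that mapping, assuming only the options named in this review (-DGGML_CUDA=ON, -DGGML_HIP=ON, -DGGML_VULKAN=ON, -DGGML_METAL=ON); the helper name `build_commands` is hypothetical, and because flags drift across releases, docs/build.md should be checked before relying on any of them.

```python
import shlex

# Backend name -> CMake option. Only options mentioned in this review are
# listed; treat the table as an assumption and confirm against docs/build.md.
BACKEND_FLAGS = {
    "cpu": None,
    "cuda": "-DGGML_CUDA=ON",
    "rocm": "-DGGML_HIP=ON",
    "vulkan": "-DGGML_VULKAN=ON",
    "metal": "-DGGML_METAL=ON",
}


def build_commands(backend: str, build_dir: str = "build") -> list[str]:
    """Return the configure and build commands for one backend (hypothetical helper)."""
    if backend not in BACKEND_FLAGS:
        raise ValueError(f"unknown backend: {backend!r}")
    configure = ["cmake", "-B", build_dir]
    flag = BACKEND_FLAGS[backend]
    if flag is not None:
        configure.append(flag)
    build = ["cmake", "--build", build_dir, "--config", "Release"]
    return [shlex.join(configure), shlex.join(build)]


if __name__ == "__main__":
    for command in build_commands("cuda"):
        print(command)
```

Keeping the mapping in one place makes it easy to spot when a recipe copied from an old blog post uses an option the current release no longer accepts.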

Once you have a binary, three CLI tools matter: llama-cli for one-shot chat / completion, llama-server for an HTTP API on localhost:8080 (with OpenAI-compat endpoints under /v1/...), and llama-bench for reproducible per-prompt token-throughput measurement. The benchmark tool is meaningfully better than what most other runtimes ship — see the benchmark methodology checklist for how to use it correctly.

GGUF files: download from Hugging Face (the bartowski account is the de-facto canonical quantizer for new models in 2026; older quants live under TheBloke). Place anywhere accessible; pass -m path/to/model.gguf to any CLI tool.
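Before pointing -m at a freshly downloaded file, it can help to confirm that it really is a GGUF file and to see which format version it declares. A minimal sketch, assuming the GGUF v2+ header layout (4-byte magic "GGUF", then a little-endian uint32 version, uint64 tensor count, and uint64 metadata key/value count); `read_gguf_header` is a hypothetical helper, not part of llama.cpp.

```python
import struct
from typing import NamedTuple


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def read_gguf_header(path: str) -> GGUFHeader:
    """Read the fixed-size GGUF header (assumes the v2+ layout with 64-bit counts)."""
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24 or raw[:4] != b"GGUF":
        raise ValueError(f"{path} does not start with a GGUF header")
    # "<IQQ": little-endian uint32 version, uint64 tensor count, uint64 KV count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:24])
    return GGUFHeader(version, tensor_count, kv_count)


if __name__ == "__main__":
    import sys

    print(read_gguf_header(sys.argv[1]))
```

A truncated download or an HTML error page saved under a .gguf name fails the magic check immediately, which is cheaper than waiting for a model-load error.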

Operational concerns

  • Build flag drift. A working CMake recipe from 6 months ago may fail on current master. The repo's docs/build.md is authoritative; community blog posts go stale fast.
  • Master-vs-release versioning. llama.cpp's release cadence is high — the team tags new releases roughly weekly. Daily master commits are usually safe, but for production pin to a release tag.
  • Sampler defaults. llama.cpp's defaults differ from upstream model card recommendations. For output faithful to the model card, pass --temp, --top-p, --top-k, --min-p, and --repeat-penalty explicitly (-p sets the prompt, not a sampler).
  • No native daemon lifecycle. llama-server is a foreground process; you wrap it in systemd / launchd / Docker yourself. This is a feature, not a gap — but operators new to llama.cpp expect Ollama-style daemon behavior and are surprised.
  • GGUF version migrations. When the GGUF spec adds fields (it does, periodically), older quants stop loading on newer llama.cpp. Re-download from bartowski.
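The sampler-defaults point above is easy to automate: keep each model card's recommended settings in one place and turn them into explicit flags. A minimal sketch; the dictionary keys and the example values are hypothetical, and only the --temp, --top-p, --top-k, and --repeat-penalty flags named in this review are emitted.

```python
# Model-card setting name -> llama.cpp CLI flag. The key names on the left
# are this sketch's own convention, not something llama.cpp defines.
SAMPLER_FLAGS = {
    "temperature": "--temp",
    "top_p": "--top-p",
    "top_k": "--top-k",
    "repeat_penalty": "--repeat-penalty",
}


def sampler_args(card_settings: dict) -> list[str]:
    """Translate model-card sampler settings into explicit CLI arguments."""
    unknown = set(card_settings) - set(SAMPLER_FLAGS)
    if unknown:
        raise ValueError(f"unsupported settings: {sorted(unknown)}")
    args: list[str] = []
    for key, flag in SAMPLER_FLAGS.items():
        if key in card_settings:
            args += [flag, str(card_settings[key])]
    return args


if __name__ == "__main__":
    # Hypothetical values; copy the real ones from the model card.
    settings = {"temperature": 0.6, "top_p": 0.9}
    print(["llama-cli", "-m", "path/to/model.gguf", *sampler_args(settings)])
```

Rejecting unknown keys is deliberate: a misspelled setting that is silently dropped would fall back to the defaults this item warns about.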

Performance reality

llama.cpp's tuning sets the ceiling for what a GGUF-based runtime built on it can achieve. Wrappers (Ollama, LM Studio) match its performance closely. Direct-use llama.cpp can outperform wrappers by 5-10% via flag tuning (-fa for flash-attention on supported backends, -ngl 999 to push all layers to GPU, --no-mmap on systems where mmap behaves badly under memory pressure). Not life-changing, but real. Single-stream tok/s is comparable to Ollama's because Ollama is llama.cpp underneath.

For multi-user concurrent serving, llama.cpp is the wrong tool. It is built around single-stream generation, so concurrent requests queue instead of scaling. For more than one concurrent user, see vLLM or SGLang.

Failure modes (what breaks)

The operator-grade list, ranked by community-benchmark error frequency:

  1. CUDA OOM at long context. Setting -c 8192 or higher (the context size; -n only limits how many tokens are generated) on a 24GB card with a 70B Q4 model exhausts VRAM. Pre-compute the memory budget: 70B Q4 + 8K context ≈ 4 GB KV cache + 40 GB weights + overhead. You need 48GB+ for 70B at 8K.
  2. Wrong CMake backend for your hardware. Operators running CPU when they expected GPU because -DGGML_CUDA=ON wasn't passed. Always check llama-cli's startup banner for the active backend.
  3. flash-attention mismatch. -fa works on CUDA + Metal but not all ROCm versions. If you see "flash attention not supported," drop the flag.
  4. Tokenizer drift on third-party GGUF quants. Some uploaders (rare, but real) ship GGUFs with subtly wrong tokenizer config. Output looks plausible but diverges from upstream model. Use bartowski / official-org quants when possible.
  5. Memory mapping vs RAM tradeoffs. Default mmap behavior loads weights lazily — first inference is slow, subsequent ones fast. With --no-mmap you load eagerly (slow startup, fast first inference). Operators occasionally pick wrong for their use case.
  6. GPU layer-offload miscount. -ngl N offloads N layers to the GPU; the rest stay on CPU. Setting -ngl 999 (all layers) is usually right; setting it to a small number when you have GPU headroom silently under-utilizes the card.
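The memory budget in item 1 can be checked with a few lines of arithmetic before loading anything. A minimal sketch, assuming a Llama-3-70B-like shape (80 layers, 8 KV heads, head dimension 128), an fp16 KV cache, and roughly 4.5 bits per weight for a Q4-class quant; all of these figures are assumptions for illustration, not values from this review.

```python
GB = 1e9  # decimal gigabytes, matching how card capacities are quoted


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Bytes for keys plus values across all layers at a given context length."""
    # Per token and per layer, K and V each hold n_kv_heads * head_dim elements.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def weight_bytes(n_params: float, bits_per_weight: float) -> int:
    """Approximate size of the quantized weights."""
    return int(n_params * bits_per_weight / 8)


if __name__ == "__main__":
    kv = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, n_ctx=8192)
    weights = weight_bytes(70e9, 4.5)
    total = kv + weights
    print(f"KV cache: {kv / GB:.2f} GB")
    print(f"Weights:  {weights / GB:.2f} GB")
    print(f"Total:    {total / GB:.2f} GB (before compute buffers and overhead)")
    print(f"Fits in 24 GB: {total <= 24 * GB}")
    print(f"Fits in 48 GB: {total <= 48 * GB}")
```

Under these assumptions the KV cache comes to about 2.7 GB and the weights to about 39 GB, so the total is far beyond a 24 GB card and fits a 48 GB budget with a few GB left for overhead. Models without grouped-query attention have many more KV heads and a proportionally larger cache, which is one reason to budget more conservatively than the exact figure.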

How llama.cpp compares

Compared to Ollama: same engine. Ollama wraps llama.cpp with daemon + curated model library + sane defaults. Use llama.cpp directly when you need build-flag control, custom sampling, or grammars (GBNF). Use Ollama when you want it to just work.

Compared to vLLM: vLLM is a different architecture entirely — paged attention, continuous batching, tensor-parallel multi-GPU. vLLM dominates production multi-user serving. llama.cpp dominates single-user laptop / homelab inference. They're not really competitors; they live at different points in the operator's career.

Compared to MLX-LM: MLX is Apple Silicon native, no GGUF, faster on-device on M-class hardware for many workloads. llama.cpp's Metal backend is competitive but MLX often edges it on M1-M3 by 5-15%. On M4 the gap narrows. If you're Apple-only and chasing every tok/s, try MLX. If you want one binary that runs everywhere, llama.cpp is right.

Compared to ExLlamaV2: ExLlamaV2 is NVIDIA-only and chases maximum-throughput single-GPU inference of EXL2-quantized models. It outpaces llama.cpp on a 24GB consumer card for the specific scenario it targets (4-5 bpw EXL2, single-user, NVIDIA). If your use case fits, it's faster. Otherwise llama.cpp's portability wins.

Deployment paths

Three operator-grade deployment shapes are documented in the structured deployment-paths section below: build-from-source homelab path (max control, accept the build burden), llama-server with reverse proxy (HTTP API serving a small team), and pre-built binary daily-driver (CLI use for a single operator). Each card under this review shows hardware + complexity + when it fits.

Editorial verdict

llama.cpp is the foundation. It's not always the right user-facing surface: Ollama is for newcomers, vLLM is for production serving, MLX-LM is for Apple-only. But underneath the wrappers (Ollama, LM Studio), the engine that delivers the tokens is llama.cpp. Use it directly when you've outgrown the wrappers' opinions or need control they don't expose. Don't use it directly when the wrapper would suffice and operator-time matters more than the last 5-10% of throughput.

The release cadence + community velocity mean llama.cpp keeps improving faster than any competitor. The team has shipped meaningful performance wins (KV-cache reuse, flash-attention, speculative decoding, quant precision) on a near-monthly basis since 2023. That's compounding, and it's why so much of the consumer local-AI ecosystem ships llama.cpp under the hood.

Last reviewed 2026-05-08 by RunLocalAI editorial. Reproduce or correct: /submit/feedback.

Local stack compatibility
Status | Runtime / Stack | Notes
Excellent | Apple Silicon (M1-M4, Metal) | First-class Metal backend, native autovectorization on AMX. The reference implementation for Apple-side tuning. Sweet spot for 7B-13B at conversational latencies; M-Max + 64GB unified memory comfortably runs 70B Q4. flash-attention available.
Excellent | NVIDIA CUDA (RTX 30/40/50) | Production CUDA path with flash-attention + speculative decoding + tensor-parallel. Build with -DGGML_CUDA=ON. The fastest path on consumer NVIDIA except where ExLlamaV2 wins for EXL2-specific scenarios.
Good | AMD ROCm (RX 7000 / 9000 / Instinct) | ROCm 6+ supported via -DGGML_HIP=ON. Per-feature gaps: flash-attention available on most consumer GPUs, less consistent on Instinct. Worth verifying current state per release.
Excellent | CPU-only (x86_64, AVX2 / AVX512 / AMX) | Reference CPU implementation with hand-tuned SIMD per ISA. The gold standard for laptop / homelab CPU inference. Usable for 7B Q4 on 16GB RAM, 13B Q4 on 32GB. Single-digit tok/s but always works.
Good | Intel Vulkan / Sycl | Vulkan compute path is the universal fallback for GPUs without a first-class backend. Works on Intel Arc, older AMD, and even some NVIDIA via SPIR-V. Performance trails CUDA but is usable.
Excellent | ARM64 (Apple Silicon CPU, Snapdragon, Graviton) | ARM NEON tuned. Runs well on cloud ARM (Graviton, Ampere Altra) for CPU-only inference. Mobile path (Snapdragon X Elite, etc.) lacks NPU acceleration but CPU performance is honest.
Good | Datacenter (H100 / A100 / MI300X) | Runs but underutilizes. Tensor-parallel + paged attention belong to vLLM/SGLang for these cards. Use llama.cpp on datacenter GPUs for ad-hoc dev, single-user inference, or as a portable fallback.
Good | Multi-GPU layer-split | -ngl + --tensor-split + --main-gpu flags work for spilling weights across multiple NVIDIA cards (single-stream serial). Not tensor-parallel; concurrent throughput doesn't scale. Best for fitting one large model across two cards.
Real deployment paths

Pre-built binary daily driver

Complexity: trivial

Brew install / pacman install / GitHub release download. Run llama-cli or llama-server directly, point at a GGUF from bartowski, ship. Lowest friction; covers 80% of single-operator use cases. Choose this when you don't need GPU acceleration on a non-Mac (the pre-built binaries default to CPU on Linux / Windows).

Hardware: Apple Silicon Mac OR consumer NVIDIA / AMD GPU · macOS / Windows / Linux · 16GB+ unified memory or 12GB+ VRAM

Build from source with GPU backend

Complexity: involved

When you need CUDA / ROCm / Vulkan / Sycl acceleration on Linux or Windows. Clone master, install backend toolchain, cmake -B build -DGGML_CUDA=ON (or -DGGML_HIP=ON / -DGGML_VULKAN=ON), cmake --build build -j. Pin to a release tag for production. The reference workflow for max control + max performance + the cost of build-flag drift across releases.

Hardware: GPU + matching toolchain (CUDA / ROCm / Vulkan SDK) · CMake 3.21+ · Linux (preferred) or Windows · 30 min compile budget

llama-server behind reverse proxy

Complexity: moderate

When you want a small-team OpenAI-compatible API but don't yet need vLLM. llama-server runs on :8080 with /v1/chat/completions and /v1/embeddings endpoints. Front it with caddy (auto-TLS) or nginx (manual cert) for HTTPS + basic auth. Wrap in systemd for restart-on-crash. The cliff: real concurrent users (>1) queue and feel slow — when that hurts, migrate to vLLM.

Hardware: Linux server with GPU · static LAN IP · nginx or caddy · systemd unit · 1-3 concurrent users

Setup guidance

Install via package manager (macOS: brew install llama.cpp), or build from source: git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp && cmake -B build && cmake --build build --config Release. The server binary lands at ./build/bin/llama-server.

Download a GGUF model from Hugging Face (e.g. hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF), then start the server: ./build/bin/llama-server -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf --port 8080. Verify: curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'. The server exposes OpenAI-compatible /v1/chat/completions and /v1/completions endpoints plus a web UI at http://localhost:8080.

Most of the first-run time is the download; inference is available as soon as the model has loaded (~5–30 seconds depending on model size). llama.cpp runs on CPU, CUDA, Metal (Apple Silicon), Vulkan, SYCL (Intel), and ROCm (AMD) backends. The backend is fixed when the binary is built; at runtime, pass -ngl 99 to offload all layers to the GPU. Time-to-first-response from zero: ~1 minute including model download for a 3B GGUF.
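The same verification request can be sent from Python using only the standard library. A minimal sketch, assuming the server is listening on http://localhost:8080 as in the command above and that the reply follows the OpenAI chat-completions shape (choices[0].message.content); the function names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    body = json.dumps({"messages": [{"role": "user", "content": prompt}]}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, base_url: str = "http://localhost:8080", timeout: float = 120.0) -> str:
    """Send one prompt and return the assistant's reply text."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    # Assumes the OpenAI-style response layout.
    return reply["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("Hello"))
```

Splitting request construction from sending keeps the payload easy to inspect without a running server, and the generous timeout allows for the first request arriving while the model is still warming up.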

Workload fit

Best for: single-user local chat and inference across every hardware target (CPU, GPU, Apple Silicon), embedding generation with broad model support, GGUF-first model ecosystems, CPU-only server deployments where NVIDIA GPUs aren't available, speculative-drafting pipelines with small/large model pairs, offline/air-gapped inference where Docker or complex Python environments are unavailable.

Not suited for: high-concurrency production API serving (>10 concurrent requests; use vLLM), latency-competitive deployments where every millisecond counts (use TensorRT-LLM), researcher workflows that need to load models from HuggingFace safetensors directly without GGUF conversion, Windows-first deployments (WSL2 is the supported path).

Alternatives

Use llama.cpp for maximum hardware coverage: it is the only production-quality inference engine that runs on CPU, Apple Silicon, AMD Vulkan, Intel SYCL, and NVIDIA CUDA with a single codebase. The GGUF format ecosystem is the widest: tens of thousands of pre-quantized models on HuggingFace at every quant level from Q2 to Q8. Switch to vLLM when you need multi-tenant production serving on NVIDIA datacenter GPUs; under concurrent load, llama.cpp's per-GPU throughput is 3–10× lower than vLLM's. Use MLX-LM when on Apple Silicon and you want Apple-optimized inference with Swift-friendly tooling; llama.cpp on Metal is competitive but MLX-LM often edges it on memory bandwidth utilization. Use Ollama when you want a polished CLI and model management layer; Ollama wraps llama.cpp as its backend but adds download/versioning/conversation management that raw llama.cpp lacks. Stick with llama.cpp for embedding generation, speculative decoding, and any CPU-only deployment.

Troubleshooting + when to switch

Problem: llama_model_load: error loading model: invalid model file.
Fix: The GGUF file is corrupted or incompatible with your llama.cpp version. Check the file's size and SHA-256 against the Hugging Face listing, and re-download from the bartowski or lmstudio-community quantization on HuggingFace if they differ.

Problem: GPU offloading slower than CPU-only on Apple Silicon.
Fix: The Metal backend may be compiling shaders on first inference; run a warmup request first. Ensure -ngl 99 (not -ngl 0) and check with the -v flag that Metal is the active backend. If the expected backend is missing, rebuild with -DGGML_CUDA=ON, -DGGML_METAL=ON, or equivalent. Some models with newer architecture variants (e.g. DeepSeek MoE) also need a recent llama.cpp build.

Problem: Server crashes on >32K context.
Fix: llama.cpp defaults to the model-native context window. Override with -c 8192 if your hardware can't hold the full context. Flash attention (-fa) cuts attention memory use at long context by roughly 30%; enable it for long-context workloads.
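For the "invalid model file" case, one way to rule out a corrupted download is to hash the local file and compare it with the checksum the publisher lists. A minimal sketch, assuming a SHA-256 value is available for the file (for example on its Hugging Face file page); the function names and the command-line usage are hypothetical.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so multi-GB models never sit fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, expected_hex: str) -> bool:
    """Compare the local hash with a published one, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    import sys

    # Usage: python check_gguf.py path/to/model.gguf <published-sha256>
    path, expected = sys.argv[1], sys.argv[2]
    print("OK" if matches_published(path, expected) else "MISMATCH: re-download the file")
```

A matching hash means the file is intact, which narrows the problem to a version mismatch between the quant and the llama.cpp build rather than a bad transfer.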

Stack & relationships

How llama.cpp relates to other entries in the catalog — recommended pairings, alternatives, dependencies, and edges to avoid. Each edge carries a one-line operator note from our editorial team.

llama.cpp ↔ ecosystem

Works with

  • Works with
    AnythingLLM

    Use llama.cpp's OpenAI-compatible /v1 endpoint. Streaming + tool calls work; some quants need explicit chat-template config.

Alternatives

  • Alternative to
    MLX-LM

    On Apple Silicon, MLX-LM is now competitive with llama.cpp Metal — especially on long-context workloads. Pick MLX if you want Apple-native; pick llama.cpp if you want cross-platform GGUF compatibility.

Dependents

  • Used by
    Ollama

    Ollama is a llama.cpp wrapper at the inference layer. Improvements to llama.cpp's quant kernels flow through to Ollama on next release.

  • Used by
    LM Studio

    LM Studio bundles a llama.cpp build under the hood. The desktop UI is the differentiator; the engine is shared, so improvements in llama.cpp's kernel performance flow through to LM Studio on next release.

  • Used by
    LocalAI

    LocalAI uses llama.cpp as one of several backends for LLM inference. Architecture coverage tracks llama.cpp upstream for the LLM path; image/audio backends are separate.

  • Related
    vLLM

    Not a runtime dependency in either direction. vLLM does NOT replace llama.cpp for CPU / Apple Silicon / edge. Different categories; if your hardware is outside vLLM's wheelhouse, use llama.cpp.

  • Related
    Petals

    Not a runtime dependency, but Petals leans on the broader llama.cpp / HuggingFace ecosystem for tokenizers and model weights. Architecture support tracks what those upstreams ship.

Lifecycle

  • Wrapped by
    Ollama

    Ollama wraps llama.cpp with curated model pulls and an OpenAI-compatible API. For most users, Ollama is the front of house and llama.cpp is the engine room.

  • Packaged by
    Llamafile

    Mozilla's single-binary distribution of llama.cpp + the Cosmopolitan libc trick. Same engine, zero-install delivery.

Featured in this stack

The L3 execution stacks that pick this tool as a recommended component, with the one-line note explaining the role it plays in each.

  • Stack · L3·Homelab tier·Role: Inference engine (asymmetric layer-split)
    Mixed RTX 4090 + 3090 workstation — the asymmetric upgrade path

    llama.cpp is the only practical runtime for asymmetric pairs. Its --tensor-split argument accepts unequal ratios; vLLM and SGLang assume symmetric cards and underperform by 2-3× on mixed setups.
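The unequal ratios passed to --tensor-split can be derived from how much memory each card can actually spare. A minimal sketch, assuming the split should simply be proportional to usable VRAM per GPU; the helper name and the example VRAM figures are hypothetical, and a speed-weighted split may suit a mixed pair better than a pure memory split.

```python
from functools import reduce
from math import gcd


def tensor_split_arg(usable_vram_gib: list[int]) -> str:
    """Turn per-GPU usable VRAM (whole GiB) into a comma-separated proportion string."""
    if not usable_vram_gib or any(v <= 0 for v in usable_vram_gib):
        raise ValueError("need at least one GPU with a positive VRAM figure")
    # Divide by the greatest common divisor so 24 and 12 become the ratio 2,1.
    divisor = reduce(gcd, usable_vram_gib)
    return ",".join(str(v // divisor) for v in usable_vram_gib)


if __name__ == "__main__":
    # Hypothetical: one card is driving a display and has less memory to spare.
    split = tensor_split_arg([22, 24])
    print(["llama-server", "-m", "path/to/model.gguf", "-ngl", "999", "--tensor-split", split])
```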

Pros

  • Runs everywhere — including phones
  • Authoritative GGUF tooling
  • Performance-tuned per-architecture

Cons

  • Build-from-source culture
  • CLI-only by default
  • Flag soup

Compatibility

Operating systems: macOS, Linux, Windows, BSD, Android
Backends: NVIDIA CUDA, AMD ROCm, Apple Metal, Vulkan, CPU
License: Open source · free

Runtime health

Operator-grade signals on how actively llama.cpp is being maintained, how fresh its measurements are, and what failure classes operators have flagged. Every label below is anchored to a real date or count — we never infer maintainer activity we can't show.

Release cadence

Derived from the most recent editorial signal on this row.

Active
Updated Jun 12, 2026


Benchmark freshness

How recent the editorial measurements on this runtime are.

0 editorial benchmarks

No editorial benchmarks for this runtime yet.

Community reproduction

Submissions that match an editorial measurement on similar hardware.

0 reproduced reports

No community reproductions on file yet.

Ecosystem stability

Editorial rating from RunLocalAI — qualitative, not measured.

4.6/5 ✓ Editorial

Get llama.cpp

Official site
https://github.com/ggml-org/llama.cpp
GitHub
https://github.com/ggml-org/llama.cpp

Frequently asked

Is llama.cpp free?

Yes — llama.cpp is free to use and open-source.

What operating systems does llama.cpp support?

llama.cpp supports macOS, Linux, Windows, BSD, Android.

Which GPUs work with llama.cpp?

llama.cpp supports NVIDIA CUDA, AMD ROCm, Apple Metal, and Vulkan GPU backends. CPU-only operation is also possible but typically slower.

Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.
