llama.cpp slow — when GPU isn't actually doing the work
If llama.cpp tok/s is 5-10x lower than expected on your GPU, the build probably defaulted to CPU, the model is partially CPU-offloaded, or flash-attention isn't enabled. Diagnose in 60 seconds with `--verbose`.
Diagnostic order — most likely first
Build defaulted to CPU (GPU flag missing or build failed silently)
Run `./llama-cli --help` and check the backend list. If you don't see `cuda` / `metal` / `hip` / `vulkan` listed, the build is CPU-only.
Rebuild with the right flag: `cmake -B build -DGGML_CUDA=ON` (or GGML_METAL=ON / GGML_HIP=ON / GGML_VULKAN=ON). Wipe the build dir first to avoid stale CMakeCache: `rm -rf build`.
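A minimal clean-rebuild-and-verify sequence, assuming a CUDA machine and a placeholder `model.gguf` path (swap the backend flag for Metal/HIP/Vulkan):

```sh
# Clean rebuild with the CUDA backend enabled
rm -rf build
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Smoke test: a GPU build reports its backend in the load log, and
# --verbose shows layers being placed on the device rather than the CPU
./build/bin/llama-cli -m model.gguf -ngl 999 --verbose -p "test"
```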
Layers not offloaded to GPU (--n-gpu-layers / -ngl too low)
llama.cpp doesn't auto-offload all layers. Without `-ngl 999` (or a model-specific layer count), layers stay on the CPU. `nvidia-smi` shows low VRAM usage and high CPU usage during generation.
Pass `-ngl 999` to push all layers to GPU. For models that don't fit, pass a number that fits VRAM and accept partial offload. Watch VRAM during load to verify.
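A sketch of verifying the offload (placeholder `model.gguf` path); run the two commands in separate terminals:

```sh
# Terminal 1: request full offload
./llama-cli -m model.gguf -ngl 999 -p "test"

# Terminal 2: poll VRAM during model load; usage should jump by roughly
# the model file size if the offload actually happened
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```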
Flash-attention not enabled
Long-context generation is slower than expected. `--verbose` doesn't mention flash-attention being active.
Add the `-fa` flag (flash-attention). It cuts KV cache memory and speeds decode 20-40% on supported hardware (RTX 30/40/50-series, RDNA 3+, Apple M-series).
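It's one extra flag; a hedged example with a placeholder model path and an 8K context:

```sh
# -fa enables flash-attention; pairs well with full offload at long context
./llama-cli -m model.gguf -ngl 999 -fa -c 8192 -p "test"
```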
Model file is too large for VRAM (paging from disk)
Model loads but generation is brutally slow (1-3 tok/s). `nvidia-smi` shows VRAM at 100%; disk activity high during inference.
Use a smaller quant (Q4_K_M is roughly 15% smaller than Q5_K_M, and about half the size of Q8_0), use a smaller model, or add VRAM by upgrading the GPU.
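To sanity-check the fit before loading, compare the file size against free VRAM (sizes below are approximate and vary by architecture):

```sh
# Rough Q4_K_M file sizes: 7B ~4.1 GB, 13B ~7.9 GB, 70B ~40 GB.
# The file plus KV cache must fit in free VRAM for full offload.
nvidia-smi --query-gpu=memory.free --format=csv
ls -lh model.gguf    # placeholder path
```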
Number of threads misconfigured for prefill
Prefill (processing the prompt) is slow even though decode is fast. The default thread count may not match your CPU's physical core count.
Set `-t <physical-cores>` (physical, not logical/SMT cores). For a Ryzen 7 7700X: `-t 8`. On Apple M-series the default is usually optimal. Setting threads above the physical core count hurts more than it helps.
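Finding the physical core count, then pinning threads to it (Linux and macOS variants shown; model path is a placeholder):

```sh
# Linux: physical cores = Core(s) per socket x Socket(s)
lscpu | grep -E '^(Socket|Core)'

# macOS: physical core count directly
sysctl -n hw.physicalcpu

# Then pin prefill threads to the physical count, e.g. 8:
./llama-cli -m model.gguf -ngl 999 -t 8 -p "test"
```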
Running a quantized model with an FP16 KV cache
Long-context inference saturates VRAM faster than expected. The KV cache at FP16 uses 2x the memory of Q8_0.
Use `--cache-type-k q8_0 --cache-type-v q8_0` to quantize the KV cache (quantizing the V cache requires `-fa`). Saves roughly 50% of context-related VRAM with minimal quality impact.
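A worked example of why this matters, assuming a Llama-2-7B-class model (32 layers, 32 KV heads, head dim 128, no GQA): at FP16 the cache costs 2 (K+V) x 32 layers x 32 heads x 128 dims x 2 bytes = 512 KiB per token, i.e. 2 GiB at 4096 context; Q8_0 roughly halves that. The flags in a full command (placeholder model path):

```sh
# Quantize both halves of the KV cache; quantizing the V cache
# requires flash-attention, so -fa is included
./llama-cli -m model.gguf -ngl 999 -fa \
  --cache-type-k q8_0 --cache-type-v q8_0 -c 4096 -p "test"
```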
Frequently asked questions
What's a normal llama.cpp tok/s on my hardware?
Rough ranges, tok/s at Q4_K_M with `-ngl 999` + `-fa`:

| Hardware | 7B | 13B | 70B |
|---|---|---|---|
| RTX 4090 | ~120 | ~70 | ~12-15 |
| RTX 3090 | ~95 | ~55 | ~10-12 |
| M4 Max | ~85 | ~45 | ~7-9 |

If you're 5-10x lower, the GPU isn't doing the work.
Should I use llama.cpp or vLLM for serving?
llama.cpp for solo / dev workflows + cross-platform compatibility. vLLM for production multi-user serving (paged KV cache + continuous batching). At 10+ concurrent users, vLLM's throughput is 3-5x llama.cpp.
Does llama.cpp support tensor-parallel multi-GPU?
Yes, via `--split-mode row` (or `layer` for layer-wise split). Performance scales about 1.5-1.8x on dual GPUs. ExLlamaV2 / vLLM scale better (1.8-1.9x), but llama.cpp is more portable.
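A hedged dual-GPU sketch (placeholder model path; `--tensor-split` values are relative proportions, not gigabytes):

```sh
# Row-split tensors evenly across two identical GPUs
./llama-cli -m model.gguf -ngl 999 --split-mode row --tensor-split 1,1 -p "test"
```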
Related troubleshooting
Most llama.cpp build failures trace to a missing toolkit (CUDA, Metal, Vulkan SDK), wrong compiler version, or a stale CMake cache. Diagnose in order: PATH first, CMake version second, GCC/MSVC third.
Ollama silently falls back to CPU when it can't load a model into VRAM. Here's how to confirm the fallback, force GPU usage, and pick a model that actually fits.
Why CUDA OOM happens during local LLM inference and image gen, how to confirm the real cause, and the four real fixes (smaller quant, shorter context, gradient checkpointing, or more VRAM).
When the fix is hardware
A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: