llama.cpp slow — when GPU isn't actually doing the work
If llama.cpp tok/s is 5-10x lower than expected on your GPU, the build probably defaulted to CPU, the model is partially CPU-offloaded, or flash-attention isn't enabled. Diagnose in 60 seconds with `--verbose`.
Diagnostic order — most likely first
Build defaulted to CPU (GPU flag missing or build failed silently)
Run `./llama-cli --help` and check the backend list. If you don't see `cuda` / `metal` / `hip` / `vulkan` listed, the build is CPU-only.
Rebuild with the right flag: `cmake -B build -DGGML_CUDA=ON` (or GGML_METAL=ON / GGML_HIP=ON / GGML_VULKAN=ON). Wipe the build dir first to avoid stale CMakeCache: `rm -rf build`.
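A minimal clean-rebuild-and-verify sequence, assuming a CUDA machine and a placeholder `model.gguf` path (swap the backend flag for Metal/HIP/Vulkan):

```sh
# Clean rebuild with the CUDA backend enabled
rm -rf build
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Smoke test: a GPU build reports its backend in the load log, and
# --verbose shows layers being placed on the device rather than the CPU
./build/bin/llama-cli -m model.gguf -ngl 999 --verbose -p "test"
```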
Layers not offloaded to GPU (--n-gpu-layers / -ngl too low)
llama.cpp doesn't auto-offload all layers. Without `-ngl 999` (or a model-specific layer count), layers stay on the CPU. `nvidia-smi` shows low VRAM usage and high CPU usage during generation.
Pass `-ngl 999` to push all layers to GPU. For models that don't fit, pass a number that fits VRAM and accept partial offload. Watch VRAM during load to verify.
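A sketch of verifying the offload (placeholder `model.gguf` path); run the two commands in separate terminals:

```sh
# Terminal 1: request full offload
./llama-cli -m model.gguf -ngl 999 -p "test"

# Terminal 2: poll VRAM during model load; usage should jump by roughly
# the model file size if the offload actually happened
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```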
Flash-attention not enabled
Long-context generation is slower than expected. `--verbose` doesn't mention flash-attention being active.
Add the `-fa` flag (flash-attention). It cuts KV cache memory and speeds decode 20-40% on supported hardware (RTX 30/40/50-series, RDNA 3+, Apple M-series).
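It's one extra flag; a hedged example with a placeholder model path and an 8K context:

```sh
# -fa enables flash-attention; pairs well with full offload at long context
./llama-cli -m model.gguf -ngl 999 -fa -c 8192 -p "test"
```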
Model file is too large for VRAM (paging from disk)
Model loads but generation is brutally slow (1-3 tok/s). `nvidia-smi` shows VRAM at 100%; disk activity high during inference.
Use a smaller quant (Q4_K_M is roughly 15% smaller than Q5_K_M, and about half the size of Q8_0), use a smaller model, or add VRAM by upgrading the GPU.
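To sanity-check the fit before loading, compare the file size against free VRAM (sizes below are approximate and vary by architecture):

```sh
# Rough Q4_K_M file sizes: 7B ~4.1 GB, 13B ~7.9 GB, 70B ~40 GB.
# The file plus KV cache must fit in free VRAM for full offload.
nvidia-smi --query-gpu=memory.free --format=csv
ls -lh model.gguf    # placeholder path
```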
Number of threads misconfigured for prefill
Prefill (processing the prompt) is slow even though decode is fast. The default thread count may not match your CPU's physical core count.
Set `-t <physical-cores>` (physical, not logical/SMT cores). For a Ryzen 7 7700X: `-t 8`. On Apple M-series the default is usually optimal. Setting threads above the physical core count hurts more than it helps.
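Finding the physical core count, then pinning threads to it (Linux and macOS variants shown; model path is a placeholder):

```sh
# Linux: physical cores = Core(s) per socket x Socket(s)
lscpu | grep -E '^(Socket|Core)'

# macOS: physical core count directly
sysctl -n hw.physicalcpu

# Then pin prefill threads to the physical count, e.g. 8:
./llama-cli -m model.gguf -ngl 999 -t 8 -p "test"
```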
Running a quantized model with an FP16 KV cache
Long-context inference saturates VRAM faster than expected. The KV cache at FP16 uses 2x the memory of Q8_0.
Use `--cache-type-k q8_0 --cache-type-v q8_0` to quantize the KV cache (quantizing the V cache requires `-fa`). Saves roughly 50% of context-related VRAM with minimal quality impact.
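A worked example of why this matters, assuming a Llama-2-7B-class model (32 layers, 32 KV heads, head dim 128, no GQA): at FP16 the cache costs 2 (K+V) x 32 layers x 32 heads x 128 dims x 2 bytes = 512 KiB per token, i.e. 2 GiB at 4096 context; Q8_0 roughly halves that. The flags in a full command (placeholder model path):

```sh
# Quantize both halves of the KV cache; quantizing the V cache
# requires flash-attention, so -fa is included
./llama-cli -m model.gguf -ngl 999 -fa \
  --cache-type-k q8_0 --cache-type-v q8_0 -c 4096 -p "test"
```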
Frequently asked questions
What's a normal llama.cpp tok/s on my hardware?
Rough ranges, tok/s at Q4_K_M with `-ngl 999` + `-fa`:

| Hardware | 7B | 13B | 70B |
|---|---|---|---|
| RTX 4090 | ~120 | ~70 | ~12-15 |
| RTX 3090 | ~95 | ~55 | ~10-12 |
| M4 Max | ~85 | ~45 | ~7-9 |

If you're 5-10x lower, the GPU isn't doing the work.
Should I use llama.cpp or vLLM for serving?
llama.cpp for solo / dev workflows + cross-platform compatibility. vLLM for production multi-user serving (paged KV cache + continuous batching). At 10+ concurrent users, vLLM's throughput is 3-5x llama.cpp.
Does llama.cpp support tensor-parallel multi-GPU?
Yes, via `--split-mode row` (or `layer` for layer-wise split). Performance scales about 1.5-1.8x on dual GPUs. ExLlamaV2 / vLLM scale better (1.8-1.9x), but llama.cpp is more portable.
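A hedged dual-GPU sketch (placeholder model path; `--tensor-split` values are relative proportions, not gigabytes):

```sh
# Row-split tensors evenly across two identical GPUs
./llama-cli -m model.gguf -ngl 999 --split-mode row --tensor-split 1,1 -p "test"
```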
Related troubleshooting
Most llama.cpp build failures trace to a missing toolkit (CUDA, Metal, Vulkan SDK), wrong compiler version, or a stale CMake cache. Diagnose in order: PATH first, CMake version second, GCC/MSVC third.
Ollama silently falls back to CPU when it can't load a model into VRAM. Here's how to confirm the fallback, force GPU usage, and pick a model that actually fits.
Why CUDA OOM happens during local LLM inference and image gen, how to confirm the real cause, and the four real fixes (smaller quant, shorter context, gradient checkpointing, or more VRAM).
When the fix is hardware
A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: