RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts: there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend.

© 2026 runlocalai.co · Independently operated
degrades · ✓ Editorial · Reviewed May 2026

llama.cpp slow — when GPU isn't actually doing the work

If llama.cpp tok/s is 5-10x lower than expected on your GPU, the build probably defaulted to CPU, the model is partially CPU-offloaded, or flash-attention isn't enabled. Diagnose in 60 seconds with --verbose.

llama.cpp · NVIDIA CUDA · AMD ROCm · Apple Metal · Vulkan backend
By Fredoline Eruo · Last verified 2026-05-08

Diagnostic order — most likely first

#1

Build defaulted to CPU (GPU flag missing or build failed silently)

Diagnose

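Run `./llama-cli --help` and check the output for a GPU backend. If none of `cuda` / `metal` / `hip` / `vulkan` appears, the build is CPU-only. Recent builds also accept `--list-devices`, which prints the devices llama.cpp can see and exits; a list with no GPU points to the same problem.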

Fix

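Wipe the build directory first so a stale CMakeCache cannot carry over old settings: `rm -rf build`. Then reconfigure with the right flag, `cmake -B build -DGGML_CUDA=ON` (or `-DGGML_METAL=ON` / `-DGGML_HIP=ON` / `-DGGML_VULKAN=ON`), and compile with `cmake --build build --config Release`. Check the configure output for the backend being found before you trust the binary.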
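The clean rebuild can be sketched as a short script. This is a minimal sketch, assuming it runs from the root of a llama.cpp checkout with CMake on PATH; `BACKEND_FLAGS`, `rebuild_commands`, and `clean_rebuild` are illustrative names, not part of llama.cpp.

```python
import shutil
import subprocess

# CMake options named in the fix above, keyed by backend.
BACKEND_FLAGS = {
    "cuda": "-DGGML_CUDA=ON",
    "metal": "-DGGML_METAL=ON",
    "hip": "-DGGML_HIP=ON",
    "vulkan": "-DGGML_VULKAN=ON",
}


def rebuild_commands(backend: str, build_dir: str = "build") -> list[list[str]]:
    """Return the configure and compile commands for one backend."""
    flag = BACKEND_FLAGS[backend.lower()]  # KeyError on an unknown backend
    return [
        ["cmake", "-B", build_dir, flag],
        ["cmake", "--build", build_dir, "--config", "Release"],
    ]


def clean_rebuild(backend: str, build_dir: str = "build") -> None:
    """Delete the old build directory, then configure and compile."""
    shutil.rmtree(build_dir, ignore_errors=True)  # removes any stale CMakeCache.txt
    for cmd in rebuild_commands(backend, build_dir):
        subprocess.run(cmd, check=True)  # stop at the first failing step


if __name__ == "__main__":
    # Print the commands instead of running them, as a dry run.
    for cmd in rebuild_commands("cuda"):
        print(" ".join(cmd))
```

Using `check=True` means a failed configure step raises immediately instead of leaving a CPU-only binary behind unnoticed.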

#2

Layers not offloaded to GPU (--n-gpu-layers / -ngl too low)

Diagnose

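llama.cpp does not necessarily offload all layers on its own. Without `-ngl 999` (or a model-specific count), layers can stay on the CPU. Symptoms: `nvidia-smi` shows low VRAM usage while CPU usage is high during generation. The load log also reports how many layers were offloaded, in a line of the form `offloaded N/M layers to GPU`.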

Fix

Pass `-ngl 999` to push all layers to GPU. For models that don't fit, pass a number that fits VRAM and accept partial offload. Watch VRAM during load to verify.
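For partial offload, a starting value for `-ngl` can be estimated from the model file size. The sketch below is a rough estimate only: it assumes all layers are the same size, ignores the embedding and output tensors, and holds back a hypothetical 2 GiB reserve for KV cache and compute buffers. `layers_that_fit` is an illustrative helper, not a llama.cpp function.

```python
GIB = 1024**3


def layers_that_fit(model_bytes: int, n_layers: int, free_vram_bytes: int,
                    reserve_bytes: int = 2 * GIB) -> int:
    """Estimate how many equally sized layers fit in the free VRAM."""
    if n_layers <= 0 or model_bytes <= 0:
        raise ValueError("model_bytes and n_layers must be positive")
    per_layer = model_bytes / n_layers          # assumed uniform layer size
    budget = free_vram_bytes - reserve_bytes    # leave room for KV cache and buffers
    if budget <= 0:
        return 0
    return min(n_layers, int(budget // per_layer))


if __name__ == "__main__":
    # Hypothetical case: 40 GiB model file, 80 layers, 24 GiB of free VRAM.
    print(layers_that_fit(40 * GIB, 80, 24 * GIB))
```

Treat the result as a first guess: start there, watch VRAM during load, and step the count down if the load fails or VRAM is pinned at the limit.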

#3

Flash-attention not enabled

Diagnose

Long-context generation is slower than expected. `--verbose` doesn't mention flash-attention being active.

Fix

Add the `-fa` flag (flash-attention); recent builds take a value, as in `-fa on`. It cuts attention memory at long context and speeds decode 20-40% on supported hardware (RTX 30/40/50-series, RDNA 3+, Apple M-series).

#4

Model file is too large for VRAM (paging from disk)

Diagnose

Model loads but generation is brutally slow (1-3 tok/s). `nvidia-smi` shows VRAM at 100%; disk activity high during inference.
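The VRAM check can be scripted by polling `nvidia-smi` in CSV mode. A minimal sketch, assuming an NVIDIA driver with `nvidia-smi` on PATH and numeric values in every field; `parse_gpu_stats` and `read_gpu_stats` are illustrative helper names.

```python
import subprocess

# Query used/total memory (MiB) and GPU utilization (%) as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> list[dict]:
    """Parse one line per GPU of 'used, total, util' into dictionaries."""
    stats = []
    for line in csv_text.strip().splitlines():
        used, total, util = (float(field.strip()) for field in line.split(","))
        stats.append({
            "used_mib": used,
            "total_mib": total,
            "util_pct": util,
            "vram_fraction": used / total,
        })
    return stats


def read_gpu_stats() -> list[dict]:
    """Run nvidia-smi once and return the parsed per-GPU stats."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)


if __name__ == "__main__":
    for i, gpu in enumerate(read_gpu_stats()):
        print(f"GPU {i}: {gpu['vram_fraction']:.0%} VRAM, {gpu['util_pct']:.0f}% util")
```

A `vram_fraction` close to 1.0 together with very low tok/s matches the symptom described here; a low fraction with high CPU usage points back to cause #2.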

Fix

Use a smaller quant: Q4_K_M instead of Q8_0 roughly halves weight memory, while Q4_K_M instead of Q5_K_M saves only about 15%. Or use a smaller model. Or add VRAM by upgrading the GPU.
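To see how much a quant change saves, weight memory can be estimated as parameters times bits per weight. The bits-per-weight figures below are approximations assumed for illustration (K-quants mix block types, so real files vary by model); `weight_gib` is a hypothetical helper.

```python
GIB = 1024**3

# Approximate average bits per weight; assumed values, not exact for every model.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.9,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,   # 32 int8 weights plus one fp16 scale per block
    "F16": 16.0,
}


def weight_gib(n_params: float, quant: str) -> float:
    """Estimate weight memory in GiB for a parameter count and quant type."""
    return n_params * APPROX_BITS_PER_WEIGHT[quant] / 8 / GIB


if __name__ == "__main__":
    for quant in APPROX_BITS_PER_WEIGHT:
        print(f"70B at {quant}: ~{weight_gib(70e9, quant):.1f} GiB")
```

Under these assumptions a 70B model at Q4_K_M comes out at about 40 GiB of weights, before KV cache and compute buffers are added.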

#5

Number of threads misconfigured for prefill

Diagnose

Prefill (processing the prompt) is slow even though decode is fast. Default thread count may not match your CPU.

Fix

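Set `-t <physical-cores>` (not logical/SMT cores). For a Ryzen 7700X (8 cores, 16 threads): `-t 8`. Prompt processing has its own thread setting, `-tb` / `--threads-batch`, which follows `-t` unless you set it. On Apple M-series, the default is usually optimal. Avoid setting threads higher than the physical core count; it hurts more than it helps.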
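The physical-core rule can be sketched as a one-line calculation. This assumes a uniform CPU where every core exposes the same number of hardware threads (2 with SMT enabled); hybrid designs with mixed core types break that assumption. `suggested_threads` is an illustrative helper.

```python
import os


def suggested_threads(logical_cpus: int, threads_per_core: int = 2) -> int:
    """Derive a physical-core count from the logical CPU count."""
    if logical_cpus < 1 or threads_per_core < 1:
        raise ValueError("counts must be at least 1")
    return max(1, logical_cpus // threads_per_core)


if __name__ == "__main__":
    logical = os.cpu_count() or 1  # os.cpu_count() reports logical CPUs
    print(f"-t {suggested_threads(logical)}")
```

For the Ryzen 7700X example, 16 logical CPUs at 2 threads per core gives `-t 8`. On a machine with SMT disabled, pass `threads_per_core=1`.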

#6

Running quantized model with FP16 KV cache

Diagnose

Long-context inference saturates VRAM faster than expected. KV cache at FP16 uses 2x the memory of Q8_0.

Fix

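Use `--cache-type-k q8_0 --cache-type-v q8_0` to quantize the KV cache. This saves close to 50% of context-related VRAM (Q8_0 stores about 8.5 bits per value against 16 for FP16) with minimal quality impact. Quantizing the V cache generally requires flash-attention, so pair it with `-fa`.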
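The saving can be checked with the usual KV cache size estimate: two tensors (K and V) per layer, each holding `n_kv_heads * head_dim` values per token. The sketch below is an estimate under stated assumptions; the model shape in the example (32 layers, 8 KV heads, head dimension 128) is an illustrative 8B-class configuration, and `kv_cache_bytes` is a hypothetical helper.

```python
GIB = 1024**3

# Bytes per cached value. A Q8_0 block packs 32 int8 values plus one fp16 scale.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, cache_type: str = "f16") -> float:
    """Estimate KV cache size in bytes for a given context length."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # K and V tensors
    return elements * BYTES_PER_ELEMENT[cache_type]


if __name__ == "__main__":
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=32768)
    f16 = kv_cache_bytes(**shape, cache_type="f16")
    q8 = kv_cache_bytes(**shape, cache_type="q8_0")
    print(f"f16: {f16 / GIB:.3f} GiB, q8_0: {q8 / GIB:.3f} GiB, "
          f"saving: {1 - q8 / f16:.1%}")
```

For this assumed shape at a 32K context, the estimate is 4 GiB at FP16 and 2.125 GiB at Q8_0, a saving of just under 47%.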

Frequently asked questions

What's a normal llama.cpp tok/s on my hardware?

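Rough ranges (Q4_K_M with `-ngl 999` and `-fa`):

  • RTX 4090: 7B ~120 t/s, 13B ~70, 70B ~12-15
  • RTX 3090: 7B ~95, 13B ~55, 70B ~10-12
  • M4 Max: 7B ~85, 13B ~45, 70B ~7-9

If you're 5-10x lower, the GPU isn't doing the work. A 70B model at Q4_K_M is roughly 40 GB, which exceeds a single 24 GB card, so the 70B figures for the RTX 4090 and 3090 cannot describe a single card with every layer in VRAM.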
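The 5-10x rule of thumb can be turned into a quick check against the rough ranges above. A minimal sketch: the table copies the figures from this answer (taking the lower end where a range is given), and `slowdown` and `gpu_probably_idle` are illustrative helpers with a hypothetical default threshold of 5x.

```python
# Rough expected tok/s from the ranges above (lower bound where a range is given).
EXPECTED_TOKS = {
    ("RTX 4090", "7B"): 120, ("RTX 4090", "13B"): 70, ("RTX 4090", "70B"): 12,
    ("RTX 3090", "7B"): 95,  ("RTX 3090", "13B"): 55, ("RTX 3090", "70B"): 10,
    ("M4 Max", "7B"): 85,    ("M4 Max", "13B"): 45,   ("M4 Max", "70B"): 7,
}


def slowdown(measured_toks: float, hardware: str, model_size: str) -> float:
    """Return how many times slower the measured rate is than the rough expectation."""
    if measured_toks <= 0:
        raise ValueError("measured_toks must be positive")
    return EXPECTED_TOKS[(hardware, model_size)] / measured_toks


def gpu_probably_idle(measured_toks: float, hardware: str, model_size: str,
                      threshold: float = 5.0) -> bool:
    """Flag runs that are at least `threshold` times slower than expected."""
    return slowdown(measured_toks, hardware, model_size) >= threshold


if __name__ == "__main__":
    print(gpu_probably_idle(15, "RTX 4090", "7B"))   # 120 / 15 = 8x slower
    print(gpu_probably_idle(100, "RTX 4090", "7B"))  # 1.2x slower
```

A flagged run is a prompt to work through the diagnostic order above, not proof of a CPU fallback; the expected figures are rough.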

Should I use llama.cpp or vLLM for serving?

llama.cpp for solo / dev workflows + cross-platform compatibility. vLLM for production multi-user serving (paged KV cache + continuous batching). At 10+ concurrent users, vLLM's throughput is 3-5x llama.cpp.

Does llama.cpp support tensor-parallel multi-GPU?

Yes, via `--split-mode row`, which splits tensors across GPUs; the default, `--split-mode layer`, assigns whole layers to each GPU instead. Performance scales 1.5-1.8x on dual-GPU. ExLlamaV2 / vLLM scale better (1.8-1.9x), but llama.cpp is more portable.

Related troubleshooting

llama.cpp build failed (CUDA / Metal / Vulkan flags rejected)

Most llama.cpp build failures trace to a missing toolkit (CUDA, Metal, Vulkan SDK), wrong compiler version, or a stale CMake cache. Diagnose in order: PATH first, CMake version second, GCC/MSVC third.

Ollama is slow / running on CPU instead of GPU

Ollama silently falls back to CPU when it can't load a model into VRAM. Here's how to confirm the fallback, force GPU usage, and pick a model that actually fits.

CUDA out of memory

Why CUDA OOM happens during local LLM inference and image gen, how to confirm the real cause, and the four real fixes (smaller quant, shorter context, gradient checkpointing, or more VRAM).

When the fix is hardware

A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time:

  • Best GPU for local AI
  • Best laptop for local AI
  • Best Mac for local AI
