How to enable full GPU acceleration for maximum performance
Prerequisite: a GPU with sufficient VRAM for the target model.
What this does
Full GPU acceleration places all model layers and computation on the GPU, delivering the fastest possible inference speeds. This guide covers setup, verification, and optimization.
Steps
1. Verify that your build supports GPU acceleration.
    ./llama-cli --verbose 2>&1 | findstr "CUDA"
   Expected: llama.cpp using CUDA backend, or similar. On Linux or macOS, use grep in place of findstr.
2. Offload all layers to the GPU by setting -ngl (short for --n-gpu-layers) above the total layer count.
    ./llama-cli -m model.gguf --n-gpu-layers 999 -p "Hello" -n 32
   llama.cpp caps the value at the model's actual layer count, so 999 is safe.
3. Enable Flash Attention for faster context processing.
    ./llama-cli -m model.gguf --n-gpu-layers 999 --flash-attn -p "Long context prompt" -n 128
   Flash Attention reduces memory usage and speeds up long-context inference.
4. Increase the batch size to improve GPU utilization.
    ./llama-cli -m model.gguf --n-gpu-layers 999 --batch-size 512 -p "Prompt"
   Larger batch sizes improve GPU utilization but consume more VRAM.
5. Measure GPU utilization during inference.
    nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
   Target: GPU utilization above 90% and memory use near capacity.
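Reading a scrolling nvidia-smi log by eye makes it hard to judge the 90% target. The sketch below averages the samples instead. It assumes a single GPU and input captured with --format=csv,noheader,nounits, so that each line is a bare "utilization, memory" pair. The function name summarize_gpu is hypothetical and is not part of nvidia-smi or llama.cpp.

```shell
# Summarize GPU samples read from stdin.
# Assumed input, one sample per line: "<utilization %>, <memory used MiB>", e.g. from
#   nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader,nounits -l 1
# summarize_gpu is a hypothetical helper name.
summarize_gpu() {
  awk -F', *' '
    NF >= 2 {
      n++                          # count samples
      util += $1                   # running sum of utilization
      if ($2 > peak) peak = $2     # track peak memory use
    }
    END {
      if (n == 0) { print "no samples"; exit 1 }
      printf "samples=%d avg_util=%.1f%% peak_mem=%dMiB\n", n, util / n, peak
    }
  '
}
```

One way to use it: redirect the nvidia-smi output to a file such as gpu.csv while inference runs, stop the logger with Ctrl-C, then run summarize_gpu < gpu.csv. On a multi-GPU machine the samples from all devices would be mixed together, so the average is only meaningful for a single GPU.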
Verification
# Compare CPU-only vs GPU-accelerated throughput
./llama-bench -m model.gguf -ngl 0 -p 512 -n 128 # CPU baseline
./llama-bench -m model.gguf -ngl 999 -p 512 -n 128 # Full GPU
# Expected: GPU throughput 5-20x higher than CPU
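To check the two runs against the expected range, the ratio can be computed rather than estimated. This is a minimal sketch that assumes you copy the tokens-per-second figures by hand from the t/s column of each llama-bench run. The helper name speedup is hypothetical.

```shell
# speedup CPU_TPS GPU_TPS
# Prints the GPU/CPU throughput ratio. Both arguments are tokens-per-second
# values copied by hand from the two llama-bench runs.
# speedup is a hypothetical helper name.
speedup() {
  awk -v cpu="$1" -v gpu="$2" 'BEGIN {
    if (cpu <= 0) { print "CPU throughput must be positive"; exit 1 }
    printf "%.1fx\n", gpu / cpu
  }'
}
```

For example, speedup 12.5 100 prints 8.0x. Compare prompt-processing rows with each other and generation rows with each other, since the two phases scale differently.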
Common failures
- CUDA out of memory: The model plus KV cache exceeds VRAM. Reduce --batch-size or use a smaller quantization.
- GPU utilization below 50%: The bottleneck may be prompt processing. Increase the batch size or enable Flash Attention.
- CUDA not detected: Rebuild with cmake -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release (older llama.cpp releases used -DLLAMA_CUDA=ON).
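The out-of-memory case can be anticipated with rough arithmetic. The KV cache stores a key and a value vector for every layer, context position, and KV head, so its size is about 2 x layers x context x KV heads x head dimension x bytes per element. The sketch below applies that formula. The helper names kv_cache_mib and fits_vram are hypothetical, the model parameters are ones you would read from your own model's metadata, and the result is a lower bound because it ignores compute buffers and runtime overhead.

```shell
# kv_cache_mib N_LAYERS N_CTX N_KV_HEADS HEAD_DIM [BYTES_PER_ELEM]
# Approximate KV cache size in MiB. The leading 2 covers keys and values.
# BYTES_PER_ELEM defaults to 2, which assumes an f16 cache.
kv_cache_mib() {
  bytes_per_elem=${5:-2}
  echo $(( 2 * $1 * $2 * $3 * $4 * bytes_per_elem / 1024 / 1024 ))
}

# fits_vram MODEL_MIB KV_MIB VRAM_MIB
# Lower-bound check only: compute buffers and driver overhead are not counted.
fits_vram() {
  if [ $(( $1 + $2 )) -le "$3" ]; then
    echo "fits"
  else
    echo "too large"
  fi
}
```

With illustrative values of 32 layers, an 8192-token context, 8 KV heads, and a head dimension of 128, kv_cache_mib 32 8192 8 128 gives 1024 MiB. If fits_vram reports "too large", or the margin is thin, a smaller quantization or a shorter context is the usual next step.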
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
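The checkpoint can be kept in a consistent shape with a small helper. This is a sketch only: checkpoint_record and the file name checkpoint.txt are hypothetical, and the commands in the usage comment assume your llama-cli build prints a version with --version and that nvidia-smi is on the path.

```shell
# checkpoint_record RUNTIME MODEL HARDWARE VERIFICATION
# Prints the four checkpoint fields as labeled lines.
# checkpoint_record is a hypothetical helper name.
checkpoint_record() {
  printf 'runtime: %s\nmodel: %s\nhardware: %s\nverification: %s\n' \
    "$1" "$2" "$3" "$4"
}

# Example usage (assumed commands, adjust to your setup):
#   checkpoint_record \
#     "$(./llama-cli --version 2>&1 | head -n 1)" \
#     "model.gguf" \
#     "$(nvidia-smi --query-gpu=name,driver_version --format=csv,noheader)" \
#     "GPU 8.0x faster than CPU baseline" >> checkpoint.txt
```

Appending one record per attempt leaves a short history of which runtime, model, and hardware combinations actually ran.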