What this does

Full GPU acceleration places all model layers and computation on the GPU, delivering the fastest possible inference speeds. This guide covers setup, verification, and optimization.

Steps

Verify your build supports GPU acceleration.
```
./llama-cli --verbose 2>&1 | grep -i "cuda"
```
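Expected: at least one line that mentions CUDA, such as a report of the detected CUDA devices. No output means the build is likely CPU-only. On Windows, use findstr /i in place of grep -i.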
Offload all layers to the GPU by setting --n-gpu-layers (short form -ngl) above the model's total layer count.
```
./llama-cli -m model.gguf --n-gpu-layers 999 -p "Hello" -n 32
```
llama.cpp caps the value at the model's actual layer count, so 999 is safe.
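To confirm that the offload really covered every layer, the load log can be checked instead of trusting the flag. The sketch below assumes the log contains a line of the form "offloaded N/M layers to GPU"; the exact wording may differ between releases, and check_full_offload is a hypothetical helper name, not part of llama.cpp.

```shell
# Sketch: confirm from the load log that every layer landed on the GPU.
# Assumption: the log contains a line like "offloaded 33/33 layers to GPU".
check_full_offload() {
  # Read log text on stdin and extract the first "N/M" pair after "offloaded".
  pair=$(grep -o 'offloaded [0-9][0-9]*/[0-9][0-9]* layers' \
    | head -n 1 | grep -o '[0-9][0-9]*/[0-9][0-9]*') || true
  if [ -z "$pair" ]; then
    echo "no offload line found"
    return 2
  fi
  on_gpu=${pair%/*}   # number before the slash
  total=${pair#*/}    # number after the slash
  if [ "$on_gpu" -eq "$total" ]; then
    echo "full offload: $pair layers on GPU"
  else
    echo "partial offload: $pair layers on GPU"
    return 1
  fi
}

# Usage:
#   ./llama-cli -m model.gguf --n-gpu-layers 999 -p "Hello" -n 32 2>&1 | check_full_offload
```

A partial result usually means VRAM ran short or a lower layer count was passed.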
Enable Flash Attention for faster context processing.
```
./llama-cli -m model.gguf --n-gpu-layers 999 --flash-attn -p "Long context prompt" -n 128
```
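Flash Attention reduces memory usage and speeds up long-context inference. Some newer builds expect an explicit value (for example --flash-attn on); check ./llama-cli --help if the bare flag is rejected.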
Increase the batch size to improve GPU utilization during prompt processing.
```
./llama-cli -m model.gguf --n-gpu-layers 999 --batch-size 512 -p "Prompt"
```
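Larger batch sizes improve GPU utilization during prompt processing but consume more VRAM. The default differs between releases, so check ./llama-cli --help before assuming 512 is an increase.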
Measure GPU utilization during inference.
```
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
```
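Target: sustained GPU utilization above 90% while tokens are being generated. Memory use should roughly match the model size plus KV cache; it does not need to sit near capacity.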
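Reading a scrolling nvidia-smi log by eye makes the 90% target hard to judge. A minimal sketch for summarising captured samples follows; it assumes one header line followed by rows such as "97 %, 10240 MiB", and summarize_gpu_csv is a hypothetical helper name.

```shell
# Sketch: summarise nvidia-smi CSV samples read from stdin.
# Assumption: one header line, then rows like "97 %, 10240 MiB".
summarize_gpu_csv() {
  awk -F', *' '
    NR == 1 { next }                 # skip the CSV header
    NF >= 2 {
      util = $1; mem = $2
      gsub(/[^0-9.]/, "", util)      # strip the " %" suffix
      gsub(/[^0-9.]/, "", mem)       # strip the " MiB" suffix
      sum += util; n++
      if (mem + 0 > peak) peak = mem + 0
    }
    END {
      if (n == 0) { print "no samples"; exit 1 }
      printf "samples=%d avg_util=%.1f%% peak_mem=%dMiB\n", n, sum / n, peak
    }'
}

# Usage: capture about 30 seconds of samples during inference, then summarise.
#   timeout 30 nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1 > gpu.csv
#   summarize_gpu_csv < gpu.csv
```

The average includes any idle seconds before and after the run, so start the capture once generation is under way.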

Verification

```
# Compare CPU-only vs GPU-accelerated throughput
./llama-bench -m model.gguf -ngl 0 -p 512 -n 128   # CPU baseline
./llama-bench -m model.gguf -ngl 999 -p 512 -n 128  # Full GPU
# Expected: GPU throughput 5-20x higher than CPU
```
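To compare the two runs against the expected range, the tokens-per-second figures can be turned into a single ratio. This is a small sketch; speedup is a hypothetical helper, and the numbers in the usage line are placeholders rather than measured results. The actual ratio depends heavily on the CPU, GPU, and quantization.

```shell
# Sketch: turn two tokens-per-second readings into a speedup factor.
speedup() {
  # $1 = CPU tokens/s, $2 = GPU tokens/s (copied by hand from the t/s column)
  awk -v cpu="$1" -v gpu="$2" 'BEGIN {
    if (cpu <= 0) { print "invalid CPU baseline"; exit 1 }
    printf "%.1fx\n", gpu / cpu
  }'
}

# Usage with placeholder values:
#   speedup 12.5 150
```

Compare like with like: prompt-processing rates against each other, and generation rates against each other.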

Common failures

CUDA out of memory: The model plus KV cache exceeds VRAM. Reduce --batch-size, lower the context size (-c), or use a smaller quantization.
GPU utilization below 50%: Some layers may still be on the CPU, or prompt processing may be the bottleneck. Confirm that all layers are offloaded, then increase the batch size or enable Flash Attention.
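CUDA not detected: Rebuild with CUDA enabled using cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release, then cmake --build build --config Release. Older releases used -DLLAMA_CUDA=ON instead.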
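For the out-of-memory case, a rough KV cache estimate shows how much VRAM the context consumes on top of the model weights. The sketch below uses the standard size formula, 2 (keys and values) x layers x context tokens x KV heads x head dimension x bytes per element. kv_cache_mib is a hypothetical helper, and the model shape in the usage line is illustrative; read the real values from your model's metadata.

```shell
# Sketch: estimate KV cache size in MiB.
kv_cache_mib() {
  # $1 = layers, $2 = context tokens, $3 = KV heads,
  # $4 = head dimension, $5 = bytes per element (2 for f16)
  bytes=$(( 2 * $1 * $2 * $3 * $4 * $5 ))   # leading 2 = keys plus values
  echo $(( bytes / 1024 / 1024 ))
}

# Usage with an illustrative shape (32 layers, 8 KV heads, head dimension 128, f16):
#   kv_cache_mib 32 4096 8 128 2
```

Because the estimate grows linearly with context length, halving the context halves the cache. Compute buffers are not included, so leave some headroom.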

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
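One way to keep that record consistent is a small template. This is only a sketch: write_checkpoint is a hypothetical helper, and every value in the usage line is a placeholder to replace with your own details.

```shell
# Sketch: print a plain-text checkpoint record from four fields.
write_checkpoint() {
  # $1 = runtime or build, $2 = model file, $3 = hardware/backend,
  # $4 = verification output (for example, the measured speedup)
  printf 'runtime: %s\nmodel: %s\nbackend: %s\nverification: %s\n' \
    "$1" "$2" "$3" "$4"
}

# Usage with placeholder values, appended to a notes file:
#   write_checkpoint "llama.cpp <build id>" "model.gguf" "<GPU name>, CUDA" "<llama-bench result>" >> checkpoint.txt
```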
