How to use llama.cpp with CUDA acceleration
Prerequisites: a CUDA-capable NVIDIA GPU, a current NVIDIA driver, the CUDA Toolkit (which provides nvcc), CMake, and the llama.cpp source tree.
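The prerequisites above can be checked before building. The following is a minimal sketch, not part of llama.cpp: the helper names have_tool and cap_to_arch are made up for illustration, and the compute_cap query field is assumed to be supported by the installed driver (older drivers may reject it).

```shell
# Preflight sketch: report which build tools are visible on PATH.
# have_tool and cap_to_arch are illustrative helpers, not llama.cpp commands.

# Succeed if the named executable can be found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Turn a compute capability such as "8.6" into the "86" form
# used by CMAKE_CUDA_ARCHITECTURES.
cap_to_arch() {
    printf '%s\n' "$1" | tr -d '.'
}

for tool in cmake nvcc nvidia-smi; do
    if have_tool "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done

# If the driver utilities are installed, list each GPU with its compute capability.
# Assumption: the driver is recent enough to know the compute_cap field.
if have_tool nvidia-smi; then
    nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader || true
fi
```

A "missing: nvcc" line usually means the CUDA Toolkit is not installed or its bin directory is not on PATH. The reported compute capability, passed through cap_to_arch, gives a value suitable for CMAKE_CUDA_ARCHITECTURES.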
What this does
Builds llama.cpp with the CUDA backend enabled so that matrix operations run on the GPU. On NVIDIA hardware this typically gives much higher inference throughput than a CPU-only build.
Steps
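Configure CMake with the CUDA flag. CMAKE_CUDA_ARCHITECTURES lists the compute capabilities to compile kernels for (here 7.5, 8.0, 8.6, and 9.0); trim or extend the list to match the target GPU.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="75;80;86;90"
Expected output: CMake configuration output listing CUDA as enabled.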
Compile the project. The CUDA kernels are compiled alongside the CPU backends.
cmake --build build --config Release -j$(nproc)
Expected output: Build output ending with successful compilation of all binaries.
Run inference with GPU offload enabled. The -ngl flag controls how many model layers are loaded onto the GPU; a value at or above the model's layer count offloads all of them.
./build/bin/llama-cli -m /path/to/model.gguf -ngl 99 -p "Hello"
Expected output: Generation begins with GPU execution logs visible.
Record the local run evidence. Save the exact command, the llama.cpp version or commit, the model name, and the observed output so the result can be reproduced later.
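Recording the run evidence can be scripted. The sketch below assumes a POSIX shell; record_run is a hypothetical helper and run-evidence.log is a placeholder file name. It appends a timestamp, the command line, the combined output, and the exit status to a log file.

```shell
# Sketch: run a command and append its command line, a UTC timestamp,
# its combined stdout/stderr, and its exit status to a log file.
# record_run is a hypothetical helper, not part of llama.cpp.
record_run() {
    log_file=$1
    shift
    status=0
    {
        echo "date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
        echo "command: $*"
        echo "--- output ---"
        "$@" 2>&1 || status=$?
        echo "--- exit status: $status ---"
    } >>"$log_file"
    return "$status"
}

# Usage (not executed here; paths are placeholders):
#   record_run run-evidence.log ./build/bin/llama-cli --version
#   record_run run-evidence.log ./build/bin/llama-cli -m /path/to/model.gguf -ngl 99 -p "Hello"
```

The logged command line is joined with spaces, so quoting around arguments such as the prompt is not preserved exactly; copy the original command alongside the log if that matters.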
Verification
./build/bin/llama-cli -m /path/to/model.gguf -ngl 99 -p "Test" 2>&1 | grep -i "cuda"
# Expected: output line containing "cuda" confirming GPU participation
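For a scripted check, the same test can be wrapped so that it yields a pass or fail status. This is a minimal sketch: mentions_cuda is a hypothetical helper and run.log is a placeholder file name. It tests only whether the text "cuda" appears in the output, which suggests but does not prove that layers actually ran on the GPU.

```shell
# Sketch: succeed if any line on stdin mentions "cuda" (case-insensitive).
# mentions_cuda is a hypothetical helper, not part of llama.cpp.
mentions_cuda() {
    grep -qi 'cuda'
}

# Usage (not executed here; paths are placeholders):
#   ./build/bin/llama-cli -m /path/to/model.gguf -ngl 99 -p "Test" >run.log 2>&1
#   if mentions_cuda <run.log; then
#       echo "CUDA mentioned in output"
#   else
#       echo "no CUDA lines found"
#   fi
```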
Common failures
- CUDA not detected during CMake: nvcc is not in PATH. Add the CUDA Toolkit's bin directory to PATH (for example /usr/local/cuda/bin on a default Linux install) and re-run CMake.
- Driver version incompatible: Update the NVIDIA driver or use a CUDA Toolkit version compatible with the installed driver.
- Out of GPU VRAM: Reduce the number of offloaded layers with a lower -ngl value (for example -ngl 50) or use a smaller quantized model.
- Slow GPU performance despite CUDA being active: Ensure all layers are offloaded (-ngl 99). With partial offloading, the remaining layers run on the CPU and data has to move between CPU and GPU, which adds overhead.
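When VRAM is the limit, a rough starting value for -ngl can be estimated from the model file size and the free VRAM. The sketch below rests on strong assumptions: it treats every layer as the same size, ignores the KV cache and context length, and reserves a fixed margin. estimate_ngl and all numbers shown are hypothetical; treat the result as a first guess to refine by trial.

```shell
# estimate_ngl MODEL_MIB N_LAYERS FREE_MIB [RESERVE_MIB]
# Rough estimate of how many layers fit in free VRAM, assuming all layers
# are equally sized. Ignores KV cache and other buffers (assumption).
# Free VRAM in MiB can be read with:
#   nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
estimate_ngl() {
    model_mib=$1
    n_layers=$2
    free_mib=$3
    reserve_mib=${4:-1024}   # margin kept free for buffers (assumed default)

    # Guard against nonsensical inputs that would divide by zero.
    if [ "$model_mib" -le 0 ] || [ "$n_layers" -le 0 ]; then
        echo 0
        return 0
    fi

    usable=$((free_mib - reserve_mib))
    if [ "$usable" -le 0 ]; then
        echo 0
        return 0
    fi

    # layers that fit = usable / (model_mib / n_layers), in integer arithmetic
    fit=$((usable * n_layers / model_mib))
    if [ "$fit" -gt "$n_layers" ]; then
        fit=$n_layers
    fi
    echo "$fit"
}

# Hypothetical example: 4096 MiB model, 32 layers, 3072 MiB free, 1024 MiB reserve
estimate_ngl 4096 32 3072 1024   # prints 16
```

If the run still fails with an out-of-memory error at the estimated value, lower -ngl further; long contexts in particular need VRAM that this estimate does not account for.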
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.