What this does

Compiles llama.cpp with its CUDA backend so that matrix operations run on an NVIDIA GPU. With the model's layers offloaded to the GPU, inference throughput is typically much higher than on the CPU alone.

Steps

Configure CMake with the CUDA flag.
```
cmake -B build -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="75;80;86;90"
```
Expected output: the CMake configure log reports that the CUDA toolkit and compiler were found, and ends by writing the build files to the build directory.
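
The architecture list in the configure command has to cover the GPU that will run the model. As a rough sketch, the helper below turns compute capabilities such as 8.6 into the values CMake expects, such as 86. The function name caps_to_arch_list is made up for this example, and the nvidia-smi compute_cap query field is assumed to be supported by the installed driver.

```shell
# Convert compute capabilities (one per line on stdin, e.g. "8.6") into a
# de-duplicated, ;-separated list for CMAKE_CUDA_ARCHITECTURES (e.g. "86").
# caps_to_arch_list is a hypothetical helper name, not part of llama.cpp.
caps_to_arch_list() {
  tr -d '. \r' | sort -un | paste -sd ';' -
}

# Query the GPU only when nvidia-smi is available. The compute_cap field is
# an assumption about the installed driver version.
if command -v nvidia-smi >/dev/null 2>&1; then
  archs=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | caps_to_arch_list)
  echo "cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=\"${archs}\""
fi
```

The script only prints a suggested configure command; review the printed architecture list before running it.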
Compile the project. The CUDA kernels are compiled alongside the CPU backends.
```
cmake --build build --config Release -j$(nproc)
```
Expected output: the build finishes without errors and places the binaries, including llama-cli, in build/bin.
Run inference with GPU offload enabled. The -ngl flag sets how many model layers are offloaded to the GPU; a value larger than the model's layer count, such as 99, offloads all of them.
```
./build/bin/llama-cli -m /path/to/model.gguf -ngl 99 -p "Hello"
```
Expected output: the startup log reports the detected CUDA device and how many layers were offloaded to the GPU, then generation begins.

Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
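
One possible way to capture that evidence is a small wrapper that writes the command and its output to a log file. This is a minimal sketch: record_run and the log file name are hypothetical, and arguments are joined with spaces in the log, so the original quoting is not preserved.

```shell
# record_run LOGFILE COMMAND [ARGS...]
# Writes a UTC timestamp, the command line, and the command's combined
# stdout/stderr to LOGFILE. Returns the exit status of the command.
# record_run is a hypothetical helper name used only in this example.
record_run() {
  log=$1
  shift
  {
    echo "date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "command: $*"
    echo "--- output ---"
    "$@" 2>&1
  } > "$log"
}

# Example (not executed here):
#   record_run run-evidence.log ./build/bin/llama-cli -m /path/to/model.gguf -ngl 99 -p "Hello"
```

Adding the llama.cpp commit (for example from git rev-parse --short HEAD in the checkout) and the model file name to the same log makes the run easier to reproduce.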

Verification

./build/bin/llama-cli -m /path/to/model.gguf -ngl 99 -p "Test" 2>&1 | grep -i "cuda"
# Expected: one or more log lines containing "CUDA", confirming that the CUDA backend was initialized
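
A stricter check is to confirm that every layer was offloaded, not only that CUDA appears in the log. The sketch below assumes the load log contains a line of the form "offloaded N/M layers to GPU"; the exact wording can differ between llama.cpp versions, so treat the pattern as an assumption and adjust it to the local output. The function name all_layers_offloaded is hypothetical.

```shell
# all_layers_offloaded: reads a llama.cpp log on stdin and succeeds only if it
# contains a line like "offloaded N/M layers to GPU" with N equal to M.
# The log format is an assumption; verify it against your own output.
all_layers_offloaded() {
  line=$(grep -o 'offloaded [0-9]*/[0-9]* layers to GPU' | tail -n 1)
  [ -n "$line" ] || return 1
  counts=${line#offloaded }   # strip the leading word, leaving "N/M layers to GPU"
  counts=${counts%% *}        # keep only "N/M"
  [ "${counts%/*}" = "${counts#*/}" ]
}

# Example (not executed here):
#   ./build/bin/llama-cli -m /path/to/model.gguf -ngl 99 -p "Test" 2>&1 | all_layers_offloaded \
#     && echo "all layers on GPU"
```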

Common failures

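CUDA not detected during CMake: nvcc is not on PATH. Add the CUDA toolkit's bin directory to PATH (commonly /usr/local/cuda/bin on Linux) or pass -DCMAKE_CUDA_COMPILER=/path/to/nvcc, then re-run CMake from a clean build directory.
Driver version incompatible: compare the driver and supported CUDA version reported by nvidia-smi with the toolkit version reported by nvcc --version, then update the NVIDIA driver or use a CUDA toolkit version compatible with the installed driver.
Out of GPU VRAM: lower -ngl to a value below the model's layer count (for example -ngl 20) or use a smaller quantized model.
Slow GPU performance despite CUDA being active: check that all layers are offloaded (-ngl 99). With partial offloading, the remaining layers run on the CPU and data moves between host and GPU memory during generation, which limits throughput.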
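
When the model does not fit in VRAM, a workable -ngl value can be found by stepping down from a high value until a short run succeeds. The following is a sketch under stated assumptions: find_working_ngl and try_model are hypothetical names, and it assumes llama-cli exits with a non-zero status when the model fails to load.

```shell
# find_working_ngl START STEP RUNNER...
# Calls RUNNER with a layer count appended, starting at START and subtracting
# STEP (which must be positive) until RUNNER succeeds. Prints the first count
# that works, or returns 1 if none does.
find_working_ngl() {
  ngl=$1
  step=$2
  shift 2
  while [ "$ngl" -ge 0 ]; do
    if "$@" "$ngl" >/dev/null 2>&1; then
      echo "$ngl"
      return 0
    fi
    ngl=$((ngl - step))
  done
  return 1
}

# Hypothetical runner: a short generation with the given layer count.
# stdin is closed so the run cannot wait for interactive input.
try_model() {
  ./build/bin/llama-cli -m /path/to/model.gguf -p "Test" -n 8 -ngl "$1" < /dev/null
}

# Example (not executed here):
#   find_working_ngl 99 8 try_model
```

Each attempt reloads the model, so a larger step keeps the search short at the cost of a coarser result.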

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
