RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated

How to use llama.cpp with CUDA acceleration

Intermediate · 20 min · By Fredoline Eruo
Target environment
  • Ubuntu 24.04 · Ollama 0.4.x
  • Windows 11 · Ollama 0.4.x
  • macOS 15 · Ollama 0.4.x
PREREQUISITES

CUDA-capable NVIDIA GPU, NVIDIA drivers, CUDA Toolkit (provides nvcc), CMake, llama.cpp source
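
Before configuring the build, it can help to confirm that the tools behind these prerequisites are reachable from the shell. The following is a minimal sketch, assuming a POSIX shell on Linux. The helper name have_tool is hypothetical and exists only for this example.

```shell
#!/bin/sh
# have_tool is a hypothetical helper for this sketch.
# It succeeds when the named program can be found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# nvidia-smi ships with the NVIDIA driver, nvcc with the CUDA Toolkit,
# and cmake is the build tool used in the steps of this guide.
for tool in nvidia-smi nvcc cmake; do
    if have_tool "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done
```

A "missing" line for nvcc usually points at the first entry under Common failures rather than at a broken GPU.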

What this does

Compiles llama.cpp with CUDA support so that matrix operations execute on the GPU, providing significantly higher throughput for inference on NVIDIA hardware.

Steps

  1. Configure CMake with the CUDA flag.

    cmake -B build -DGGML_CUDA=ON \
      -DCMAKE_CUDA_ARCHITECTURES="75;80;86;90"
    

    Expected output: CMake configuration output listing CUDA as enabled.

  2. Compile the project. The CUDA kernels are compiled alongside the CPU backends.

    cmake --build build --config Release -j$(nproc)
    

    Expected output: Build output ending with successful compilation of all binaries.

  3. Run inference with GPU offload enabled. The -ngl flag controls how many model layers are loaded onto the GPU.

    ./build/bin/llama-cli -m /path/to/model.gguf -ngl 99 -p "Hello"
    

    Expected output: Generation begins with GPU execution logs visible.

  4. Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
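
Step 1 passes an explicit list of CUDA architectures. As an alternative, the value for the local GPU can be derived from its compute capability. This is a sketch under one assumption: the installed nvidia-smi supports the compute_cap query field, which older drivers may not. The helper name cap_to_arch is hypothetical.

```shell
#!/bin/sh
# cap_to_arch is a hypothetical helper for this sketch.
# It turns a compute capability such as "8.6" into the "86" form
# used by CMAKE_CUDA_ARCHITECTURES.
cap_to_arch() {
    printf '%s\n' "$1" | tr -d '. '
}

arch=""
# Assumption: this nvidia-smi build knows the compute_cap field.
if command -v nvidia-smi >/dev/null 2>&1; then
    cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)
    arch=$(cap_to_arch "$cap")
fi

# Accept the result only if it is a non-empty string of digits.
case "$arch" in
    ''|*[!0-9]*)
        echo "Could not detect compute capability; keep the explicit list from step 1."
        ;;
    *)
        echo "cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=$arch"
        ;;
esac
```

The script only prints a suggested configure command. Building for a single architecture generally shortens compile time, at the cost of a binary tuned to that one GPU generation.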

Verification

./build/bin/llama-cli -m /path/to/model.gguf -ngl 99 -p "Test" 2>&1 | grep -i "cuda"
# Expected: at least one output line containing "cuda", confirming GPU participation
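
The same check can be turned into a pass/fail test against a saved log, which is easier to rerun and to keep alongside the run record. This is a coarse sketch: it only confirms that the word "cuda" appears somewhere in the log, not how many layers were offloaded. The helper name log_mentions_cuda and the file name run.log are hypothetical.

```shell
#!/bin/sh
# log_mentions_cuda is a hypothetical helper for this sketch.
# It succeeds when the given log file contains "cuda" in any letter case.
log_mentions_cuda() {
    grep -qi "cuda" "$1"
}

# Usage (model path is a placeholder, as in the steps above):
#   ./build/bin/llama-cli -m /path/to/model.gguf -ngl 99 -p "Test" > run.log 2>&1
#   if log_mentions_cuda run.log; then echo "CUDA seen"; else echo "no CUDA lines"; fi
```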

Common failures

  • CUDA not detected during CMake: nvcc is not in PATH. Add the CUDA Toolkit's bin directory to PATH (commonly /usr/local/cuda/bin on Linux) or point CMake at the compiler with -DCMAKE_CUDA_COMPILER, then re-run CMake.
  • Driver version incompatible: update the NVIDIA driver, or install a CUDA Toolkit version that the installed driver supports.
  • Out of GPU VRAM: offload fewer layers by lowering -ngl (for example, -ngl 50), or use a smaller quantized model.
  • Slow GPU performance despite CUDA being active: ensure all layers are offloaded (-ngl 99). With partial offloading, the remaining layers run on the CPU and data has to move between host and GPU, which adds overhead.
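
For the out-of-VRAM case, a starting value for -ngl can be estimated with simple arithmetic instead of trial and error. The sketch below is a rough heuristic, not something llama.cpp provides: it assumes all layers are about the same size, and it reserves a caller-chosen headroom for the KV cache and scratch buffers. The helper name estimate_ngl and all four parameters are hypothetical.

```shell
#!/bin/sh
# estimate_ngl is a hypothetical helper: a rough upper bound for -ngl.
#   $1 = model file size in MiB
#   $2 = number of layers in the model
#   $3 = free VRAM in MiB
#   $4 = headroom in MiB kept free for KV cache and buffers (a guess)
# Assumption: every layer takes roughly model_size / layers of VRAM.
estimate_ngl() {
    model_mib=$1
    layers=$2
    free_mib=$3
    headroom_mib=$4

    usable=$((free_mib - headroom_mib))
    if [ "$usable" -le 0 ]; then
        echo 0
        return
    fi

    n=$((usable * layers / model_mib))
    if [ "$n" -gt "$layers" ]; then
        n=$layers
    fi
    echo "$n"
}

# Usage: free VRAM can be read with
#   nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
# For an 8000 MiB model with 32 layers, 6000 MiB free, 2000 MiB headroom:
#   estimate_ngl 8000 32 6000 2000
```

Treat the result as a first guess and lower it further if the out-of-memory error persists, since context length and batch size also consume VRAM.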

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
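
One way to capture this checkpoint is to append a short record to a text file after each verified run. The following is a minimal sketch, assuming it is run from the llama.cpp source directory so that ./build/bin/llama-cli exists. The helper name write_run_record and the record layout are hypothetical.

```shell
#!/bin/sh
# write_run_record is a hypothetical helper for this sketch.
#   $1 = file to append the record to
#   $2 = the exact command that was run
write_run_record() {
    out=$1
    cmd=$2
    {
        echo "date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
        echo "command: $cmd"
        # GPU name, driver version, and total memory, if nvidia-smi is available.
        if command -v nvidia-smi >/dev/null 2>&1; then
            echo "gpu: $(nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader 2>/dev/null | head -n 1)"
        else
            echo "gpu: unknown (nvidia-smi not found)"
        fi
        # First line of the llama.cpp version banner, if the binary was built.
        if [ -x ./build/bin/llama-cli ]; then
            echo "llama.cpp: $(./build/bin/llama-cli --version 2>&1 | head -n 1)"
        else
            echo "llama.cpp: unknown (./build/bin/llama-cli not found)"
        fi
        echo "---"
    } >> "$out"
}

# Usage:
#   write_run_record runs.txt './build/bin/llama-cli -m /path/to/model.gguf -ngl 99 -p "Test"'
```

The verification output itself can be appended to the same file by redirecting the verification command with >> runs.txt.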

Related guides

  • How to run inference with llama.cpp server
  • How to quantize a model for llama.cpp
  • Course: Ollama Deep Dive