HOW-TO · INF

How to configure partial GPU offloading for large models

Intermediate · 15 min · By Fredoline Eruo
PREREQUISITES

llama.cpp or Ollama installed, GPU with limited VRAM

What this does

Partial GPU offloading places some transformer layers on the GPU and keeps the rest in system RAM, where they run on the CPU. This lets you run models that do not fit in VRAM, at the cost of slower generation than a full GPU offload.

Steps

  1. Determine your model's total layer count.

    ./llama-cli -m model.gguf --verbose 2>&1 | grep "n_layer"
    

    Expected output: n_layer = 80 (for a 70B model) or similar. llama.cpp prints the field as n_layer, so a search for "n_layers" finds nothing.

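  2. Calculate how many layers your VRAM can hold. Per-layer size depends on the model and its quantization; estimate it as total model size divided by layer count. For 16 GB VRAM with a 70B Q4_K_M model (80 layers):

    • Total needed: ~45 GB
    • Per layer: 45 GB / 80 ≈ 560 MB
    • Max GPU layers: floor(16 GB / 0.56 GB) ≈ 28 layers
    • Leave 1-2 GB of VRAM free for the KV cache and compute buffers, so start around 24 layers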
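
The estimate can be sketched as a small shell function. This is a rough sketch rather than an exact sizing tool: the function name and the reserve parameter are hypothetical, sizes are in MB with integer arithmetic, and the inputs are the illustrative 70B Q4_K_M figures (45 GB total, 80 layers, 16 GB VRAM).

```shell
# Rough estimate of how many layers fit in VRAM (integer MB arithmetic).
# estimate_gpu_layers TOTAL_MB N_LAYERS VRAM_MB RESERVE_MB
#   RESERVE_MB is VRAM held back for KV cache and compute buffers (an assumption).
estimate_gpu_layers() {
    local total_mb=$1 n_layers=$2 vram_mb=$3 reserve_mb=$4
    local per_layer_mb=$(( total_mb / n_layers ))       # average size of one layer
    echo $(( (vram_mb - reserve_mb) / per_layer_mb ))   # integer division rounds down
}

estimate_gpu_layers 45000 80 16000 0      # no headroom: prints 28
estimate_gpu_layers 45000 80 16000 2000   # 2 GB reserved: prints 24
```

Treat the result as a starting point for testing; real layers are not all the same size, and context length changes how much headroom is needed.
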
  3. Run with partial offloading.

    ./llama-cli -m model.gguf --n-gpu-layers 40 --threads 12
    

    40 is only an example value. Substitute your estimate from step 2, then adjust it up or down based on the VRAM usage you observe.

  4. Tune by testing. Start with a coarse sweep of layer counts, then narrow the range between the highest count that loaded and the lowest that failed (a binary search) until VRAM is about 90% utilized.

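    for layers in 20 40 60 80; do
        echo "Testing $layers layers..."
        # Sample VRAM once per second while the model loads and generates
        nvidia-smi --query-gpu=memory.used --format=csv,noheader -l 1 &
        monitor_pid=$!
        # -n 32 caps generation so each test run stays short
        ./llama-cli -m model.gguf --n-gpu-layers "$layers" -p "test" -n 32 --no-display-prompt
        kill "$monitor_pid"
        sleep 1
    done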
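
The narrowing step can be sketched as a binary search over the layer count. This is a minimal sketch under two assumptions: a run that does not fit exits with a non-zero status (some drivers spill into system memory instead of failing), and the function names find_max_layers and probe_llama are hypothetical helpers, not part of llama.cpp.

```shell
# Binary search for the largest --n-gpu-layers value whose probe run succeeds.
# find_max_layers LOW HIGH PROBE
#   LOW   - a layer count assumed to fit (for example 0)
#   HIGH  - the model's total layer count
#   PROBE - name of a command or function that takes a layer count
#           and returns 0 when the run succeeded
find_max_layers() {
    local lo=$1 hi=$2 probe=$3 mid
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))    # round up so the loop always makes progress
        if "$probe" "$mid" >/dev/null 2>&1; then
            lo=$mid                     # mid fits: search the upper half
        else
            hi=$(( mid - 1 ))           # mid failed: search the lower half
        fi
    done
    echo "$lo"
}

# Hypothetical probe: one short generation with the given layer count.
# The model path and token count are placeholders.
probe_llama() {
    ./llama-cli -m model.gguf --n-gpu-layers "$1" -p "test" -n 16 --no-display-prompt
}

# Usage (uncomment to run against a real model with 80 layers):
# find_max_layers 0 80 probe_llama
```

The search finds the hard limit, not the 90% target, so it is reasonable to back off a few layers from the result to leave room for longer contexts.
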
    

Verification

nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv
# Expected: VRAM usage close to capacity, GPU util > 50% during generation
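
To compare a reading against the 90% target, the used and total figures can be turned into a percentage. The helper below is a hypothetical sketch; it assumes one input line in the form produced by --query-gpu=memory.used,memory.total with --format=csv,noheader,nounits (two numbers in MiB separated by a comma), and the sample values are illustrative.

```shell
# vram_percent "USED, TOTAL" -> integer percentage of VRAM in use
vram_percent() {
    local used total
    used=$(echo "$1" | cut -d',' -f1 | tr -d ' ')    # first CSV field
    total=$(echo "$1" | cut -d',' -f2 | tr -d ' ')   # second CSV field
    echo $(( used * 100 / total ))
}

# Illustrative reading: 14746 MiB used of 16384 MiB
vram_percent "14746, 16384"   # prints 90

# Usage on a live system (first GPU only):
# vram_percent "$(nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | head -n 1)"
```
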

Common failures

  • GPU out of memory: Reduce --n-gpu-layers by 10 and retry. Monitor with nvidia-smi -l 1.
  • No speed improvement over CPU-only: Offloading too few layers (< 20%) gives only a marginal benefit. Aim for 50%+ of layers.
  • llama.cpp built without GPU support: Recompile with cmake -DGGML_CUDA=ON (older releases used -DLLAMA_CUDA=ON) or download a pre-built CUDA binary.

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
