What this does

Partial GPU offloading places some transformer layers on the GPU and keeps the rest in system RAM, where the CPU computes them. This lets you run models larger than your VRAM, at the cost of slower generation than a full GPU offload.

Steps

Determine your model's total layer count.
```
./llama-cli -m model.gguf --verbose 2>&1 | grep "n_layer"
```
Expected output: a line containing n_layer = 80 (for a 70B model) or similar.
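Calculate how many layers your VRAM can hold. Per-layer size is roughly the model file size divided by the layer count, so it varies with both model size and quantization. Leave headroom for the KV cache and compute buffers (about 2 GB here; long contexts need more). For 16 GB VRAM with a 70B Q4_K_M model:
- Total needed: ~45 GB
- Per layer: 45 GB / 80 layers ≈ 560 MB
- Max GPU layers: floor((16 GB - 2 GB) × 80 / 45 GB) = 24 layers
Run with partial offloading.
```
./llama-cli -m model.gguf --n-gpu-layers 24 --threads 12
```
Adjust 24 up or down based on your VRAM test.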
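
The layer-budget arithmetic can be sketched as a small shell helper. This is only an estimate under two assumptions: every layer is about the same size (model size divided by layer count), and a fixed amount of VRAM is held back for the KV cache and compute buffers. The function name max_gpu_layers and the 2 GB reserve are illustrative, not part of llama.cpp.

```shell
# Hypothetical helper: estimate how many layers fit in VRAM.
# Arguments: VRAM in GB, model size in GB, layer count, reserved GB (KV cache, buffers).
max_gpu_layers() {
    awk -v vram="$1" -v model="$2" -v layers="$3" -v reserve="$4" 'BEGIN {
        per_layer = model / layers               # assumes layers are roughly equal in size
        n = int((vram - reserve) / per_layer)    # round down to whole layers
        if (n < 0) n = 0                         # nothing fits
        if (n > layers) n = layers               # cannot offload more layers than exist
        print n
    }'
}

# 16 GB VRAM, ~45 GB model, 80 layers, 2 GB reserved
max_gpu_layers 16 45 80 2
```

Treat the result as a starting value for --n-gpu-layers, then confirm it against actual VRAM usage.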

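Tune empirically. Sweep a few layer counts, then narrow the range between the largest value that fit and the smallest that failed (a binary search) until VRAM is about 90% utilized.

```
for layers in 20 40 60 80; do
    echo "Testing $layers layers..."
    # Sample VRAM once per second while the model loads and generates
    nvidia-smi --query-gpu=memory.used --format=csv,noheader -l 1 &
    monitor=$!
    ./llama-cli -m model.gguf --n-gpu-layers "$layers" -p "test" -n 32 --no-display-prompt
    kill "$monitor"
    sleep 1
done
```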
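
A binary search over the layer count might look like the sketch below. It assumes that zero GPU layers always works and that llama-cli exits with a non-zero status when the model does not fit in VRAM, which is worth confirming on your build. Both find_max_layers and try_layers are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: binary-search the largest layer count in [lo, hi]
# for which the given test command succeeds. Assumes lo itself fits.
# The body runs in a subshell so its variables do not leak.
find_max_layers() (
    lo=$1; hi=$2; fits=$3
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))     # round up so the loop always makes progress
        if "$fits" "$mid" >/dev/null 2>&1; then
            lo=$mid                      # mid fits: search higher
        else
            hi=$(( mid - 1 ))            # mid fails: search lower
        fi
    done
    echo "$lo"
)

# Hypothetical test command: a short run at the given layer count.
# Assumes a non-zero exit status when loading fails (for example, out of memory).
try_layers() {
    ./llama-cli -m model.gguf --n-gpu-layers "$1" -p "test" -n 16 --no-display-prompt
}

# Usage (loads the model several times, so it can be slow):
#   find_max_layers 0 80 try_layers
```

A count that loads with a short prompt can still run out of memory at longer contexts, so consider backing off a few layers from the result.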

Verification

```
nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv
# Expected: VRAM usage close to capacity, GPU util > 50% during generation
```
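
To turn "close to capacity" into a number, the used and total memory can be converted to a percentage. The sketch below assumes nvidia-smi is queried with memory.used,memory.total and --format=csv,noheader,nounits, which yields lines such as "14746, 16384" in MiB. The function name vram_percent is hypothetical.

```shell
# Hypothetical helper: read "used, total" lines (MiB) on stdin
# and print the VRAM utilization percentage for each GPU.
vram_percent() {
    awk -F', *' '{ printf "%.0f\n", 100 * $1 / $2 }'
}

# Usage (requires an NVIDIA GPU; run while the model is generating):
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | vram_percent
```

A value around 90 during generation matches the tuning target described in the steps.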

Common failures

GPU out of memory: Reduce --n-gpu-layers by 10 and retry. Monitor with nvidia-smi -l 1.
No speed improvement over CPU-only: Offloading too few layers (under about 20%) gives only marginal benefit. Aim for 50% or more of the layers if VRAM allows.
llama.cpp built without GPU support: Recompile with cmake -DGGML_CUDA=ON (older releases used -DLLAMA_CUDA=ON) or download a pre-built CUDA binary.
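
One way to tell these failures apart is to check how many layers the loader reports as offloaded. The sketch below assumes the startup log contains a line of the form "offloaded 24/81 layers to GPU"; the exact wording and prefix can differ between llama.cpp versions, so treat the pattern as an assumption. The function name offloaded_layers is hypothetical.

```shell
# Hypothetical helper: extract "offloaded total" layer counts from a llama.cpp log on stdin.
# Assumes a log line shaped like "... offloaded 24/81 layers to GPU".
offloaded_layers() {
    sed -n 's|.*offloaded \([0-9][0-9]*\)/\([0-9][0-9]*\) layers to GPU.*|\1 \2|p'
}

# Usage:
#   ./llama-cli -m model.gguf --n-gpu-layers 24 -p "test" -n 16 2>&1 | offloaded_layers
```

If nothing is printed, or the first number is 0 despite a non-zero --n-gpu-layers, the binary was probably built without GPU support.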

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.

How to configure partial GPU offloading for large models