What this does

In llama.cpp, the --n-gpu-layers (-ngl) flag sets how many model layers are offloaded to the GPU; the remaining layers run on the CPU. Offloading as many layers as fit in VRAM maximizes throughput while avoiding out-of-memory errors.

Steps

Query available VRAM before loading.

nvidia-smi --query-gpu=memory.total,memory.free --format=csv
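
The query above can feed the later calculation directly. The sketch below assumes the noheader,nounits variant of the CSV format (values in MiB) and uses a hypothetical helper name, usable_vram_gb. It looks only at the first GPU and rounds down, so treat the result as a conservative starting point rather than an exact figure.

```shell
#!/usr/bin/env bash
# usable_vram_gb (hypothetical helper): reads "total, free" MiB pairs on stdin,
# as printed by nvidia-smi with --format=csv,noheader,nounits, and prints the
# first GPU's free VRAM in whole GiB after subtracting a reserve (default 2).
usable_vram_gb() {
  local reserve_gb="${1:-2}"
  awk -F', *' -v reserve="$reserve_gb" 'NR == 1 {
    usable = int($2 / 1024) - reserve   # MiB -> GiB, rounded down
    if (usable < 0) usable = 0          # never report negative capacity
    print usable
  }'
}

# Usage (requires an NVIDIA GPU and driver):
# nvidia-smi --query-gpu=memory.total,memory.free --format=csv,noheader,nounits | usable_vram_gb 2
```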

Find the model's total layer count.
```
./llama-cli -m model.gguf --verbose 2>&1 | grep "n_layer"
```
Note the total layer count, which llama.cpp logs as n_layer (e.g., 80 for Llama-3-70B, 32 for Llama-3-8B).
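
To capture the layer count in a variable instead of reading it off the screen, something like the following may work. It is a sketch: parse_n_layer is a hypothetical helper, and the assumed log line format ("... n_layer = 80") can differ between llama.cpp versions, so check the raw output first.

```shell
#!/usr/bin/env bash
# parse_n_layer (hypothetical helper): reads llama.cpp load output on stdin and
# prints the first integer that follows "n_layer =". The assumed line format,
# e.g. "llm_load_print_meta: n_layer = 80", may differ between versions.
parse_n_layer() {
  sed -nE 's/.*n_layer[[:space:]]*=[[:space:]]*([0-9]+).*/\1/p' | head -n 1
}

# Usage:
# TOTAL_LAYERS=$(./llama-cli -m model.gguf --verbose 2>&1 | parse_n_layer)
```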

Calculate optimal GPU layers.

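# AVAILABLE_VRAM_GB is free VRAM minus about 2 GB reserved for KV cache and overhead
AVAILABLE_VRAM_GB=22
MODEL_SIZE_GB=45
TOTAL_LAYERS=80
# Bash arithmetic is integer-only, so multiply before dividing
# (45 / 80 alone would truncate to 0). Assumes all layers are about the same size.
GPU_LAYERS=$(( AVAILABLE_VRAM_GB * TOTAL_LAYERS / MODEL_SIZE_GB ))
# Never request more layers than the model has
if (( GPU_LAYERS > TOTAL_LAYERS )); then GPU_LAYERS=$TOTAL_LAYERS; fi
echo "Offload $GPU_LAYERS of $TOTAL_LAYERS layers"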
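
The 2 GB reserve is a rule of thumb, and the KV cache grows linearly with context length. The sketch below estimates the cache size with the usual formula for an unquantized (f16) cache, so the reserve can be checked against the context you plan to use. kv_cache_mib is a hypothetical helper; it treats the whole cache as living in VRAM, which is an upper bound when only some layers are offloaded.

```shell
#!/usr/bin/env bash
# kv_cache_mib (hypothetical helper): estimated KV cache size in MiB.
# Arguments: n_layer n_kv_heads head_dim n_ctx [bytes_per_element, default 2 for f16]
# Formula: 2 (K and V) * n_layer * n_kv_heads * head_dim * n_ctx * bytes_per_element
kv_cache_mib() {
  local n_layer=$1 n_kv_heads=$2 head_dim=$3 n_ctx=$4 bytes=${5:-2}
  echo $(( 2 * n_layer * n_kv_heads * head_dim * n_ctx * bytes / 1048576 ))
}

# Llama-3-70B (80 layers, 8 KV heads, head_dim 128) at an 8192-token context:
kv_cache_mib 80 8 128 8192   # prints 2560
```

At that context length the estimate already exceeds 2 GB, so a larger reserve or a shorter context may be needed for the 70B example.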

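Apply the setting at runtime. The commands below use 48 as an example value; substitute the number you calculated for your hardware.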

./llama-cli -m model.gguf --n-gpu-layers 48 -p "Your prompt here"

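Persist the setting in Ollama via a Modelfile. Ollama names this parameter num_gpu. Save the following as Modelfile:

FROM llama3:70b
PARAMETER num_gpu 48

Then build the model from it:

ollama create optimized-70b -f Modelfile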
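
If the layer count comes out of a script, the Modelfile can be generated rather than edited by hand. This is a minimal sketch: write_modelfile is a hypothetical helper, and it uses num_gpu, which is the Ollama parameter for the number of layers sent to the GPU.

```shell
#!/usr/bin/env bash
# write_modelfile (hypothetical helper): writes a two-line Ollama Modelfile.
# Arguments: base_model gpu_layers [output_path, default ./Modelfile]
write_modelfile() {
  local base=$1 layers=$2 out=${3:-Modelfile}
  # Reject anything that is not a non-negative integer
  if ! [[ "$layers" =~ ^[0-9]+$ ]]; then
    echo "gpu_layers must be a non-negative integer" >&2
    return 1
  fi
  printf 'FROM %s\nPARAMETER num_gpu %s\n' "$base" "$layers" > "$out"
}

# Usage:
# write_modelfile llama3:70b 48 && ollama create optimized-70b -f Modelfile
```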

Verification

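./llama-cli -m model.gguf --n-gpu-layers 48 -p "test" -n 1 --no-display-prompt 2>&1 | grep "offloaded"
# Expected for an 80-layer model: a line containing "offloaded 48/81 layers to GPU"
# (llama.cpp counts the output layer, so the total shown is n_layer + 1)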
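
The check above can be turned into a pass/fail step for scripts. The sketch below uses a hypothetical helper, check_offload, and assumes the log line has the form "offloaded N/M layers to GPU"; adjust the pattern if your build words it differently.

```shell
#!/usr/bin/env bash
# check_offload (hypothetical helper): reads llama.cpp load output on stdin and
# succeeds only if it reports exactly the expected number of offloaded layers.
# Assumed line format: "... offloaded N/M layers to GPU".
check_offload() {
  local expected=$1 actual
  actual=$(sed -nE 's/.*offloaded ([0-9]+)\/[0-9]+ layers to GPU.*/\1/p' | head -n 1)
  if [[ "$actual" == "$expected" ]]; then
    echo "OK: $actual layers offloaded"
  else
    echo "MISMATCH: expected $expected, got ${actual:-none}" >&2
    return 1
  fi
}

# Usage:
# ./llama-cli -m model.gguf --n-gpu-layers 48 -p "test" -n 1 2>&1 | check_offload 48
```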

Common failures

VRAM over-commit: If loading fails with an out-of-memory error, lower --n-gpu-layers. Leave 1-2 GB of headroom for the KV cache, and more with long contexts.
Setting too low: With fewer than 10% of layers on the GPU, the speedup is negligible. Aim for more than 30%.
No layer count in the load output: Some GGUF files don't expose the layer count. As a rough estimate, layers ≈ parameters / (hidden_size * intermediate_size * 4).
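
As a sanity check on the layer estimate, the formula can be evaluated for a model whose layer count is known. The sketch below uses a hypothetical helper, estimate_layers, with Llama-3-8B's published dimensions; the result is close to the real value but not exact, which is the accuracy to expect from this approximation.

```shell
#!/usr/bin/env bash
# estimate_layers (hypothetical helper): applies
#   layers ≈ parameters / (hidden_size * intermediate_size * 4)
# and prints the result rounded to the nearest integer.
# Arguments: parameters hidden_size intermediate_size
estimate_layers() {
  awk -v p="$1" -v h="$2" -v i="$3" 'BEGIN { printf "%.0f\n", p / (h * i * 4) }'
}

# Llama-3-8B: about 8.03e9 parameters, hidden_size 4096, intermediate_size 14336
estimate_layers 8030000000 4096 14336   # prints 34 (the real model has 32 layers)
```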

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
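
One way to keep these notes consistent is to append them to a small log file. The sketch below is only an illustration: record_checkpoint, its argument order, and the default file name are all hypothetical.

```shell
#!/usr/bin/env bash
# record_checkpoint (hypothetical helper): appends one tab-separated line with a
# UTC timestamp followed by the five values passed in.
# Arguments: runtime_version model hardware gpu_layers verification_output [log_path]
record_checkpoint() {
  local runtime=$1 model=$2 hardware=$3 layers=$4 result=$5
  local log=${6:-gpu-layers-checkpoints.tsv}
  printf '%s\t%s\t%s\t%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    "$runtime" "$model" "$hardware" "$layers" "$result" >> "$log"
}

# Usage (fill in the values observed on your machine):
# record_checkpoint "$(./llama-cli --version 2>&1 | head -n 1)" model.gguf "24 GB NVIDIA GPU" 48 "offloaded 48/81 layers to GPU"
```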
