How to set GPU layers to optimize memory usage
Applies to: llama.cpp or a compatible runtime that supports the -ngl flag
What this does
The --n-gpu-layers (-ngl) flag controls how many model layers run on GPU versus CPU. Setting this value correctly maximizes throughput while avoiding out-of-memory errors.
Steps
Query available VRAM before loading.
nvidia-smi --query-gpu=memory.total,memory.free --format=csv
Find the model's total layer count.
./llama-cli -m model.gguf --verbose 2>&1 | grep "n_layer"
Note the total layers (e.g., 80 for Llama-3-70B, 32 for Llama-3-8B).
Calculate optimal GPU layers.
# Free VRAM minus a 2 GB reserve for KV cache and overhead
AVAILABLE_VRAM_GB=22
MODEL_SIZE_GB=45
TOTAL_LAYERS=80
# Work in MB: with shell integer math, 45 / 80 rounds down to 0 and the next division fails
LAYER_MEM_MB=$((MODEL_SIZE_GB * 1024 / TOTAL_LAYERS))
GPU_LAYERS=$((AVAILABLE_VRAM_GB * 1024 / LAYER_MEM_MB))
echo "Offload $GPU_LAYERS of $TOTAL_LAYERS layers"
Apply the setting at runtime.
./llama-cli -m model.gguf --n-gpu-layers 48 -p "Your prompt here"
Persist in Ollama via Modelfile.
FROM llama3:70b
PARAMETER num_gpu 48
Then create the model from the Modelfile:
ollama create optimized-70b -f Modelfile
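The steps above can be folded into one small helper. This is a minimal sketch, assuming all layers are roughly the same size and that the model file size approximates its memory footprint. The function name `calc_gpu_layers` and its arguments are hypothetical and not part of llama.cpp or Ollama.

```shell
#!/bin/sh
# Sketch only: estimate how many layers fit in free VRAM.
# Assumption: every layer takes about MODEL_SIZE / TOTAL_LAYERS of memory.

# calc_gpu_layers FREE_VRAM_MIB MODEL_SIZE_MIB TOTAL_LAYERS [RESERVE_MIB]
calc_gpu_layers() (
  free_mib=$1; model_mib=$2; total_layers=$3; reserve_mib=${4:-2048}
  usable_mib=$(( free_mib - reserve_mib ))
  # Nothing fits if the reserve already exceeds free VRAM
  if [ "$usable_mib" -le 0 ]; then echo 0; exit 0; fi
  # usable / (model / layers), rearranged so the per-layer size is not truncated
  layers=$(( usable_mib * total_layers / model_mib ))
  # Never request more layers than the model has
  if [ "$layers" -gt "$total_layers" ]; then layers=$total_layers; fi
  echo "$layers"
)

# Example: 24 GiB free, 45 GiB model, 80 layers, default 2 GiB reserve
calc_gpu_layers 24576 46080 80   # prints 39
```

To use live numbers, the first argument could come from `nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits`, which reports free memory in MiB.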
Verification
./llama-cli -m model.gguf --n-gpu-layers 48 -p "test" --no-display-prompt 2>&1 | grep "load_tensors"
# Expected: a line containing "offloaded 48/80 layers to GPU" (some llama.cpp builds also count the output layer, so the total may read 81)
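This check can also be scripted. The sketch below uses two hypothetical helpers, `offloaded_layers` and `check_offload`, to pull the offloaded count out of a log line shaped like the expected output. The exact log wording may differ between llama.cpp versions, so treat the pattern as an assumption.

```shell
#!/bin/sh
# offloaded_layers: read log text on stdin, print N from the first "offloaded N/M layers" line
offloaded_layers() {
  sed -n 's|.*offloaded \([0-9][0-9]*\)/[0-9][0-9]* layers.*|\1|p' | head -n 1
}

# check_offload EXPECTED: succeed only if the log on stdin reports EXPECTED offloaded layers
check_offload() {
  actual=$(offloaded_layers)
  [ "$actual" = "$1" ]
}

# Example with a sample line in the shape of the expected output
echo 'llm_load_tensors: offloaded 48/80 layers to GPU' | check_offload 48 && echo "offload OK"
```

In practice the log would be piped in from the llama-cli command (with `2>&1`) instead of the sample `echo`.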
Common failures
- VRAM over-commit: Leave 1-2 GB headroom for KV cache, especially with long contexts.
- Setting too low: Fewer than 10% of layers on GPU yields negligible speedup. Aim for > 30%.
- No n_layer in model metadata: Some GGUF files don't expose the layer count. Estimate: layers ≈ parameters / (hidden_size * intermediate_size * 4).
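The rules of thumb in this list can be sketched as two small helpers. Both function names are hypothetical. The example inputs (about 8.03 billion parameters, hidden size 4096, intermediate size 14336) are assumed values for a Llama-3-8B-like model, not figures from this guide.

```shell
#!/bin/sh
# estimate_layers PARAMS HIDDEN_SIZE INTERMEDIATE_SIZE
# Applies layers ≈ parameters / (hidden_size * intermediate_size * 4), rounded to nearest
estimate_layers() {
  awk -v p="$1" -v h="$2" -v i="$3" 'BEGIN { printf "%d\n", p / (h * i * 4) + 0.5 }'
}

# offload_rating GPU_LAYERS TOTAL_LAYERS
# Prints the share of layers on GPU and how the thresholds above rate it
offload_rating() {
  pct=$(( $1 * 100 / $2 ))
  if [ "$pct" -lt 10 ]; then echo "$pct% negligible"
  elif [ "$pct" -le 30 ]; then echo "$pct% below target"
  else echo "$pct% ok"
  fi
}

estimate_layers 8030000000 4096 14336   # prints 34
offload_rating 48 80                    # prints "60% ok"
```

The estimate lands near, but not on, the 32 layers quoted earlier for Llama-3-8B, so it is best treated as a starting point and confirmed against the load log.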
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.