© 2026 runlocalai.co · Independently operated
How to set GPU layers to optimize memory usage

Intermediate · 10 min · By Fredoline Eruo
PREREQUISITES

llama.cpp or a compatible runtime that supports the -ngl flag

What this does

The --n-gpu-layers (-ngl) flag controls how many model layers run on GPU versus CPU. Setting this value correctly maximizes throughput while avoiding out-of-memory errors.
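To make the GPU/CPU split concrete, here is a minimal sketch that estimates where the weights end up for a given -ngl value. The function name `gpu_split` is hypothetical, and the sketch assumes every layer is the same size, which ignores embeddings and output tensors, so treat the numbers as approximate.

```shell
#!/usr/bin/env bash
# Estimate how a given -ngl value splits model weights between GPU and CPU.
# Assumption: all layers are the same size (embeddings and output tensors ignored).
gpu_split() {
  local model_gb=$1 total_layers=$2 ngl=$3
  awk -v m="$model_gb" -v t="$total_layers" -v n="$ngl" 'BEGIN {
    if (n > t) n = t                 # cannot offload more layers than exist
    gpu = m * n / t                  # weights resident in VRAM
    printf "GPU %.1f GB, CPU %.1f GB\n", gpu, m - gpu
  }'
}

# 45 GB model, 80 layers, 39 layers offloaded
gpu_split 45 80 39
```

With these sample values the sketch prints `GPU 21.9 GB, CPU 23.1 GB`. The KV cache and runtime buffers come on top of the GPU figure.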

Steps

  1. Query available VRAM before loading.

    nvidia-smi --query-gpu=memory.total,memory.free --format=csv
    
  2. Find the model's total layer count.

    ./llama-cli -m model.gguf --verbose 2>&1 | grep "n_layer"
    

    Note the total layers (e.g., 80 for Llama-3-70B, 32 for Llama-3-8B). llama.cpp logs this value as n_layer. On Windows cmd, use findstr in place of grep.

  3. Calculate optimal GPU layers.

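    # Reserve 2 GB for KV cache and overhead (e.g., 24 GB card -> 22 GB usable)
    AVAILABLE_VRAM_GB=22
    MODEL_SIZE_GB=45
    TOTAL_LAYERS=80
    # Bash arithmetic is integer-only: MODEL_SIZE_GB / TOTAL_LAYERS would
    # truncate to 0, so multiply before dividing.
    GPU_LAYERS=$(( AVAILABLE_VRAM_GB * TOTAL_LAYERS / MODEL_SIZE_GB ))
    # Never request more layers than the model has
    if [ "$GPU_LAYERS" -gt "$TOTAL_LAYERS" ]; then GPU_LAYERS=$TOTAL_LAYERS; fi
    echo "Offload $GPU_LAYERS of $TOTAL_LAYERS layers"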
    
  4. Apply the setting at runtime, replacing 48 with the value you calculated in step 3.

    ./llama-cli -m model.gguf --n-gpu-layers 48 -p "Your prompt here"
    
  5. Persist in Ollama via Modelfile.

    FROM llama3:70b
    PARAMETER num_gpu 48
    
    ollama create optimized-70b -f Modelfile
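The steps above can be combined into one small helper. This is a sketch under stated assumptions: `layers_that_fit` is a hypothetical name, sizes are in MiB to match what nvidia-smi reports, the default headroom of 2048 MiB mirrors the 2 GB reserve in step 3, and layers are treated as equal in size.

```shell
#!/usr/bin/env bash
# Sketch: turn a free-VRAM reading (MiB) into an -ngl value.
# Hypothetical helper; assumes uniform layer size and a fixed headroom.
layers_that_fit() {
  local free_mib=$1 model_mib=$2 total_layers=$3 headroom_mib=${4:-2048}
  local usable=$(( free_mib - headroom_mib ))
  if [ "$usable" -le 0 ]; then echo 0; return; fi
  # Multiply before dividing so integer arithmetic does not truncate to zero.
  local n=$(( usable * total_layers / model_mib ))
  # Clamp to the model's layer count.
  if [ "$n" -gt "$total_layers" ]; then n=$total_layers; fi
  echo "$n"
}

# Sample values: 24576 MiB free, 46080 MiB (45 GiB) model, 80 layers.
layers_that_fit 24576 46080 80

# On a machine with an NVIDIA GPU, the first argument could come from:
#   nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n 1
```

For the sample values this prints 39. Start from that number and lower it if loading fails or if you plan to use a long context.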
    

Verification

./llama-cli -m model.gguf --n-gpu-layers 48 -p "test" --no-display-prompt 2>&1 | grep "offloaded"
# Expected for an 80-layer model: "offloaded 48/81 layers to GPU" (llama.cpp counts the output layer in the total)
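If you want the check to pass or fail on its own, the log line can be parsed instead of read by eye. The sketch below assumes the log contains a line shaped like "offloaded N/M layers to GPU". The function names are hypothetical and the sample line is illustrative, not captured from a real run. llama.cpp typically counts the output layer in M, so the total may read one higher than the model's layer count.

```shell
#!/usr/bin/env bash
# Sketch: check that the runtime offloaded as many layers as requested.
# Assumes the log contains a line shaped like "offloaded N/M layers to GPU".
offloaded_layers() {
  # Print N from the first matching line on stdin; print nothing if absent.
  sed -n 's/.*offloaded \([0-9][0-9]*\)\/[0-9][0-9]* layers to GPU.*/\1/p' | head -n 1
}

check_offload() {
  local requested=$1 actual
  actual=$(offloaded_layers)
  if [ -z "$actual" ]; then echo "no offload line found"; return 2; fi
  if [ "$actual" -eq "$requested" ]; then echo "ok: $actual layers on GPU"; return 0; fi
  echo "mismatch: requested $requested, got $actual"; return 1
}

# Sample log line for illustration only (not captured from a real run).
echo "llm_load_tensors: offloaded 48/81 layers to GPU" | check_offload 48

# Possible real usage:
#   ./llama-cli -m model.gguf --n-gpu-layers 48 -p "test" 2>&1 | check_offload 48
```

A mismatch is not always an error: requesting more layers than the model has usually results in every layer being offloaded, so the reported count is lower than the request.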

Common failures

  • VRAM over-commit: Leave 1-2 GB headroom for KV cache, especially with long contexts.
  • Setting too low: Offloading fewer than 10% of layers yields negligible speedup. Aim for more than 30%.
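  • No layer count in the output: Some GGUF files don't expose layer count. Estimate: layers ≈ parameters / (hidden_size * intermediate_size * 4). This is a rough figure for Llama-style models; for Llama-3-8B it gives about 34 against an actual 32.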
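Two of the failure modes above come down to arithmetic, sketched here. The KV cache formula (2 for K and V, times layers, KV heads, head dimension, context length, and bytes per element) is the standard size for an unquantized cache, and the layer estimate is the one given in the list. The function names are hypothetical, and the model shapes are those of Llama-3-8B (32 layers, 8 KV heads, head dimension 128, hidden size 4096, intermediate size 14336) with a rounded parameter count.

```shell
#!/usr/bin/env bash
# KV cache size in MiB; bytes per element defaults to 2 (f16 cache).
# Formula: 2 (K and V) * layers * kv_heads * head_dim * context * bytes.
kv_cache_mib() {
  local layers=$1 kv_heads=$2 head_dim=$3 ctx=$4 bytes=${5:-2}
  echo $(( 2 * layers * kv_heads * head_dim * ctx * bytes / 1024 / 1024 ))
}

# Rough layer count: parameters / (hidden_size * intermediate_size * 4).
estimate_layers() {
  local params=$1 hidden=$2 intermediate=$3
  echo $(( params / (hidden * intermediate * 4) ))
}

# Llama-3-8B-style shapes: 32 layers, 8 KV heads, head_dim 128.
kv_cache_mib 32 8 128 8192        # 8k context
kv_cache_mib 32 8 128 32768       # 32k context
# Roughly 8 billion parameters, hidden 4096, intermediate 14336.
estimate_layers 8000000000 4096 14336
```

The sketch prints 1024, 4096, and 34. The first two show why a 1-2 GB reserve that is fine at 8k context falls short at 32k, and the third shows that the layer estimate is only approximate (the real model has 32 layers).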

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
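One way to keep such a record is a plain-text log with one entry per test. The sketch below is only an illustration: the function name, the field labels, and the file layout are hypothetical, not a site convention.

```shell
#!/usr/bin/env bash
# Sketch: append one checkpoint record to a plain-text log.
# The field labels and layout are hypothetical.
write_checkpoint() {
  local file=$1 runtime=$2 model=$3 hardware=$4 verification=$5
  {
    echo "date: $(date -u +%Y-%m-%d)"
    echo "runtime: $runtime"
    echo "model: $model"
    echo "hardware: $hardware"
    echo "verification: $verification"
    echo "---"
  } >> "$file"
}

# Example call with placeholder values; replace them with your own observations:
#   write_checkpoint checkpoint.log "llama.cpp <build>" "model.gguf" \
#     "<GPU, VRAM>" "<verification output line>"
```

Appending instead of overwriting keeps earlier runs available for comparison when you change the model, the driver, or the -ngl value.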

Related guides

  • How to configure partial GPU offloading for large models
  • How to monitor CPU and GPU memory during inference