HOW-TO · INF

How to offload models to CPU when GPU memory is insufficient

Intermediate · 10 min · By Fredoline Eruo
PREREQUISITES

Ollama or llama.cpp installed, and a model that is too large to fit in VRAM

What this does

When a model exceeds available VRAM, offloading layers to system RAM allows inference to proceed. This guide covers full and partial CPU offloading with llama.cpp and Ollama.
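For the partial case, llama.cpp's --n-gpu-layers accepts any value between 0 and the model's layer count, and the remaining layers stay in system RAM. The sketch below estimates a starting value. The helper name estimate_gpu_layers is hypothetical (it is not part of llama.cpp), it assumes every layer is roughly the same size, and the 1024 MiB default reserve is an assumed allowance rather than a measured figure.

```shell
#!/bin/sh
# Hypothetical helper (not part of llama.cpp): estimate how many layers fit in VRAM.
# Assumes all layers are roughly equal in size, which is only an approximation.
# Usage: estimate_gpu_layers MODEL_MIB N_LAYERS FREE_VRAM_MIB [RESERVE_MIB]
estimate_gpu_layers() {
  model_mib=$1
  n_layers=$2
  free_vram_mib=$3
  reserve_mib=${4:-1024}   # assumed headroom for KV cache and scratch buffers

  # Per-layer size, rounded up so the estimate errs on the small side.
  per_layer_mib=$(( (model_mib + n_layers - 1) / n_layers ))
  usable_mib=$(( free_vram_mib - reserve_mib ))

  if [ "$usable_mib" -le 0 ]; then
    echo 0
    return 0
  fi

  fit=$(( usable_mib / per_layer_mib ))
  if [ "$fit" -gt "$n_layers" ]; then
    fit=$n_layers
  fi
  echo "$fit"
}

# Example: a 40000 MiB model with 80 layers and 24576 MiB of free VRAM.
# ./llama-cli -m model.gguf --n-gpu-layers "$(estimate_gpu_layers 40000 80 24576)" --threads 8
```

Treat the result as a first guess: if loading fails with an out-of-memory error, lower the layer count and retry.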

Steps

  1. Run with zero GPU layers to offload the entire model to CPU.

    ./llama-cli -m model.gguf --n-gpu-layers 0 --threads 8
    

    Expected: The model loads entirely in system RAM. CPU usage spikes during inference.

  2. For Ollama, create a Modelfile that disables GPU offloading.

    FROM llama3:70b
    PARAMETER num_gpu 0
    

    Build and run:

    ollama create cpu-only-70b -f Modelfile
    ollama run cpu-only-70b
    
  3. Tune CPU thread count to maximize throughput.

    ./llama-cli -m model.gguf --n-gpu-layers 0 --threads 16 --threads-batch 16
    

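    Set --threads to the number of physical CPU cores, not logical (hyper-threaded) threads; going beyond the physical count usually lowers throughput. --threads-batch sets the thread count for prompt and batch processing and defaults to the value of --threads.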

  4. Monitor resource usage to confirm offload.

    watch -n 1 nvidia-smi   # terminal 1: GPU memory should remain flat
    htop                    # terminal 2: CPU cores should show high utilization
    
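Step 3 depends on knowing the physical core count. The following is a minimal sketch for finding it, assuming Linux with lscpu or macOS with sysctl; the function names count_physical_cores and physical_cores are illustrative, not part of any tool.

```shell
#!/bin/sh
# Count unique (core, socket) pairs from `lscpu -p=Core,Socket` output on stdin.
# Hyper-threaded siblings share the same pair, so each physical core is counted once.
count_physical_cores() {
  grep -v '^#' | sort -u | wc -l | tr -d ' '
}

# Print the physical core count for the current machine.
physical_cores() {
  if command -v lscpu >/dev/null 2>&1; then
    lscpu -p=Core,Socket | count_physical_cores
  elif [ "$(uname)" = "Darwin" ]; then
    sysctl -n hw.physicalcpu
  else
    # Fallback: this reports logical CPUs, so it may overcount.
    getconf _NPROCESSORS_ONLN
  fi
}

# Example:
# ./llama-cli -m model.gguf --n-gpu-layers 0 --threads "$(physical_cores)"
```

On hybrid CPUs with performance and efficiency cores, the best value may be lower than the total physical count, so it is worth timing a few settings.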

Verification

nvidia-smi --query-gpu=memory.used --format=csv,noheader
# Expected: GPU memory stays near idle levels. A CUDA-enabled build may still allocate a small amount for its context, but not the model weights.
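
The check above can be turned into a before-and-after comparison. This is a sketch only: gpu_memory_flat is a hypothetical helper, and the 200 MiB default tolerance is an assumed allowance for driver and desktop noise, not a documented threshold.

```shell
#!/bin/sh
# Hypothetical check: did GPU memory stay flat across the model load?
# Arguments are MiB figures as printed by:
#   nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
# Usage: gpu_memory_flat BEFORE_MIB AFTER_MIB [TOLERANCE_MIB]
gpu_memory_flat() {
  before_mib=$1
  after_mib=$2
  tolerance_mib=${3:-200}   # assumed allowance, adjust for your system

  delta=$(( after_mib - before_mib ))
  if [ "$delta" -le "$tolerance_mib" ]; then
    echo "ok: GPU memory grew by ${delta} MiB"
  else
    echo "fail: GPU memory grew by ${delta} MiB"
    return 1
  fi
}

# Example for a single GPU (--id=0 selects the first device):
# before=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits --id=0)
# ... start the model in another terminal and wait for it to load ...
# after=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits --id=0)
# gpu_memory_flat "$before" "$after"
```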

Common failures

  • System RAM exhausted: Ensure system RAM exceeds the model file size plus about 20% overhead for the KV cache. The KV cache grows with context length, so long contexts need more headroom.
  • Extremely slow inference: Increase --threads up to your physical core count. Enable --mlock to prevent swapping.
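  • Ollama still uses GPU: Ollama's name for this setting is num_gpu, not llama.cpp's n_gpu_layers. Check that PARAMETER num_gpu 0 appears in the output of ollama show cpu-only-70b --modelfile, then run ollama ps while the model is loaded and confirm the PROCESSOR column reads 100% CPU.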
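
The RAM rule of thumb from the first item can be checked before loading. A minimal sketch, with hypothetical helper names; the commented example assumes Linux (GNU stat and /proc/meminfo).

```shell
#!/bin/sh
# Hypothetical helpers: apply the "model size + 20%" rule of thumb.
# All values are in MiB.

# Required RAM = model size plus 20% overhead.
required_ram_mib() {
  model_mib=$1
  echo $(( model_mib + model_mib / 5 ))
}

# Succeeds (exit 0) when the given RAM exceeds the requirement.
ram_is_sufficient() {
  model_mib=$1
  ram_mib=$2
  [ "$ram_mib" -gt "$(required_ram_mib "$model_mib")" ]
}

# Example on Linux (MemTotal is installed RAM; MemAvailable is a stricter check):
# model_mib=$(( $(stat -c %s model.gguf) / 1048576 ))
# total_mib=$(( $(awk '/^MemTotal:/ {print $2}' /proc/meminfo) / 1024 ))
# if ram_is_sufficient "$model_mib" "$total_mib"; then echo "enough RAM"; else echo "not enough RAM"; fi
```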

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
