How to offload models to CPU when GPU memory is insufficient
Prerequisites: Ollama or llama.cpp installed, and a model too large for available VRAM
What this does
When a model exceeds available VRAM, offloading layers to system RAM allows inference to proceed. This guide covers full and partial CPU offloading with llama.cpp and Ollama.
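For partial offloading, a rough layer count can be estimated before the first run. The sketch below assumes every layer is about the same size (model file size divided by layer count) and ignores the embedding and output tensors, so treat the result as a starting point rather than a guarantee. The function name, the reserve value, and all numbers are hypothetical and only illustrate the arithmetic.

```shell
#!/bin/sh
# Estimate how many layers fit in VRAM for partial offloading.
# Assumption: all layers are roughly the same size.
# Arguments (all in MiB except the layer count):
#   $1 model file size, $2 number of layers, $3 free VRAM, $4 VRAM to hold back
estimate_gpu_layers() (
  model_mib=$1
  n_layers=$2
  free_vram_mib=$3
  reserve_mib=$4   # headroom for KV cache and scratch buffers (a guess)

  usable=$(( free_vram_mib - reserve_mib ))
  if [ "$usable" -le 0 ]; then
    # Nothing fits: keep every layer on the CPU.
    echo 0
  else
    fit=$(( usable * n_layers / model_mib ))
    # Never report more layers than the model has.
    if [ "$fit" -gt "$n_layers" ]; then
      fit=$n_layers
    fi
    echo "$fit"
  fi
)

# Illustrative numbers: 40000 MiB model, 80 layers, 24000 MiB free, 2000 MiB reserved.
estimate_gpu_layers 40000 80 24000 2000   # prints 44

# The result could then be passed to llama.cpp, for example:
#   ./llama-cli -m model.gguf --n-gpu-layers 44 --threads 8
```

If loading still fails with an out-of-memory error, lowering the layer count by a few and retrying is usually the quickest fix.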
Steps
1. Run with zero GPU layers to offload the entire model to CPU.
       ./llama-cli -m model.gguf --n-gpu-layers 0 --threads 8
   Expected: The model loads entirely in system RAM. CPU usage spikes during inference.
2. For Ollama, create a Modelfile that disables GPU offloading by setting num_gpu (the number of layers sent to the GPU) to 0.
       FROM llama3:70b
       PARAMETER num_gpu 0
   Build and run:
       ollama create cpu-only-70b -f Modelfile
       ollama run cpu-only-70b
3. Tune CPU thread count to maximize throughput.
       ./llama-cli -m model.gguf --n-gpu-layers 0 --threads 16 --threads-batch 16
   Set --threads to the number of physical CPU cores (not logical threads) for best performance.
4. Monitor resource usage to confirm offload.
       nvidia-smi -l 1   # in one terminal: GPU memory should remain flat
       htop              # in a second terminal: CPU cores should show high utilization
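For the thread-tuning step, the physical core count can be read from lscpu instead of guessed. This is a minimal sketch assuming a Linux host with util-linux installed; the helper name count_physical_cores is hypothetical. It counts unique core/socket pairs, so hyperthreaded siblings are only counted once.

```shell
#!/bin/sh
# Count physical cores from `lscpu -p=Core,Socket` output supplied on stdin.
# Lines starting with '#' are headers; each unique "core,socket" pair is one core.
count_physical_cores() {
  grep -v '^#' | sort -u | wc -l | tr -d ' '
}

# Only runs where lscpu is available (typically Linux).
if command -v lscpu >/dev/null 2>&1; then
  cores=$(lscpu -p=Core,Socket | count_physical_cores)
  echo "Physical cores: $cores"
  echo "Suggested command:"
  echo "./llama-cli -m model.gguf --n-gpu-layers 0 --threads $cores --threads-batch $cores"
fi
```

Parsing is kept separate from the lscpu call so the counting logic can be checked against saved output from another machine.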
Verification
nvidia-smi --query-gpu=memory.used --format=csv,noheader
# Expected: GPU memory stays at idle levels (no increase from model load)
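The "stays at idle levels" check can be made explicit by comparing a reading taken before the model loads with one taken afterwards. The sketch below is one way to do that; the helper names and the 64 MiB tolerance are assumptions, not values from any tool's documentation.

```shell
#!/bin/sh
# Succeeds if GPU memory grew by no more than the given tolerance.
# Arguments: $1 baseline MiB, $2 current MiB, $3 tolerance MiB
gpu_memory_flat() (
  baseline_mib=$1
  current_mib=$2
  tolerance_mib=$3
  [ $(( current_mib - baseline_mib )) -le "$tolerance_mib" ]
)

# Read used memory of the first GPU as a bare number of MiB.
read_gpu_used_mib() {
  nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1 | tr -d ' '
}

# Suggested usage (run manually; the tolerance is a guess):
#   baseline=$(read_gpu_used_mib)      # before loading the model
#   ... start llama-cli or ollama run in another terminal ...
#   current=$(read_gpu_used_mib)       # while the model is loaded
#   if gpu_memory_flat "$baseline" "$current" 64; then
#     echo "OK: model is running on CPU"
#   else
#     echo "GPU memory grew: some layers are still on the GPU"
#   fi
```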
Common failures
- System RAM exhausted: Ensure available RAM exceeds the model file size plus about 20% overhead for the KV cache.
- Extremely slow inference: Increase --threads up to your physical core count. Enable --mlock to prevent swapping.
- Ollama still uses GPU: Set num_gpu 0 in the Modelfile and rebuild the model with ollama create. Verify with ollama show cpu-only-70b.
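The RAM rule of thumb above (model file size plus 20%) can be checked before loading. This is a rough sketch assuming Linux, GNU stat, and /proc/meminfo; the function name ram_sufficient is hypothetical, and the 20% figure is the guide's estimate rather than a measured value.

```shell
#!/bin/sh
# Succeeds if available RAM covers the model file plus 20% overhead.
# Arguments: $1 model size in bytes, $2 available RAM in KiB
ram_sufficient() (
  model_bytes=$1
  avail_kib=$2
  # Convert to KiB, then add 20% using integer arithmetic.
  needed_kib=$(( model_bytes / 1024 * 120 / 100 ))
  [ "$avail_kib" -ge "$needed_kib" ]
)

# Runs only if model.gguf exists in the current directory on a Linux host.
if [ -f model.gguf ] && [ -r /proc/meminfo ]; then
  model_bytes=$(stat -c %s model.gguf)
  avail_kib=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
  if ram_sufficient "$model_bytes" "$avail_kib"; then
    echo "RAM looks sufficient for a CPU-only run"
  else
    echo "RAM is likely insufficient: expect swapping or a failed load"
  fi
fi
```

Longer context windows enlarge the KV cache, so a larger margin may be needed than this check assumes.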
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
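One way to capture the checkpoint is to append a small record to a log file. The sketch below is only an illustration: the function name, the field names, and the file name checkpoint.log are all hypothetical, and the values are whatever you pass in.

```shell
#!/bin/sh
# Append one checkpoint record to a log file.
# Arguments: $1 log file, $2 runtime and version, $3 model, $4 hardware, $5 verification output
write_checkpoint() (
  log_file=$1
  {
    printf 'date: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    printf 'runtime: %s\n' "$2"
    printf 'model: %s\n' "$3"
    printf 'hardware: %s\n' "$4"
    printf 'verification: %s\n' "$5"
    printf '%s\n' '---'
  } >> "$log_file"
)

# Suggested usage (run manually after the verification step):
#   write_checkpoint checkpoint.log \
#     "$(ollama --version)" \
#     "cpu-only-70b" \
#     "$(uname -m), CPU only" \
#     "$(nvidia-smi --query-gpu=memory.used --format=csv,noheader)"
```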