What this does

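When a model does not fit in available VRAM, keeping some or all of its layers in system RAM and running them on the CPU lets inference proceed, at the cost of speed. This guide covers full CPU inference with llama.cpp and Ollama; partial offloading in llama.cpp uses the same --n-gpu-layers flag with a nonzero layer count.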

Steps

Run llama.cpp with zero GPU layers so the entire model stays on the CPU.
```
./llama-cli -m model.gguf --n-gpu-layers 0 --threads 8
```
Expected: The model loads entirely in system RAM. CPU usage spikes during inference.
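
For partial offloading, the same flag takes a nonzero layer count. One rough way to pick a starting value is sketched below. It assumes every layer is the same size and reserves a fixed amount of VRAM for the KV cache and scratch buffers. The function name estimate_gpu_layers, the 1024 MiB default reserve, and the sample numbers are hypothetical and not part of llama.cpp.

```shell
#!/usr/bin/env bash
# Rough heuristic (an assumption, not llama.cpp's own logic):
# treat all layers as equal in size and hold back some VRAM.
# Usage: estimate_gpu_layers MODEL_MIB N_LAYERS FREE_VRAM_MIB [RESERVE_MIB]
estimate_gpu_layers() {
  local model_mib=$1 n_layers=$2 free_mib=$3 reserve_mib=${4:-1024}
  local usable=$(( free_mib - reserve_mib ))

  # Nothing left after the reserve: keep every layer on the CPU.
  if (( usable <= 0 )); then
    echo 0
    return
  fi

  # Layers that fit = usable VRAM / (model size / layer count), rounded down.
  local fit=$(( usable * n_layers / model_mib ))

  # Never report more layers than the model has.
  if (( fit > n_layers )); then
    fit=$n_layers
  fi
  echo "$fit"
}
```

With a hypothetical 40000 MiB model of 80 layers and 24576 MiB of free VRAM, `estimate_gpu_layers 40000 80 24576` prints 47, which could be passed as --n-gpu-layers 47. Lower the value if loading fails with an out-of-memory error.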

For Ollama, create a Modelfile that disables GPU offloading. Ollama's parameter for the number of GPU layers is num_gpu, not n_gpu_layers.
```
FROM llama3:70b
PARAMETER num_gpu 0
```

Build and run:
```
ollama create cpu-only-70b -f Modelfile
ollama run cpu-only-70b
```
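
While the model is loaded, `ollama ps` can confirm where it is running. The sketch below parses that output and assumes the layout of recent Ollama releases, where the PROCESSOR column reads "100% CPU" for a CPU-only model. The helper name is_cpu_only is hypothetical, and the parsing may need adjusting if your version prints the table differently.

```shell
#!/usr/bin/env bash
# is_cpu_only MODEL
# Reads `ollama ps` output on stdin. Succeeds (exit 0) only if a row whose
# NAME is MODEL (optionally followed by ":tag") also contains "100% CPU".
# Assumption: the PROCESSOR column uses the text "100% CPU".
is_cpu_only() {
  local model=$1
  awk -v m="$model" '
    $1 ~ ("^" m "(:|$)") && /100% CPU/ { found = 1 }
    END { exit !found }
  '
}
```

Usage: `ollama ps | is_cpu_only cpu-only-70b && echo "running on CPU only"`.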

Tune CPU thread count to maximize throughput.
```
./llama-cli -m model.gguf --n-gpu-layers 0 --threads 16 --threads-batch 16
```
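Set --threads to the number of physical CPU cores (not logical threads) for best performance. --threads-batch sets the thread count used for prompt and batch processing and can usually match it.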
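
On Linux, the physical core count can be derived from lscpu rather than guessed. The following is a minimal sketch, assuming lscpu from util-linux is available. The helper name count_physical_cores is hypothetical.

```shell
#!/usr/bin/env bash
# count_physical_cores
# Reads the output of `lscpu -p=Core,Socket` on stdin and prints the number
# of unique (core, socket) pairs. Hyperthreads share a pair with their
# sibling, so they are counted once.
count_physical_cores() {
  grep -v '^#' | sort -u | wc -l | tr -d ' '
}
```

Usage: `cores=$(lscpu -p=Core,Socket | count_physical_cores)`, then pass `--threads "$cores"` to llama-cli.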

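Monitor resource usage to confirm the model is running on the CPU. Run each command in its own terminal while inference is in progress.
```
nvidia-smi -l 1   # refreshes every second; GPU memory should remain flat
htop              # CPU cores should show high utilization
```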

Verification

```
nvidia-smi --query-gpu=memory.used --format=csv,noheader
# Expected: GPU memory stays near idle levels (no increase the size of the model).
# CUDA-enabled builds may still reserve a small amount for the CUDA context.
```
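
The check above can be turned into a before-and-after comparison. This is a sketch only: the helper names and the 512 MiB default tolerance are assumptions chosen for illustration, and read_gpu_mem_mib reports the first GPU only.

```shell
#!/usr/bin/env bash
# read_gpu_mem_mib: prints used memory of the first GPU in MiB.
read_gpu_mem_mib() {
  nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1
}

# gpu_mem_flat BEFORE_MIB AFTER_MIB [TOLERANCE_MIB]
# Succeeds if GPU memory grew by no more than the tolerance.
# The 512 MiB default is an arbitrary allowance for driver/context overhead.
gpu_mem_flat() {
  local before=$1 after=$2 tol=${3:-512}
  (( after - before <= tol ))
}

# Example flow (run manually):
#   before=$(read_gpu_mem_mib)
#   ... load the model and start generating ...
#   after=$(read_gpu_mem_mib)
#   gpu_mem_flat "$before" "$after" && echo "model is not in VRAM"
```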

Common failures

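System RAM exhausted: Ensure available RAM exceeds the model file size plus headroom for the KV cache and runtime buffers. 20% is a rough starting point; the KV cache grows with context length, so long contexts need more.
Extremely slow inference: Set --threads to your physical core count; going beyond it usually does not help. Enable --mlock to keep the model resident in RAM and prevent swapping.
Ollama still uses GPU: Check that the Modelfile sets num_gpu 0; Ollama does not recognize n_gpu_layers. Inspect the built model with ollama show --modelfile cpu-only-70b, and check the PROCESSOR column of ollama ps while the model is loaded.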
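
The RAM rule of thumb can be checked before loading. The sketch below does the arithmetic only. The helper names are hypothetical, and the 20% default is the same rough overhead figure used in this guide, not a measured value.

```shell
#!/usr/bin/env bash
# required_ram_mib MODEL_MIB [OVERHEAD_PCT]
# Prints the model size plus a percentage overhead (default 20%).
required_ram_mib() {
  local model_mib=$1 pct=${2:-20}
  echo $(( model_mib + model_mib * pct / 100 ))
}

# fits_in_ram MODEL_MIB AVAILABLE_MIB [OVERHEAD_PCT]
# Succeeds if the padded model size fits in the available RAM.
fits_in_ram() {
  local need
  need=$(required_ram_mib "$1" "${3:-20}")
  (( need <= $2 ))
}

# Example inputs on Linux (run manually):
#   model_mib=$(( $(stat -c %s model.gguf) / 1048576 ))
#   avail_mib=$(awk '/MemAvailable/ { print int($2 / 1024) }' /proc/meminfo)
#   fits_in_ram "$model_mib" "$avail_mib" || echo "not enough RAM"
```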

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
