CPU Offload

CPU offload is a technique where part of a neural network model runs on the CPU instead of the GPU, typically because the model exceeds available VRAM. During inference, the runtime splits the model: some layers run on the GPU, others on the CPU, with data passed between them over PCIe. This makes it possible to run larger models than the GPU alone can hold, but at significantly lower speed: tokens per second can drop by 10x to 100x depending on the offload ratio and system memory bandwidth.

Deeper dive

When a model's weights and intermediate state (KV cache, activations) exceed GPU VRAM, the runtime must either fail to load or offload part of the model to the CPU. In practice, frameworks handle this for you: llama.cpp (with --n-gpu-layers, short form -ngl), Ollama (via automatic VRAM detection, overridable with the num_gpu parameter), or Hugging Face Transformers (with device_map='auto'). Offload granularity is typically per-layer: each transformer layer is assigned to either the GPU or the CPU. The CPU runs its layers out of system RAM, which has far lower bandwidth than VRAM, and passing activations between GPU and CPU over PCIe adds further latency. Operators often use partial offload, keeping as many layers on the GPU as possible, to balance model size against speed. For example, a 70B model quantized to Q4 (~40 GB) on a 24 GB GPU must offload at least 40% of its layers to the CPU (more once the KV cache takes its share of VRAM), yielding ~2-5 tok/s instead of the ~20 tok/s it would reach if fully GPU-resident.
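
The per-layer split can be estimated with simple arithmetic. The sketch below is illustrative only: it assumes all layers are the same size and that a fixed amount of VRAM is reserved for the KV cache and buffers. The function name layers_that_fit and the 4 GB reserve are hypothetical choices, not values taken from any runtime.

```python
def layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 2.0) -> int:
    """Estimate how many equally sized layers fit in VRAM.

    reserve_gb is an assumed allowance for KV cache and compute buffers;
    real runtimes measure this rather than guessing.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# The 70B Q4 example: ~40 GB of weights, 80 layers, 24 GB GPU,
# with an assumed 4 GB held back for KV cache and buffers.
gpu_layers = layers_that_fit(model_gb=40.0, n_layers=80, vram_gb=24.0, reserve_gb=4.0)
cpu_share = 1 - gpu_layers / 80
print(f"{gpu_layers} layers on GPU, {cpu_share:.0%} of layers on CPU")
```

Under these assumptions about half the layers end up on the CPU; a smaller reserve moves the split closer to the 40% floor.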

Practical example

An operator with an RTX 3090 (24 GB VRAM) wants to run Llama 3.1 70B Q4_K_M (~40 GB). Without offload, the model won't load because it cannot fit in VRAM. Using llama.cpp with -ngl 20 (20 of the model's 80 layers on GPU, the rest on CPU), the runtime loads ~10 GB of weights into VRAM and the remaining ~30 GB into system RAM. Inference runs at ~3 tok/s, limited by CPU throughput and PCIe transfers. Dropping to -ngl 10 reduces VRAM usage further but slows generation to ~1 tok/s.
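
The numbers in this example can be sanity-checked with a back-of-the-envelope model. The sketch below assumes that generating each token reads every weight once, and that speed is bounded only by memory bandwidth. The bandwidth figures (900 GB/s for the GPU, 100 GB/s for system RAM) are placeholder assumptions, and the function names are hypothetical. It ignores PCIe transfers and compute time, so treat the result as a rough ceiling rather than a prediction.

```python
def split_weights(model_gb: float, n_layers: int, gpu_layers: int):
    """Return (gpu_gb, cpu_gb), assuming all layers are the same size."""
    gpu_gb = model_gb * gpu_layers / n_layers
    return gpu_gb, model_gb - gpu_gb


def estimate_tokens_per_second(gpu_gb: float, cpu_gb: float,
                               gpu_bw_gbps: float, cpu_bw_gbps: float) -> float:
    """Bandwidth-only estimate: each token reads every weight once."""
    seconds_per_token = gpu_gb / gpu_bw_gbps + cpu_gb / cpu_bw_gbps
    return 1.0 / seconds_per_token


# Assumed bandwidths (placeholders, not measurements).
GPU_BW = 900.0  # GB/s
CPU_BW = 100.0  # GB/s

for ngl in (20, 10):
    gpu_gb, cpu_gb = split_weights(model_gb=40.0, n_layers=80, gpu_layers=ngl)
    tps = estimate_tokens_per_second(gpu_gb, cpu_gb, GPU_BW, CPU_BW)
    print(f"-ngl {ngl}: {gpu_gb:.0f} GB on GPU, {cpu_gb:.0f} GB on CPU, ~{tps:.1f} tok/s ceiling")
```

The CPU term dominates the sum, which is why moving even a few more layers onto the GPU has a noticeable effect on speed.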

Workflow example

In Ollama, CPU offload is automatic: when you run ollama run llama3.1:70b on a 24 GB GPU, Ollama loads as many layers as fit into VRAM and offloads the rest to the CPU. You can check offload status in the server logs: "offloaded 20/80 layers to GPU". In llama.cpp, you control offload explicitly with --n-gpu-layers N (short form -ngl N). In Hugging Face Transformers, passing device_map='auto' to from_pretrained splits the model across GPU and CPU. LM Studio shows a slider for GPU layers in the model settings.
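
If you want to check offload status from a script rather than by eye, the log line can be parsed. This is a minimal sketch that assumes the line has exactly the form quoted above; the wording may differ between versions, and the helper name parse_offload is hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches the log fragment quoted above, e.g. "offloaded 20/80 layers to GPU".
OFFLOAD_RE = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")


def parse_offload(log_text: str) -> Optional[Tuple[int, int]]:
    """Return (gpu_layers, total_layers) from log text, or None if absent."""
    match = OFFLOAD_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = "load_tensors: offloaded 20/80 layers to GPU"  # illustrative line
result = parse_offload(sample)
if result is not None:
    gpu_layers, total_layers = result
    print(f"{gpu_layers} of {total_layers} layers on GPU "
          f"({gpu_layers / total_layers:.0%})")
```

A result where the first number is lower than the second means part of the model is running on the CPU.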

Reviewed by Fredoline Eruo. See our editorial policy.