GPU vs CPU Inference — Ollama — Installation to Mastery (Chapter 8)

Ollama automatically detects available GPU hardware and uses it for inference when a compatible GPU is present. Understanding when GPU acceleration is active, and why it sometimes fails, helps you optimize performance.

Automatic GPU Detection

Ollama checks for GPUs at startup:

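NVIDIA GPUs - Requires the NVIDIA driver. The CUDA toolkit is not needed for inference, and nvidia-container-toolkit is needed only when Ollama runs inside Docker. Run nvidia-smi to confirm that the driver sees the GPU.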
AMD GPUs - Requires ROCm on Linux. Ollama detects AMD GPUs via ROCm APIs.
Apple Silicon - Uses Metal GPU framework automatically on M1/M2/M3 chips.
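
The list above can be turned into a quick pre-flight check. The following is a minimal sketch that only looks for the vendor tools on the machine; it is a heuristic and not how Ollama itself performs detection. The function name guess_gpu_backend is hypothetical.

```shell
#!/bin/sh
# Guess which GPU backend this machine could offer to Ollama.
# Heuristic only: checks for vendor tools, not Ollama's own detection logic.
guess_gpu_backend() {
    if [ "$(uname -s)" = "Darwin" ] && [ "$(uname -m)" = "arm64" ]; then
        echo "metal"    # Apple Silicon
    elif command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1; then
        echo "cuda"     # NVIDIA driver installed and at least one GPU listed
    elif command -v rocminfo >/dev/null 2>&1; then
        echo "rocm"     # ROCm tools installed
    else
        echo "cpu"      # no GPU tooling found
    fi
}

guess_gpu_backend
```

If the script prints cpu on a machine that has a GPU, the driver installation is the first thing to check.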

You can verify GPU usage with ollama ps:

ollama ps

The PROCESSOR column of the output shows where the model is loaded:

NAME            ID      SIZE      PROCESSOR    UNTIL
llama3.2:3b     a3fe239 2.0GB     100% GPU     5 minutes from now

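If no GPU is used, the PROCESSOR column shows 100% CPU. If the model only partly fits in VRAM, it shows a split such as 48%/52% CPU/GPU, which means some layers run on the CPU.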
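
The PROCESSOR column can also be read from a script, for example in a health check. This is a minimal sketch, assuming the column holds values of the form 100% GPU, 100% CPU, or a split such as 48%/52% CPU/GPU. The helper name ps_processor is hypothetical and is not an Ollama command.

```shell
#!/bin/sh
# Print "<model><TAB><processor>" for each loaded model.
# Reads `ollama ps` output on stdin and skips the header line.
ps_processor() {
    awk 'NR > 1 && match($0, /[0-9]+%(\/[0-9]+%)? (CPU\/GPU|CPU|GPU)/) {
        print $1 "\t" substr($0, RSTART, RLENGTH)
    }'
}

# Only query the server when the ollama CLI is installed.
if command -v ollama >/dev/null 2>&1; then
    ollama ps | ps_processor
fi
```

Matching the value by pattern rather than by field position keeps the helper working when other columns contain spaces.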

Environment Variables for GPU Control

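Variable	Default	Effect
`OLLAMA_GPU_OVERHEAD`	`0`	VRAM reserved per GPU (bytes)
`OLLAMA_MAX_VRAM`	Auto	Maximum VRAM per model (bytes)
`CUDA_VISIBLE_DEVICES`	All	Comma-separated NVIDIA GPU device IDs to use
`OLLAMA_NUM_GPU`	Auto	Number of GPUs for model layers

The set of supported variables changes between releases; ollama serve --help lists the ones your installed version recognizes.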
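
Byte-valued settings such as OLLAMA_GPU_OVERHEAD take plain integers, which are easy to mistype. The sketch below converts whole GiB to bytes with shell arithmetic and exports two of the variables from the table. The values (1 GiB, GPU 0) are illustrative assumptions, and gib_to_bytes is a hypothetical helper.

```shell
#!/bin/sh
# Convert whole GiB to bytes using shell arithmetic.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Example values only: reserve 1 GiB of VRAM and expose only GPU 0.
OLLAMA_GPU_OVERHEAD="$(gib_to_bytes 1)"
CUDA_VISIBLE_DEVICES=0
export OLLAMA_GPU_OVERHEAD CUDA_VISIBLE_DEVICES
echo "OLLAMA_GPU_OVERHEAD=$OLLAMA_GPU_OVERHEAD"

# Start the server from this same shell so it inherits both variables:
# ollama serve
```

The variables must be present in the environment of the server process; exporting them in a different shell from the one that runs the server has no effect.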

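To force CPU-only mode when GPU inference causes problems, hide the GPUs from the Ollama server by giving it an invalid device ID. The variable must be set for the server process; setting it only for ollama run has no effect when a server is already running.

# Linux (stop any running Ollama server first)
CUDA_VISIBLE_DEVICES=-1 ollama serve
# In a second terminal:
ollama run llama3.2:3b

# Windows PowerShell (quit the running Ollama app first)
$env:CUDA_VISIBLE_DEVICES = "-1"
ollama serve
# In a second terminal:
ollama run llama3.2:3b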

Performance Comparison

A benchmark comparing llama3.2:3b on CPU versus GPU (RTX 3060):

Metric	CPU (i7-10700)	GPU (RTX 3060)
Load time	45s	8s
Tokens/sec	8	42
Memory usage	6.4 GB	2.1 GB + GPU

In this benchmark, GPU acceleration cuts load time from 45 s to 8 s and raises throughput from 8 to 42 tokens/sec, roughly a fivefold improvement in each. The CPU still handles parts of the pipeline (tokenization, post-processing).
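
To reproduce this kind of comparison on your own hardware, the statistics printed by ollama run --verbose include an "eval rate" line in tokens/s. The sketch below extracts that figure and computes a speedup ratio. Both helper names (eval_rate, speedup) are hypothetical, and the prompt in the comment is only an example.

```shell
#!/bin/sh
# Extract the generation speed (tokens/s) from `ollama run --verbose` statistics on stdin.
# Anchoring at line start skips the separate "prompt eval rate:" line.
eval_rate() {
    awk '/^eval rate:/ { print $3 }'
}

# Ratio of two throughput figures, e.g. GPU tokens/s divided by CPU tokens/s.
speedup() {
    awk -v fast="$1" -v slow="$2" 'BEGIN { printf "%.2f\n", fast / slow }'
}

# Throughput figures from the table above.
speedup 42 8

# With a running server, a measurement could look like this:
# ollama run llama3.2:3b "Explain VRAM in one sentence." --verbose 2>&1 | eval_rate
```

Run the measurement once with the GPU enabled and once in CPU-only mode, then pass the two rates to speedup. Use the same prompt for both runs, and discard the first run if the model was not yet loaded.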