08. GPU vs CPU Inference
Ollama automatically detects available GPU hardware and uses it for inference when a compatible GPU is present. Understanding when GPU acceleration is active, and why it sometimes fails, helps you optimize performance.
Automatic GPU Detection
Ollama checks for GPUs at startup:
- NVIDIA GPUs - Requires the NVIDIA driver; nvidia-container-toolkit is needed only when Ollama runs in Docker. Ollama looks for nvidia-smi and loads the CUDA runtime.
- AMD GPUs - Requires ROCm on Linux. Ollama detects AMD GPUs via ROCm APIs.
- Apple Silicon - Uses Metal GPU framework automatically on M1/M2/M3 chips.
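As a rough pre-flight check, the sketch below guesses which of the backends listed above is likely to apply on a given machine. It is not Ollama's own detection logic: the helper name `guess_backend` and the heuristic of looking for the `nvidia-smi` and `rocminfo` executables on PATH are assumptions made for illustration.

```python
import platform
import shutil


def guess_backend(system, machine, has_nvidia_smi, has_rocminfo):
    """Guess the likely GPU backend from basic facts about the machine.

    Hypothetical helper: a heuristic only, not how Ollama itself detects GPUs.
    """
    if system == "Darwin" and machine == "arm64":
        return "metal"  # Apple Silicon
    if has_nvidia_smi:
        return "cuda"  # NVIDIA driver tools are installed
    if system == "Linux" and has_rocminfo:
        return "rocm"  # AMD ROCm tools are installed
    return "cpu"


# Apply the heuristic to the current machine.
print(guess_backend(
    platform.system(),
    platform.machine(),
    shutil.which("nvidia-smi") is not None,
    shutil.which("rocminfo") is not None,
))
```

A result of `cpu` here only means that none of the expected tools were found; `ollama ps` remains the authoritative check.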
You can verify GPU usage with ollama ps:
ollama ps
The output includes a PROCESSOR column:
NAME ID SIZE PROCESSOR UNTIL
llama3.2:3b a3fe239 2.0GB 100% GPU 5 minutes from now
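If no GPU is available, the PROCESSOR column shows 100% CPU. A split value such as 48%/52% CPU/GPU means the model did not fit entirely in VRAM and is only partially offloaded to the GPU.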
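When this check is scripted, the PROCESSOR value can be reduced to a single number. The following is a minimal sketch, assuming the value takes one of the forms `100% GPU`, `100% CPU`, or a split such as `48%/52% CPU/GPU`; the helper name `gpu_percent` is hypothetical.

```python
def gpu_percent(processor):
    """Return the GPU share (0-100) from a PROCESSOR value of `ollama ps`.

    Assumes values shaped like "100% GPU", "100% CPU" or "48%/52% CPU/GPU".
    """
    shares, kinds = processor.split()  # e.g. "48%/52%" and "CPU/GPU"
    values = [int(share.rstrip("%")) for share in shares.split("/")]
    by_kind = dict(zip(kinds.split("/"), values))
    return by_kind.get("GPU", 0)  # no GPU entry means CPU-only


for value in ("100% GPU", "100% CPU", "48%/52% CPU/GPU"):
    print(value, "->", gpu_percent(value))
```

Anything below 100 suggests that part of the model is running on the CPU, which usually shows up as lower throughput.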
Environment Variables for GPU Control
| Variable | Default | Effect |
|---|---|---|
| OLLAMA_GPU_OVERHEAD | 0 | Memory reserved for system (bytes) |
| OLLAMA_MAX_VRAM | Auto | Maximum VRAM per model (bytes) |
| CUDA_VISIBLE_DEVICES | All | GPU device IDs to use |
| OLLAMA_NUM_GPU | Auto | Number of GPUs for model layers |
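These variables are read by the Ollama server process, so they need to be present in its environment when it starts. The sketch below builds such an environment from two of the variables in the table; the helper name `server_env` and the chosen values are illustrative assumptions, not recommended settings.

```python
import os


def server_env(gpu_ids, overhead_bytes=0, base=None):
    """Build an environment for the Ollama server with GPU settings applied.

    Hypothetical helper; `base` defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    # Comma-separated device IDs, e.g. "0,1".
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    # Memory to reserve, in bytes, as described in the table above.
    env["OLLAMA_GPU_OVERHEAD"] = str(overhead_bytes)
    return env


# Use only GPU 0 and reserve 512 MiB.
print(server_env([0], overhead_bytes=512 * 1024**2, base={}))

# To start the server with these settings (assumes `ollama` is on PATH):
# import subprocess
# subprocess.run(["ollama", "serve"], env=server_env([0], 512 * 1024**2))
```

Passing the environment explicitly keeps the settings scoped to one server process instead of changing the whole shell session.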
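Force CPU-only mode if GPU inference causes issues. Set the variable in the environment of the Ollama server (ollama serve), not the ollama run client, because the server is the process that loads the model. An invalid device ID such as -1 hides every NVIDIA GPU. Stop any Ollama instance that is already running, start the server as shown, then run ollama run llama3.2:3b from a second terminal:
# Linux
CUDA_VISIBLE_DEVICES=-1 ollama serve
# Windows PowerShell
$env:CUDA_VISIBLE_DEVICES = "-1"
ollama serve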
Performance Comparison
A benchmark comparing llama3.2:3b on CPU versus GPU (RTX 3060):
| Metric | CPU (i7-10700) | GPU (RTX 3060) |
|---|---|---|
| Load time | 45s | 8s |
| Tokens/sec | 8 | 42 |
| Memory usage | 6.4 GB RAM | 2.1 GB RAM + GPU VRAM |
In this benchmark, GPU acceleration cuts load time from 45 s to 8 s and raises throughput from 8 to 42 tokens/sec, a more than fivefold improvement on both counts. The CPU still handles parts of the pipeline (tokenization, post-processing).
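To see what these figures mean for a single request, the sketch below estimates the end-to-end time of a cold-start generation of 500 tokens from the table values. It assumes a steady generation rate and ignores prompt evaluation time; the helper name `total_seconds` and the 500-token length are illustrative choices.

```python
def total_seconds(load_s, tokens_per_s, n_tokens):
    """Load time plus generation time for n_tokens at a steady rate."""
    return load_s + n_tokens / tokens_per_s


cpu = total_seconds(45, 8, 500)  # 45 + 62.5 = 107.5 s
gpu = total_seconds(8, 42, 500)  # 8 + 11.9 = 19.9 s
print(f"CPU: {cpu:.1f}s  GPU: {gpu:.1f}s  speedup: {cpu / gpu:.1f}x")
# prints: CPU: 107.5s  GPU: 19.9s  speedup: 5.4x
```

Once the model is already loaded, the load term drops out and the speedup approaches the plain throughput ratio of 42 / 8.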
Run ollama ps after loading a model. If you have a GPU, verify that the PROCESSOR column shows GPU. If it shows CPU instead, check your GPU driver version and CUDA installation.
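Throughput can also be measured directly, which helps confirm that the GPU is delivering the expected speed. The sketch below assumes a local server on the default address http://localhost:11434 and relies on the `eval_count` and `eval_duration` (nanoseconds) fields of a non-streaming /api/generate response; the helper names `tokens_per_second` and `measure` are hypothetical.

```python
import json
import urllib.request


def tokens_per_second(response):
    """Generation speed from a non-streaming /api/generate response.

    eval_duration is reported in nanoseconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def measure(model, prompt, host="http://localhost:11434"):
    """Run one prompt against a local Ollama server and return tokens/sec."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))


# Example (requires a running server with the model already pulled):
# print(measure("llama3.2:3b", "Why is the sky blue?"))
```

Running the same prompt once with the GPU enabled and once in CPU-only mode gives a like-for-like comparison on your own hardware.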