How to use Ollama with GPU acceleration (NVIDIA)
Prerequisites: an NVIDIA GPU with CUDA support, Ollama installed, and CUDA-compatible drivers configured.
What this does
Confirms that Ollama runs model inference on the NVIDIA GPU through CUDA, which speeds up token generation compared with CPU-only processing, and shows how to control how many model layers are offloaded. After completion, model inference runs primarily on the GPU.
Steps
Verify NVIDIA drivers are active. CUDA acceleration requires the host drivers to be functioning before Ollama can use the GPU.
nvidia-smi
Expected output: a table showing GPU name, temperature, memory usage, and driver version.
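The driver check can also be scripted. The following is a minimal sketch, assuming a POSIX shell and GPU names that contain no commas; the function names summarize_gpu_row and check_gpu are hypothetical helpers, not part of Ollama or the NVIDIA tools. It uses nvidia-smi's CSV query mode to print one summary line per GPU.

```shell
#!/bin/sh
# Turn one CSV row from nvidia-smi into a readable summary.
# Assumed row format: name, driver_version, memory.total, memory.used
summarize_gpu_row() {
  printf '%s\n' "$1" |
    awk -F', ' '{ printf "%s | driver %s | %s of %s used\n", $1, $2, $4, $3 }'
}

# Print a summary line per GPU, or fail with a hint if the driver tools are missing.
check_gpu() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: install the NVIDIA driver first" >&2
    return 1
  fi
  nvidia-smi --query-gpu=name,driver_version,memory.total,memory.used \
             --format=csv,noheader |
    while IFS= read -r row; do
      summarize_gpu_row "$row"
    done
}

check_gpu || echo "GPU check failed" >&2
```

The memory figures give a rough idea of how much VRAM is free before a model is loaded.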
Pull a model that supports GPU offload. Starting with a small quantized model avoids VRAM errors.
ollama pull llama3.2:3b
Expected output: the model manifest downloads and layers are pulled successfully.
Run the model with GPU offload. By default Ollama offloads as many layers as fit in VRAM. To set the number of offloaded layers explicitly, use the num_gpu parameter, which can be changed inside an interactive session.
ollama run llama3.2:3b
>>> /set parameter num_gpu 16
>>> Explain quantum entanglement in one sentence.
Expected output: a response generated at GPU-accelerated speed (often 3-5x faster than CPU, depending on the hardware and the model).
Tune offload for the VRAM budget. Offloading more layers leaves less work on the CPU but demands more video memory. A value at or above the model's layer count should place the whole model on the GPU.
>>> /set parameter num_gpu 99
>>> Show me a short story.
Lower the value if an out-of-memory error appears.
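The number of offloaded layers can also be set per request through the num_gpu option of Ollama's HTTP API. The sketch below assumes a local server on the default address http://localhost:11434 and a prompt without double quotes or backslashes; build_request and generate_with_layers are hypothetical helper names introduced for illustration.

```shell
#!/bin/sh
# Build the JSON body for the /api/generate endpoint.
# Arguments: model name, prompt, number of layers to offload to the GPU.
# The prompt is inserted verbatim, so this sketch does no JSON escaping.
build_request() {
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_gpu":%d}}\n' \
    "$1" "$2" "$3"
}

# Send the request to a local Ollama server (default address assumed).
generate_with_layers() {
  build_request "$1" "$2" "$3" |
    curl -s http://localhost:11434/api/generate -d @-
}

# Example call, left commented out so that sourcing this file sends nothing:
# generate_with_layers llama3.2:3b "Explain quantum entanglement in one sentence." 16
```

Repeating the call with a different layer count is a simple way to find the largest value that does not trigger an out-of-memory error.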
Verification
nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv
# Expected: non-zero GPU utilization while a model request is in flight
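Because utilization is only non-zero while a request is running, a single query can miss it. A minimal polling sketch follows, assuming a single-GPU host (only the first row is read); gpu_is_busy and watch_gpu are hypothetical helper names.

```shell
#!/bin/sh
# Succeed if a CSV row such as "37 %, 12 %" reports non-zero GPU utilization.
gpu_is_busy() {
  util=$(printf '%s\n' "$1" |
    awk -F',' '{ gsub(/[^0-9]/, "", $1); print $1 + 0 }')
  [ "$util" -gt 0 ]
}

# Sample once per second, up to N times (default 10), and stop at the first busy sample.
watch_gpu() {
  samples=${1:-10}
  i=0
  while [ "$i" -lt "$samples" ]; do
    row=$(nvidia-smi --query-gpu=utilization.gpu,utilization.memory \
                     --format=csv,noheader | head -n 1)
    if gpu_is_busy "$row"; then
      echo "GPU busy: $row"
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "GPU stayed idle for $samples samples" >&2
  return 1
}
```

Start a model request in one terminal, then run watch_gpu 15 in another; an idle result for the whole window suggests inference is running on the CPU.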
Common failures
- "CUDA out of memory": The model layers exceed available VRAM. Lower the number of offloaded layers with the num_gpu parameter (try 25) or switch to a smaller quantized model.
- "nvidia-smi: command not found": NVIDIA drivers are not installed. Install the proprietary driver for the GPU before proceeding.
- Model runs but is slow: The GPU is probably not being used. If Ollama was installed from a distribution package, confirm that the CUDA-enabled build is installed (some distributions ship a separate CPU-only package).
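- Ollama does not detect GPU: Run ollama ps while a model is loaded to see whether it is placed on the CPU or the GPU. On Linux systems where Ollama runs as a systemd service, check the logs with journalctl -u ollama --no-pager -n 30 for CUDA-related errors.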
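The placement check for the last two failures can be sketched as a small filter over the output of ollama ps. The sketch assumes the PROCESSOR column reads "100% GPU", "100% CPU", or a split such as "48%/52% CPU/GPU"; classify_processor is a hypothetical helper name, and the exact column layout may differ between Ollama versions.

```shell
#!/bin/sh
# Classify where each loaded model runs, reading `ollama ps` output from stdin.
# The first line is the column header and is skipped.
classify_processor() {
  awk 'NR > 1 {
    if ($0 ~ /CPU\/GPU/)      print $1 ": split between CPU and GPU"
    else if ($0 ~ /100% GPU/) print $1 ": fully on GPU"
    else if ($0 ~ /100% CPU/) print $1 ": CPU only"
    else                      print $1 ": unknown"
  }'
}

# Usage while a model is loaded (commented out so sourcing runs nothing):
# ollama ps | classify_processor
# journalctl -u ollama --no-pager -n 200 | grep -i -E 'cuda|gpu'
```

A "CPU only" result together with CUDA errors in the log points to a driver or library problem rather than a VRAM limit.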