How to use Ollama with GPU acceleration (NVIDIA)

Intermediate · 20 min · By Fredoline Eruo
Target environment
  • Ubuntu 24.04 · Ollama 0.4.x
  • Windows 11 · Ollama 0.4.x
  • macOS 15 · Ollama 0.4.x
PREREQUISITES

NVIDIA GPU with CUDA support, Ollama installed, and CUDA-compatible drivers configured

What this does

Configures Ollama to route model inference through the NVIDIA GPU via CUDA, accelerating token generation compared to CPU-only processing. After completion, model inference runs primarily on the GPU.

Steps

  1. Verify NVIDIA drivers are active. CUDA acceleration requires the host drivers to be functioning before Ollama can use the GPU.

    nvidia-smi
    

    Expected output: a table showing GPU name, temperature, memory usage, and driver version.

  2. Pull a model that supports GPU offload. Starting with a small quantized model avoids VRAM errors.

    ollama pull llama3.2:3b
    

    Expected output: the model manifest downloads and layers are pulled successfully.

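  3. Run the model. Ollama detects a working CUDA device and offloads as many layers as fit in VRAM automatically, so no extra environment variable is needed. OLLAMA_GPU_LAYERS is not among the environment variables Ollama documents; the number of offloaded layers is controlled by the num_gpu model parameter.

    ollama run llama3.2:3b "Explain quantum entanglement in one sentence."
    

    Expected output: a response generated noticeably faster than on CPU alone. The exact speedup depends on the GPU, the model, and how many layers fit in VRAM.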

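  4. Tune offload for the VRAM budget. The num_gpu parameter sets how many model layers are sent to the GPU (a layer count, not a percentage). Offloading more layers leaves less work on the CPU but demands more video memory. Set it from inside an interactive session:

    ollama run llama3.2:3b
    >>> /set parameter num_gpu 20
    >>> Show me a short story.
    

    Lower the value if an out-of-memory error appears; raise it while free VRAM remains.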
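
One way to pick a starting layer count before tuning by hand is to scale the model's layer count by the share of the model that fits in free VRAM. The sketch below is only a rough heuristic, not Ollama's own allocation logic: it assumes every layer is the same size and that a fixed reserve covers the KV cache and CUDA overhead. The function name estimate_layers, its arguments, and the 512 MiB default reserve are all hypothetical.

```shell
# Rough starting point for the number of layers to offload.
# Assumptions (simplifications, not Ollama's logic): uniform layer size,
# and a fixed reserve for KV cache and CUDA overhead.
# Usage: estimate_layers FREE_MIB MODEL_MIB TOTAL_LAYERS [RESERVE_MIB]
estimate_layers() (
  free_mib=$1
  model_mib=$2
  total_layers=$3
  reserve_mib=${4:-512}

  usable=$(( $free_mib - $reserve_mib ))
  if [ "$usable" -le 0 ]; then
    # Nothing left after the reserve: keep every layer on the CPU.
    echo 0
  else
    # Integer share of the layers that fits in the usable VRAM.
    layers=$(( $usable * $total_layers / $model_mib ))
    if [ "$layers" -gt "$total_layers" ]; then
      layers=$total_layers
    fi
    echo "$layers"
  fi
)
```

For example, with 1536 MiB free, a model of about 2048 MiB, and 28 layers (illustrative figures), estimate_layers 1536 2048 28 prints 14. Free VRAM in MiB can be read with nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits.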

Verification

nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv
# Expected: non-zero GPU utilization while a model request is in flight
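
A single query can miss a short request, so it may help to sample utilization repeatedly while the request runs and keep the peak. The sketch below assumes nvidia-smi CSV rows without header or units on stdin; peak_gpu_util is a hypothetical helper name, and the timeout command in the usage comment assumes GNU coreutils (Linux).

```shell
# Print the highest GPU utilization seen in a stream of
# "gpu_util, mem_util" CSV rows (no header, no units).
peak_gpu_util() {
  awk -F', *' '
    BEGIN { max = 0 }
    ($1 + 0) > max { max = $1 + 0 }   # first column is utilization.gpu
    END { print max }
  '
}

# Example: sample once per second for 10 seconds while a model request
# is in flight in another terminal.
#   timeout 10 nvidia-smi \
#     --query-gpu=utilization.gpu,utilization.memory \
#     --format=csv,noheader,nounits -l 1 | peak_gpu_util
```

A peak of 0 across the whole sampling window suggests the request never touched the GPU.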

Common failures

  • "CUDA out of memory": The offloaded layers exceed available VRAM. Offload fewer layers by lowering the num_gpu parameter (the number of layers sent to the GPU; for example, /set parameter num_gpu 10 in an interactive session), or switch to a smaller or more heavily quantized model.
  • "nvidia-smi: command not found": The NVIDIA driver is not installed, or its tools are not on PATH. Install the proprietary driver for the GPU before proceeding.
  • Model runs but is slow: The GPU may not be in use. Run ollama ps while the model is loaded and check the PROCESSOR column, which should read mostly or entirely GPU rather than CPU.
  • Ollama does not detect GPU: Check the server logs for CUDA-related errors. On Linux installs that run Ollama as a systemd service, use journalctl -u ollama --no-pager -n 30.
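
Before working through the failures above, a quick preflight can confirm that the tools this guide relies on are on PATH at all. This is a minimal sketch; check_tools is a hypothetical helper, not part of Ollama or the NVIDIA driver.

```shell
# Report which of the given commands are available on PATH.
# Exit status is 0 only if every command was found.
check_tools() (
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
)

# Example:
#   check_tools nvidia-smi ollama || echo "Install the missing tools first."
```

A "missing: nvidia-smi" line points at the driver installation, while "missing: ollama" points at the Ollama install itself.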
