RUNLOCALAIv38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
Ollama — Installation to Mastery

08. GPU vs CPU Inference

Chapter 8 of 20 · 20 min
KEY INSIGHT

Ollama auto-detects GPUs but may fall back to CPU if drivers are missing or memory is insufficient. Check `ollama ps` after loading a model to verify which processor is active.

Ollama automatically detects available GPU hardware and uses it for inference when a compatible GPU is present. Understanding when GPU acceleration is active, and why it sometimes fails, helps you optimize performance.

Automatic GPU Detection

Ollama checks for GPUs at startup:

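  • NVIDIA GPUs - Require a recent NVIDIA driver. A native install does not need the full CUDA toolkit, because Ollama ships its own CUDA runtime libraries; nvidia-container-toolkit is needed only when Ollama runs inside Docker. Running nvidia-smi is a quick way to confirm the driver is working.
  • AMD GPUs - Require ROCm support on Linux. Ollama detects supported AMD GPUs via ROCm APIs.
  • Apple Silicon - Uses the Metal GPU framework automatically on M-series chips (M1 and later).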

You can verify GPU usage with ollama ps:

ollama ps

The output includes a PROCESSOR column:

NAME           ID         SIZE      PROCESSOR    UNTIL
llama3.2:3b    a3fe239    2.0 GB    100% GPU     4 minutes from now

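If no GPU is available, the PROCESSOR column shows 100% CPU. A split value such as 48%/52% CPU/GPU means the model did not fit entirely in VRAM and was loaded partly into system memory.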
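To check the PROCESSOR value from a script rather than by eye, the table can be parsed as text. This is a minimal sketch, assuming columns are separated by two or more spaces (as in the sample above); `parse_ollama_ps` and `is_fully_on_gpu` are hypothetical helper names, not part of Ollama.

```python
import re


def parse_ollama_ps(output: str) -> dict[str, str]:
    """Map each loaded model name to its PROCESSOR value."""
    # Values such as "100% GPU" contain single spaces, so split on 2+ spaces.
    rows = [re.split(r"\s{2,}", ln.strip()) for ln in output.splitlines() if ln.strip()]
    if not rows:
        return {}
    header, *models = rows
    col = header.index("PROCESSOR")
    return {row[0]: row[col] for row in models if len(row) > col}


def is_fully_on_gpu(processor: str) -> bool:
    """True only when the whole model sits in GPU memory."""
    return processor.strip() == "100% GPU"


# Sample text in the same layout as `ollama ps` output (illustrative values).
SAMPLE = """\
NAME           ID         SIZE      PROCESSOR    UNTIL
llama3.2:3b    a3fe239    2.0 GB    100% GPU     4 minutes from now
"""

for name, processor in parse_ollama_ps(SAMPLE).items():
    print(name, processor, is_fully_on_gpu(processor))
```

To run it against a live system, pass `subprocess.run(["ollama", "ps"], capture_output=True, text=True).stdout` in place of `SAMPLE`.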

Environment Variables for GPU Control

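Variable                Default    Effect
OLLAMA_GPU_OVERHEAD     0          VRAM to reserve on each GPU (bytes)
OLLAMA_MAX_VRAM         Auto       Maximum VRAM per model (bytes)
CUDA_VISIBLE_DEVICES    All        NVIDIA GPU device IDs Ollama may use
OLLAMA_NUM_GPU          Auto       Number of GPUs for model layers

CUDA_VISIBLE_DEVICES applies to NVIDIA cards only; the AMD equivalent is ROCR_VISIBLE_DEVICES. The set of supported variables changes between Ollama releases, so run `ollama serve --help` to see which ones your version recognizes.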
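Because the memory settings are given in bytes, it is easy to be off by a factor of 1024. A small sketch of the conversion follows; the 512 MiB figure is only an illustration, and `mib_to_bytes` and `server_env` are hypothetical helpers.

```python
import os
from typing import Optional

MIB = 1024 ** 2  # bytes per mebibyte


def mib_to_bytes(mib: int) -> int:
    """Convert mebibytes to the byte count the variable expects."""
    return mib * MIB


def server_env(overhead_mib: int, base: Optional[dict] = None) -> dict:
    """Copy an environment and add OLLAMA_GPU_OVERHEAD as a byte string."""
    env = dict(os.environ if base is None else base)
    env["OLLAMA_GPU_OVERHEAD"] = str(mib_to_bytes(overhead_mib))
    return env


print(mib_to_bytes(512))
```

The resulting mapping can be handed to the server process, for example `subprocess.Popen(["ollama", "serve"], env=server_env(512))`.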

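Force CPU-only mode if GPU inference causes issues. The variable must be set for the Ollama server process (ollama serve), not for the ollama run client, and an invalid device ID such as -1 hides every NVIDIA GPU. On Apple Silicon this variable has no effect.

# Linux (stop any Ollama instance that is already running first)
CUDA_VISIBLE_DEVICES=-1 ollama serve

# Windows PowerShell
$env:CUDA_VISIBLE_DEVICES = "-1"
ollama serve

# Then, in a second terminal
ollama run llama3.2:3b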
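The environment variable affects the whole server. To keep a single request on the CPU instead, one option is the `num_gpu` request option, which sets how many model layers are offloaded to the GPU. The sketch below assumes a server listening on the default `http://localhost:11434` and an Ollama version that accepts `num_gpu` in `options`; `cpu_only_payload` and `generate_on_cpu` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default address


def cpu_only_payload(model: str, prompt: str) -> dict:
    """Build a generate request that offloads zero layers to the GPU."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,            # return one JSON object instead of a stream
        "options": {"num_gpu": 0},  # 0 offloaded layers, so inference stays on the CPU
    }


def generate_on_cpu(model: str, prompt: str, url: str = OLLAMA_URL) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    body = json.dumps(cpu_only_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate_on_cpu("llama3.2:3b", "Hello")` and then running `ollama ps` should show whether the model was loaded on the CPU.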

Performance Comparison

A benchmark comparing llama3.2:3b on CPU versus GPU (RTX 3060):

Metric          CPU (i7-10700)    GPU (RTX 3060)
Load time       45 s              8 s
Tokens/sec      8                 42
Memory usage    6.4 GB            2.1 GB + GPU

In this benchmark, the GPU cuts load time from 45 s to 8 s (about 5.6x faster) and raises throughput from 8 to 42 tokens/sec (about 5.3x). The CPU still handles parts of the pipeline (tokenization, post-processing).
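The ratios behind that comparison are simple to reproduce, which is handy when you substitute your own measurements. The sketch below uses only the benchmark figures quoted above; the 500-token reply length is an arbitrary illustration, and the helper names are hypothetical.

```python
def speedup(baseline: float, improved: float, lower_is_better: bool = False) -> float:
    """Return how many times better `improved` is than `baseline`."""
    return baseline / improved if lower_is_better else improved / baseline


def generation_seconds(tokens: int, tokens_per_sec: float) -> float:
    """Time to generate `tokens` at a steady rate, ignoring load time."""
    return tokens / tokens_per_sec


load_speedup = speedup(45, 8, lower_is_better=True)  # load time in seconds
throughput_speedup = speedup(8, 42)                   # tokens per second

print(f"load: {load_speedup:.2f}x, throughput: {throughput_speedup:.2f}x")
print(f"500 tokens on CPU: {generation_seconds(500, 8):.1f} s")
print(f"500 tokens on GPU: {generation_seconds(500, 42):.1f} s")
```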

EXERCISE

Run ollama ps after loading a model. If you have a GPU, verify that the PROCESSOR column shows 100% GPU. If it shows CPU instead, check your GPU driver version and CUDA installation.
