Performance Optimization — Local AI on Windows (Chapter 13)

Performance on Windows depends on five factors: GPU VRAM, GPU utilization, CPU-to-GPU bandwidth, RAM availability, and storage speed for model loading.

GPU utilization:

Verify the GPU is actually being used during inference:

nvidia-smi dmon -s u
# gpu   sm   mem   enc   dec
#  0    57    45    0     0     # sm = SM (compute) utilization %, mem = memory controller utilization %

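If SM stays below 50% while tokens are being generated, the GPU is waiting on something else: usually the model is too small to saturate it, or CPU-side preprocessing is the bottleneck. For small models (under 3B parameters), CPU inference may even be faster than GPU inference because of transfer overhead, so benchmark both before assuming the GPU wins.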
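Rather than eyeballing the stream, the SM column can be averaged over a few samples. The following is a minimal sketch, assuming output in the `gpu sm mem enc dec` layout shown above; the function name `mean_sm_utilization` is hypothetical, and because column layout can vary between driver versions the parser locates the `sm` column from the header line instead of hard-coding its position.

```python
def mean_sm_utilization(dmon_output: str) -> float:
    """Average the 'sm' column of nvidia-smi dmon text output."""
    sm_index = None
    samples = []
    for line in dmon_output.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Header lines start with '#'; find which column is 'sm'.
            cols = line.lstrip("#").split()
            if "sm" in cols:
                sm_index = cols.index("sm")
            continue
        if sm_index is None:
            continue  # no header seen yet, cannot interpret data rows
        fields = line.split()
        # Unsupported metrics are printed as '-', so keep numeric values only.
        if len(fields) > sm_index and fields[sm_index].isdigit():
            samples.append(int(fields[sm_index]))
    if not samples:
        raise ValueError("no SM samples found in dmon output")
    return sum(samples) / len(samples)


# Illustrative sample text, not a real measurement.
sample = """\
# gpu     sm    mem    enc    dec
# Idx      %      %      %      %
    0     57     45      0      0
    0     43     40      0      0
"""
print(mean_sm_utilization(sample))
```

To feed it live data, one option is to capture a fixed number of samples with `subprocess.run(["nvidia-smi", "dmon", "-s", "u", "-c", "10"], capture_output=True, text=True).stdout` while an inference request is running.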

Batch size and context length:

Ollama's default settings work for single-user interactive use. For batch processing, set these parameters in a Modelfile:

FROM llama3.2:1b
PARAMETER num_batch 512
PARAMETER num_ctx 4096
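# num_gpu = number of layers to offload to the GPU; a value above the layer count offloads all of them
PARAMETER num_gpu 99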

Rebuild with ollama create my-tuned-model -f Modelfile.
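The same options can also be supplied per request through Ollama's local HTTP API, which avoids rebuilding a model while experimenting. The sketch below assumes the default endpoint `http://localhost:11434/api/generate` and a running server; `build_request` and `generate` are hypothetical helper names, and the values are starting points rather than recommendations.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if the server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str,
                  num_ctx: int = 4096, num_batch: int = 512) -> dict:
    """Build a non-streaming generate request with per-request options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "num_batch": num_batch},
    }


def generate(payload: dict) -> str:
    """Send the request to a running Ollama server and return the text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


payload = build_request("llama3.2:1b", "Say hello.", num_ctx=2048)
print(json.dumps(payload, indent=2))
# Uncomment once the Ollama server is running:
# print(generate(payload))
```

Once a combination performs well, moving it into the Modelfile makes it the default for every client of that model.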

Quantization impact:

Lower-bit quantization (Q2, Q3) uses less VRAM but degrades output quality. Q4_0 is the practical minimum for most use cases. Q5_K_M gives near-Q6 quality at Q5 memory cost. A 7B Q4 model needs about 4.2 GB of VRAM, and the requirement grows with context length because the KV cache is stored alongside the weights. A 7B Q2 model needs about 2.9 GB but produces noticeably lower quality on tasks requiring precise factual recall.
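The Q4 figure can be sanity-checked with simple arithmetic. This is a rough sketch, not a sizing tool: it counts weights only, and it assumes the Q4_0 block layout used by GGML-style formats, where 32 weights occupy 18 bytes (4.5 bits per weight). The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return n_params * bits_per_weight / 8 / 1e9


# Q4_0 assumption: 32 weights stored in 18 bytes = 4.5 bits per weight.
q4_0_gb = weight_memory_gb(7e9, 4.5)
print(f"7B at 4.5 bits/weight: {q4_0_gb:.2f} GB")  # prints 3.94 GB
```

The gap between this estimate and the roughly 4.2 GB quoted above is consistent with the KV cache and runtime buffers, which this calculation deliberately leaves out.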

Storage performance:

Model files load from disk whenever the model is not already in memory. A 7 GB file on a mechanical HDD takes 60-90 seconds to load. An NVMe SSD loads the same file in 4-8 seconds. Store model files on the fastest available drive. Ollama's default storage location (~/.ollama/models on Linux, %USERPROFILE%\.ollama\models on Windows) sits on the system drive. You can move model storage:

# Move Ollama data to the D: drive (stop Ollama first so no model files are open)
sudo mv ~/.ollama /mnt/d/ollama_data
ln -s /mnt/d/ollama_data ~/.ollama

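This works inside WSL2, where the Windows D: drive is mounted at /mnt/d. WSL2 reaches Windows drives through a file-sharing layer, so reads from /mnt/d can be slower than reads from the Linux filesystem; check the load time after moving. For Ollama installed natively on Windows, set the OLLAMA_MODELS environment variable instead: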

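# Current PowerShell session only
$env:OLLAMA_MODELS = "D:\ollama_models"
# Persist for the current user; restart Ollama so it picks up the new path
[Environment]::SetEnvironmentVariable("OLLAMA_MODELS", "D:\ollama_models", "User")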
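The load times quoted earlier imply sequential read rates of roughly 78-117 MB/s for the HDD case (7 GB in 60-90 seconds) and 875-1750 MB/s for the NVMe case (7 GB in 4-8 seconds). To see where a particular drive falls, a simple timed read is often enough. The sketch below uses hypothetical helper names and a small temporary file as a stand-in; point `read_throughput_mb_s` at a real model file on the target drive for a meaningful number. A file that was read recently may be served from the operating system's cache, which inflates the result.

```python
import os
import tempfile
import time


def read_throughput_mb_s(path: str, chunk_size: int = 8 * 1024 * 1024) -> float:
    """Read a file sequentially and return throughput in MB/s."""
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while f.read(chunk_size):
            pass  # discard the data; only the elapsed time matters
    elapsed = time.perf_counter() - start
    return size / 1e6 / max(elapsed, 1e-9)


def estimated_load_seconds(file_gb: float, throughput_mb_s: float) -> float:
    """Estimate load time for a file of file_gb decimal gigabytes."""
    return file_gb * 1000 / throughput_mb_s


# Demo on a 4 MiB temporary file (stand-in for a real model file).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(4 * 1024 * 1024))
    demo_path = tmp.name
try:
    speed = read_throughput_mb_s(demo_path)
    print(f"measured: {speed:.0f} MB/s")
finally:
    os.remove(demo_path)

print(estimated_load_seconds(7, 100))  # 7 GB at 100 MB/s -> 70.0
```

Running the measurement against the same file on the old and new drive gives a direct before-and-after comparison for the move.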