13. Performance Optimization
Performance on Windows depends on five factors: GPU VRAM, GPU utilization, CPU-to-GPU bandwidth, RAM availability, and storage speed for model loading.
GPU utilization:
Verify the GPU is actually being used during inference:
nvidia-smi dmon -s u
# gpu   sm   mem   enc   dec
#   0   57    45     0     0
# sm = SM (compute) utilization in %, mem = memory-controller (bandwidth) utilization in %
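If SM stays below 50% during inference, the GPU is underused: the model is small relative to the GPU, part of the model is running on the CPU, or CPU-side preprocessing is the bottleneck. Run ollama ps to see how the loaded model is split between CPU and GPU. For small models (under 3B parameters), CPU inference may be as fast as or faster than GPU inference because transfer overhead makes up a larger share of the runtime, so benchmark both before deciding.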
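The 50% check above can be automated by averaging the sm column over several samples. The following is a minimal sketch, assuming the output layout of nvidia-smi dmon -s u; the helper names parse_dmon and average_sm are hypothetical, and every sample row except the first is made up for illustration.

```python
from statistics import mean

# Illustrative capture of `nvidia-smi dmon -s u`; only the first data row
# comes from this section, the other two are invented sample values.
SAMPLE = """\
# gpu     sm    mem    enc    dec
# Idx      %      %      %      %
    0     57     45      0      0
    0     61     48      0      0
    0     38     30      0      0
"""


def parse_dmon(text):
    """Return one dict per sample row, keyed by the header's column names."""
    columns = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            fields = line.lstrip("#").split()
            # The header line with "sm" names the columns; the units line is skipped.
            if columns is None and "sm" in fields:
                columns = fields
            continue
        values = line.split()
        if columns is None or len(values) != len(columns):
            continue
        # dmon prints "-" for metrics a GPU does not report; store those as None.
        rows.append({c: int(v) if v.isdigit() else None
                     for c, v in zip(columns, values)})
    return rows


def average_sm(rows):
    """Mean SM utilization in percent, or None if no usable samples exist."""
    values = [r["sm"] for r in rows if r.get("sm") is not None]
    return mean(values) if values else None


rows = parse_dmon(SAMPLE)
avg = average_sm(rows)
print(f"average SM utilization: {avg:.0f}%")
if avg < 50:
    print("GPU looks underused; check CPU/GPU split and preprocessing")
```

To use it on live data, capture a few seconds of dmon output to a file while a prompt is running and pass the file contents to parse_dmon.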
Batch size and context length:
Ollama's default settings work well for single-user interactive use. For batch processing, tune these parameters in a Modelfile:
FROM llama3.2:1b
PARAMETER num_batch 512
PARAMETER num_ctx 4096
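# num_gpu: number of model layers to offload to the GPU (a value above the layer count offloads all of them)
PARAMETER num_gpu 99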
Build the tuned model with ollama create my-tuned-model -f Modelfile, then run it with ollama run my-tuned-model.
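The same parameters can also be supplied per request through Ollama's REST API, which is convenient for trying values before committing them to a Modelfile. This is a sketch only: it assumes a local Ollama server on the default port 11434, and build_request and generate are hypothetical helper names.

```python
import json
import urllib.request

# Assumed default endpoint of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt, num_ctx=4096, num_batch=512):
    """Build a non-streaming /api/generate body with per-request options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "num_batch": num_batch},
    }


def generate(payload, url=OLLAMA_URL, timeout=300):
    """POST the payload and return the generated text (needs a running server)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


payload = build_request("llama3.2:1b", "Summarize: GPUs speed up inference.")
print(json.dumps(payload, indent=2))
```

Calling generate(payload) sends the request. Once a combination of values performs well, copy it into the Modelfile so every client gets the same settings.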
Quantization impact:
Lower-bit quantization (Q2, Q3) uses less VRAM but degrades output quality. Q4_0 is the practical minimum for most use cases. Q5_K_M gives near-Q6 quality at Q5 memory cost. A 7B Q4 model needs about 4.2 GB of VRAM for its weights; a 7B Q2 model needs about 2.9 GB but produces noticeably lower-quality output on tasks that require precise factual recall. The context (KV cache) needs additional memory on top of the weights, and that amount grows with num_ctx.
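The memory figures above follow from simple arithmetic: weight size is roughly parameter count times bits per weight, divided by 8. The sketch below backs out the effective bits per weight implied by the two 7B figures in this section; the function names are hypothetical and the results are estimates derived from those figures, not measurements.

```python
def effective_bits_per_weight(size_gb, params_billions):
    """Bits per weight implied by a model size in GB (1 GB = 1e9 bytes)."""
    return size_gb * 8 / params_billions


def weights_size_gb(params_billions, bits_per_weight):
    """Approximate weight size in GB for a parameter count and bit width."""
    return params_billions * bits_per_weight / 8


# Figures quoted in this section for a 7B model.
q4_bits = effective_bits_per_weight(4.2, 7)
q2_bits = effective_bits_per_weight(2.9, 7)
saving = 1 - 2.9 / 4.2

print(f"Q4: about {q4_bits:.1f} bits per weight")
print(f"Q2: about {q2_bits:.1f} bits per weight")
print(f"Q2 saves about {saving:.0%} of the Q4 weight memory")
```

The effective bit width comes out higher than the nominal 4 or 2 because quantized files also store scaling data and typically keep some tensors at higher precision.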
Storage performance:
Model files are read from disk the first time a model is loaded. A 7 GB file takes 60-90 seconds to load from a mechanical HDD, while an NVMe SSD loads the same file in 4-8 seconds, so store model files on the fastest available drive. Ollama's default storage location (~/.ollama/models on Linux, %USERPROFILE%\.ollama\models on Windows) sits on the system drive. To move model storage on Linux or inside WSL2, relocate the directory and leave a symlink in its place:
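# Move Ollama data to the D: drive (stop Ollama first so no model files are in use)
mv ~/.ollama /mnt/d/ollama_data
# Leave a symlink so Ollama still finds its data at the default path
ln -s /mnt/d/ollama_data ~/.ollama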
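This works inside WSL2, but WSL2 reaches Windows drives under /mnt through a slower file-sharing layer than its own virtual disk, so time the load before and after the move. For the native Windows build of Ollama, set the OLLAMA_MODELS environment variable instead, then restart Ollama so it picks up the new location: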
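# Current PowerShell session only
$env:OLLAMA_MODELS = "D:\ollama_models"
# Persist for the current user (applies to processes started afterwards)
[Environment]::SetEnvironmentVariable("OLLAMA_MODELS", "D:\ollama_models", "User")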
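Before and after changing the location, it helps to confirm which directory is in effect and how much data it holds. The sketch below approximates the lookup by assuming that OLLAMA_MODELS, when set, overrides a default of .ollama/models under the user's home directory; models_dir and dir_size_gb are hypothetical helpers, not part of Ollama.

```python
import os
from pathlib import Path


def models_dir(env=None, home=None):
    """Return the model directory: OLLAMA_MODELS if set, else <home>/.ollama/models."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    override = env.get("OLLAMA_MODELS")
    return Path(override) if override else home / ".ollama" / "models"


def dir_size_gb(path):
    """Total size of all files under path, in GB (1 GB = 1e9 bytes)."""
    path = Path(path)
    if not path.exists():
        return 0.0
    total = sum(f.stat().st_size for f in path.rglob("*") if f.is_file())
    return total / 1e9


# Example with explicit inputs so the result does not depend on this machine.
default_location = models_dir(env={}, home="/home/user")
moved_location = models_dir(env={"OLLAMA_MODELS": "/data/models"}, home="/home/user")
print(default_location)
print(moved_location)
```

Calling models_dir() with no arguments reads the real environment, and dir_size_gb(models_dir()) reports how much space the target drive needs.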
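Time a model load on your current storage with time ollama run llama3.2:1b "hi" (in PowerShell: Measure-Command { ollama run llama3.2:1b "hi" }). Passing a prompt makes the command exit instead of opening an interactive session, and adding --verbose prints the load duration separately from generation time. If the model is already in memory, unload it first with ollama stop llama3.2:1b. If the load takes over 10 seconds, move the model directory to an NVMe drive, time it again, and compare the two results.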
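The load times quoted in this section translate into read throughput, which makes measurements easier to compare across model sizes. The following sketch does that arithmetic and includes a small timing wrapper; implied_throughput_mb_s and time_command are hypothetical helpers, and the wrapper assumes the ollama CLI is on PATH.

```python
import subprocess
import time


def implied_throughput_mb_s(size_gb, seconds):
    """Average read rate in MB/s needed to load size_gb in the given time."""
    return size_gb * 1000 / seconds


def time_command(cmd):
    """Run a command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start


# Figures quoted in this section: a 7 GB file, 60-90 s on HDD, 4-8 s on NVMe.
hdd_range = (implied_throughput_mb_s(7, 90), implied_throughput_mb_s(7, 60))
nvme_range = (implied_throughput_mb_s(7, 8), implied_throughput_mb_s(7, 4))

print(f"HDD:  {hdd_range[0]:.0f}-{hdd_range[1]:.0f} MB/s")
print(f"NVMe: {nvme_range[0]:.0f}-{nvme_range[1]:.0f} MB/s")
```

For a real measurement, time_command(["ollama", "run", "llama3.2:1b", "hi"]) returns the total seconds for one cold run, including the short generation. Run it once per storage location with the model unloaded beforehand.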