CUDA out of memory when loading a model
Cause
The model you're loading needs more VRAM than your card has free. This is the single most common error in local AI. Causes:
- Model size (weights + KV cache + activation buffers) exceeds VRAM
- Another process is holding VRAM (background browser tab, prior Python session)
- Quantization format your runner does not handle natively (some runners pad Q4 weights to 8-bit at load time, so the model takes more VRAM than its file size suggests)
- Context window set higher than VRAM can support
Solution
1. Free other VRAM. Close browser tabs (Chrome eats ~1 GB), close other AI apps, and kill stale Python processes. Run nvidia-smi to see which processes are holding VRAM, then stop the offender with kill <PID>.
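As a rough sketch of that check, the snippet below asks nvidia-smi for its compute processes and sorts them by VRAM use, largest first. The helper name vram_hogs is made up for this example. The query only covers compute processes; graphics clients such as browsers generally appear only in the process table printed by a plain nvidia-smi call.

```shell
# vram_hogs: read CSV rows "pid, process_name, used_memory" on stdin
# (no header, no units) and print "PID  MiB  name", largest first.
vram_hogs() {
  sort -t, -k3,3nr | awk -F', *' '{ printf "%s\t%s MiB\t%s\n", $1, $3, $2 }'
}

# Only query the driver if nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-compute-apps=pid,process_name,used_memory \
             --format=csv,noheader,nounits | vram_hogs
fi
```

The first row is the process to look at before reaching for kill <PID>.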
2. Use a smaller quantization. If you're on Q5_K_M or Q8_0, drop to Q4_K_M. The quality loss is real but small. The weights shrink by roughly 15% coming from Q5_K_M and by a little over 40% coming from Q8_0; the KV cache is not affected.
# Ollama
ollama pull qwen2.5:7b-instruct-q4_K_M
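To see what a quantization change buys before downloading anything, weight size can be estimated as parameters times bits per weight. This is a back-of-the-envelope sketch: weights_gb is a hypothetical helper, the 5.7 and 4.85 bits-per-weight figures for Q5_K_M and Q4_K_M are approximations that vary by model, and the result excludes KV cache and activation buffers.

```shell
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Approximate size of the weights alone, in GB (10^9 bytes).
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 7 8.5    # Q8_0
weights_gb 7 5.7    # Q5_K_M (approximate bits per weight)
weights_gb 7 4.85   # Q4_K_M (approximate bits per weight)
```

For a 7B model this comes out to about 7.4, 5.0 and 4.2 GB respectively.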
3. Reduce context window. A 7B model at 4K context fits in 8 GB; the same model at 32K context needs 12+ GB because of KV cache growth.
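The KV cache grows linearly with context length, which the arithmetic below makes concrete. It is a sketch under stated assumptions: kv_cache_mib is a hypothetical helper, the cache is stored in 16-bit values, and the shape used (32 layers, 32 KV heads, head dimension 128) is an illustrative 7B-class model without grouped-query attention. Models that use grouped-query attention have fewer KV heads and a proportionally smaller cache.

```shell
# kv_cache_mib LAYERS KV_HEADS HEAD_DIM CONTEXT [BYTES_PER_VALUE]
# Size of the K and V tensors for one sequence, in MiB.
kv_cache_mib() {
  local layers=$1 kv_heads=$2 head_dim=$3 ctx=$4 bytes=${5:-2}
  # Factor 2: one K and one V vector per token, per layer, per KV head.
  echo $(( 2 * layers * kv_heads * head_dim * ctx * bytes / 1024 / 1024 ))
}

kv_cache_mib 32 32 128 4096    # 4K context  -> prints 2048
kv_cache_mib 32 32 128 32768   # 32K context -> prints 16384
```

Under these assumptions the cache alone goes from 2 GiB to 16 GiB, on top of the weights.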
4. Use CPU offload. Move some layers to system RAM. Speed drops but the model fits.
# llama.cpp (older builds name this binary ./main)
./llama-cli --n-gpu-layers 28 --model model.gguf
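One way to pick a starting value for --n-gpu-layers is to assume every layer costs about the same and see how many fit in the VRAM that is free. This is only a heuristic sketch: gpu_layers_that_fit is a hypothetical helper, and the 1024 MiB default reserve for KV cache and buffers is an assumption that should grow with context size.

```shell
# gpu_layers_that_fit FREE_MIB MODEL_MIB N_LAYERS [RESERVE_MIB]
# Rough count of layers that fit on the GPU, assuming equal-sized layers.
gpu_layers_that_fit() {
  local free=$1 model=$2 layers=$3 reserve=${4:-1024}
  local usable=$(( free - reserve ))
  if [ "$usable" -lt 0 ]; then usable=0; fi
  local n=$(( usable * layers / model ))
  # Never report more layers than the model has.
  if [ "$n" -gt "$layers" ]; then n=$layers; fi
  echo "$n"
}

gpu_layers_that_fit 6000 8000 32   # 6000 MiB free, 8000 MiB model -> prints 19
```

If the load still fails at the suggested value, lower it by a few layers and try again.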
5. Pick a smaller model. Use "Will it run?" to find a model that fits comfortably on your hardware instead of fighting one that doesn't.
Alternative solutions
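If you got the error partway through a long-running PyTorch session, setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True sometimes recovers fragmented memory. The setting only affects PyTorch's CUDA allocator, so it does nothing on macOS, which has no CUDA support. Restarting the process is usually faster.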
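A minimal sketch of applying that setting, assuming a PyTorch-based runner on an NVIDIA GPU. PyTorch reads the variable when it sets up its CUDA allocator, so it has to be in the environment before the Python process starts; run_model.py is a placeholder name.

```shell
# Must be exported before the Python process starts.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Then launch the runner from the same shell, for example:
# python run_model.py   (placeholder script name)
```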
Did this fix it?
If your case was different, contact support with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.