

Verified by owner

CUDA out of memory when loading a model


torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate X.XX GiB

By Fredoline Eruo · Last verified Jun 12, 2026

Cause

The model you're loading needs more VRAM than your card has free. This is the single most common error in local AI. Causes:

Model size (weights + KV cache + activation buffers) exceeds VRAM
Another process is holding VRAM (background browser tab, prior Python session)
Quantization format not handled natively by the runner you're using (some runners pad to 8-bit even for Q4 models, so the model takes more VRAM than its file size suggests)
Context window set higher than VRAM can support

Solution

1. Free other VRAM. Close browser tabs (Chrome eats ~1 GB), close other AI apps, kill stale Python processes (nvidia-smi shows what's using VRAM, kill the offender with kill <PID>).
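To rank the processes holding VRAM without reading the full nvidia-smi table, its CSV query mode can be parsed in a few lines. This is a minimal sketch, assuming an NVIDIA driver with nvidia-smi on the PATH; the helper names parse_compute_apps and list_vram_users are illustrative, not part of any library. The query lists compute processes only, so graphics clients such as browsers may still need to be read off the plain nvidia-smi table.

```python
import subprocess

# Per-process query; prints one "pid, name, used MiB" row per compute process.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]

def parse_compute_apps(csv_text):
    """Parse nvidia-smi CSV rows into (pid, name, used_mib), largest first."""
    rows = []
    for line in csv_text.strip().splitlines():
        try:
            pid, rest = line.split(",", 1)
            name, used = rest.rsplit(",", 1)
            rows.append((int(pid), name.strip(), int(used)))
        except ValueError:
            continue  # skip rows that do not parse, e.g. "[N/A]" memory
    return sorted(rows, key=lambda r: r[2], reverse=True)

def list_vram_users():
    """Run nvidia-smi and return the parsed rows (needs an NVIDIA GPU)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(out.stdout)
```

Calling list_vram_users() on a machine with a GPU returns the heaviest process first, which is usually the one to close or kill.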

2. Use a smaller quantization. If you're on Q5_K_M or Q8_0, drop to Q4_K_M. The quality loss is real but small; the weights shrink by roughly 15% coming from Q5_K_M and by roughly 40% coming from Q8_0.

# Ollama
ollama pull qwen2.5:7b-instruct-q4_K_M
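How much a smaller quantization saves depends on where you start. The sketch below estimates weight size from parameter count and bits per weight. The bits-per-weight figures are approximate averages assumed for illustration, real GGUF files vary by model, and the function names are hypothetical.

```python
# Approximate average bits per weight for common GGUF quantizations.
# Assumed round figures for illustration; actual files differ slightly.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5}

def weight_gib(n_params, quant):
    """Estimated size of the weights alone, in GiB."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 2**30

def savings_pct(n_params, from_quant, to_quant):
    """Percentage of weight memory saved by switching quantization."""
    before = weight_gib(n_params, from_quant)
    after = weight_gib(n_params, to_quant)
    return 100 * (before - after) / before

# Example: a model with 7.6 billion parameters (assumed count).
for source in ("Q5_K_M", "Q8_0"):
    print(source, "-> Q4_K_M:", round(savings_pct(7.6e9, source, "Q4_K_M")), "% saved")
```

This covers weights only; the KV cache and activation buffers come on top and are not reduced by quantizing the weights.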

3. Reduce context window. A 7B model at Q4 with 4K context fits in 8 GB; the same model at 32K context can need 12+ GB because the KV cache grows linearly with context length. How much it grows depends on the model's attention layout.
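The KV cache growth can be estimated directly: the cache stores one key and one value vector per layer, per KV head, per token. The sketch below assumes a 16-bit cache and uses an illustrative shape (32 layers, 32 KV heads, head dimension 128) typical of a 7B model with full multi-head attention; models with grouped-query attention have fewer KV heads and a proportionally smaller cache. The function name is hypothetical.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache size in GiB for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes an fp16/bf16 cache.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 2**30

# Illustrative 7B shape without grouped-query attention (assumed values).
print(kv_cache_gib(32, 32, 128, 4096))    # 4K context
print(kv_cache_gib(32, 32, 128, 32768))   # 32K context, 8x larger

# Same shape with 8 KV heads (grouped-query attention): 4x smaller cache.
print(kv_cache_gib(32, 8, 128, 32768))
```

Because the size is linear in context_len, halving the context halves the cache, which is often enough to get a borderline model to load.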

4. Use CPU offload. Move some layers to system RAM. Speed drops but the model fits.

# llama.cpp (the CLI binary is llama-cli; older builds named it main)
./llama-cli --n-gpu-layers 28 --model model.gguf
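A starting value for --n-gpu-layers can be guessed from the model file size and the free VRAM. This is a rough heuristic, not how llama.cpp itself decides: it assumes all layers are the same size and reserves a fixed headroom for the KV cache and buffers. The function name and the 1.5 GiB default reserve are assumptions for illustration.

```python
def layers_that_fit(model_file_gib, n_layers, free_vram_gib, reserve_gib=1.5):
    """Estimate how many layers to offload to the GPU.

    Assumes equally sized layers and keeps reserve_gib free for the
    KV cache and scratch buffers (an assumed, adjustable headroom).
    """
    per_layer_gib = model_file_gib / n_layers
    budget_gib = free_vram_gib - reserve_gib
    if budget_gib <= 0:
        return 0  # nothing fits; run fully on CPU
    return min(n_layers, int(budget_gib // per_layer_gib))

# Example: an 8 GiB model file with 32 layers and 6 GiB of free VRAM.
print(layers_that_fit(8.0, 32, 6.0))
```

If loading still fails with the estimated value, lower it by a few layers and retry; if it loads with VRAM to spare, raise it.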

5. Pick a smaller model. Use the "Will it run?" tool (/will-it-run) to find a model that fits comfortably on your hardware instead of fighting one that doesn't.

Alternative solutions

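If the error appeared partway through a long-running PyTorch session rather than at load time: starting the process with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True sometimes avoids the memory fragmentation behind it. Restarting the process is usually faster. This setting only affects PyTorch's CUDA allocator; macOS has no CUDA support, so it does nothing there.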
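The allocator setting can also be applied from inside a script instead of the shell. A minimal sketch, assuming PyTorch on an NVIDIA GPU; the variable should be set before torch is imported so the CUDA allocator sees it when it initialises.

```python
import os

# Set before the first `import torch` in this process.
# setdefault keeps any value already exported in the shell.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

# import torch  # import only after the variable is set
```

Setting the variable after CUDA has already been initialised in the same process should not be expected to change anything.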


Did this fix it?

If your case was different, contact support with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.