RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

Out of memory
Verified by owner

CUDA out of memory when loading a model

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate X.XX GiB
By Fredoline Eruo · Last verified Jun 12, 2026

Cause

The model you're loading needs more VRAM than your card has free. This is the single most common error in local AI. Causes:

  • Model size (weights + KV cache + activation buffers) exceeds VRAM
  • Another process is holding VRAM (background browser tab, prior Python session)
  • Quantization format the runner cannot use natively, so it expands the weights in memory (some runners pad to 8-bit even for Q4 models)
  • Context window set higher than VRAM can support

Solution

1. Free other VRAM. Close browser tabs (Chrome can hold around 1 GB), close other AI apps, and kill stale Python processes. nvidia-smi shows which processes are using VRAM; kill the offender with kill <PID>.
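To see the offenders sorted by size, the nvidia-smi query mode can be parsed with a few lines of Python. This is a minimal sketch: the helper names parse_compute_apps and list_vram_users are hypothetical, and it assumes nvidia-smi is on your PATH.

```python
import subprocess

def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' CSV rows into dicts, largest first."""
    rows = []
    for line in csv_text.strip().splitlines():
        if line.count(",") < 2:
            continue  # skip blank or unexpected lines
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        pid, name, used = pid.strip(), name.strip(), used.strip()
        if not pid.isdigit():
            continue
        # used_memory may be reported as "[N/A]"; treat that as 0 MiB
        mib = int(used) if used.isdigit() else 0
        rows.append({"pid": int(pid), "name": name, "used_mib": mib})
    return sorted(rows, key=lambda r: r["used_mib"], reverse=True)

def list_vram_users():
    """Run nvidia-smi and return compute processes holding VRAM."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_compute_apps(out)
```

Calling list_vram_users() returns a list you can print or filter. The query covers compute processes only, so graphics clients such as a browser may only show up in the plain nvidia-smi table.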

2. Use a smaller quantization. If you're on Q5_K_M or Q8_0, drop to Q4_K_M. The quality loss is real but small; the weights shrink by roughly 40% coming from Q8_0 and roughly 15% coming from Q5_K_M.

# Ollama
ollama pull qwen2.5:7b-instruct-q4_K_M
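The effect of a quantization change on weight size is simple arithmetic. The sketch below uses approximate bits-per-weight figures that are assumptions for illustration; real GGUF files vary by model, and the function names are hypothetical.

```python
# Approximate bits per weight for common GGUF quantization types.
# Assumed round figures for illustration; real files vary by model.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

def weight_gib(n_params, quant):
    """Estimated size of the weights alone, in GiB."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 2**30

def savings_pct(n_params, from_quant, to_quant="Q4_K_M"):
    """Percentage reduction in weight size when switching quantization."""
    before = weight_gib(n_params, from_quant)
    after = weight_gib(n_params, to_quant)
    return 100 * (before - after) / before

for quant in ("Q8_0", "Q5_K_M"):
    print(f"{quant} -> Q4_K_M: {savings_pct(7e9, quant):.0f}% smaller weights")
```

This covers weights only. KV cache and activation buffers come on top and do not shrink with weight quantization.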

3. Reduce context window. The KV cache grows linearly with context length, so a 7B model that fits in 8 GB at 4K context can need 12+ GB at 32K. Set a smaller context with --ctx-size in llama.cpp or the num_ctx parameter in Ollama.
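The KV cache growth can be estimated from the model's shape. This is a rough sketch: the layer count, KV head count, and head dimension below are hypothetical values for a 7B-class model, so read the real ones from your model's config.

```python
def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size in GiB: one K and one V tensor per layer, per token."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elems * bytes_per_elem / 2**30

# Hypothetical 7B-class shape: 32 layers, head dimension 128, FP16 cache.
for n_ctx in (4096, 32768):
    full = kv_cache_gib(n_ctx, n_layers=32, n_kv_heads=32, head_dim=128)
    gqa = kv_cache_gib(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128)
    print(f"{n_ctx:>6} tokens: {full:.1f} GiB (32 KV heads), {gqa:.1f} GiB (8 KV heads)")
```

Models that use grouped-query attention have fewer KV heads and therefore a much smaller cache at the same context length, which is why the exact threshold differs between 7B models.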

4. Use CPU offload. Keep only some layers on the GPU and run the rest from system RAM. Speed drops but the model fits.

# llama.cpp (current builds name the binary llama-cli; older builds call it main)
./llama-cli --n-gpu-layers 28 --model model.gguf
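A starting value for --n-gpu-layers can be estimated from free VRAM. The sketch below assumes all layers are the same size and reserves a fixed amount for KV cache and buffers; both are simplifications, and gpu_layers_that_fit is a hypothetical helper, not part of llama.cpp.

```python
def gpu_layers_that_fit(free_vram_gib, model_gib, n_layers, reserve_gib=1.5):
    """Estimate how many layers fit on the GPU, assuming equal-sized layers."""
    per_layer = model_gib / n_layers
    budget = free_vram_gib - reserve_gib  # keep room for KV cache and buffers
    if budget <= 0:
        return 0
    return min(n_layers, int(budget // per_layer))

# Example: 6 GiB free, an 8 GiB model file with 32 layers.
print(gpu_layers_that_fit(free_vram_gib=6, model_gib=8, n_layers=32))
```

Treat the result as a first guess. If loading still fails, lower the number by a few layers and try again.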

5. Pick a smaller model. Use the Will it run? tool to find a model that fits comfortably on your hardware instead of fighting one that doesn't.

Alternative solutions

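If you're running PyTorch on an NVIDIA GPU and the error appeared partway through a long-running session, setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True before launching the process sometimes reduces allocator fragmentation enough to avoid it. The setting is specific to PyTorch's CUDA allocator, so it has no effect on macOS, where there is no CUDA backend. Restarting the process is usually faster.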
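If you launch from a Python script rather than a shell, the variable can be set in code. This is a minimal sketch with a hypothetical helper name. It assumes the safest place to set the variable is before torch is imported, so that it is in place before the first CUDA allocation.

```python
import os

def enable_expandable_segments():
    """Add expandable_segments:True to PYTORCH_CUDA_ALLOC_CONF, keeping other options."""
    key = "PYTORCH_CUDA_ALLOC_CONF"
    options = [o for o in os.environ.get(key, "").split(",") if o]
    if not any(o.startswith("expandable_segments:") for o in options):
        options.append("expandable_segments:True")
    os.environ[key] = ",".join(options)
    return os.environ[key]

enable_expandable_segments()
# import torch  # import torch only after the variable is set
```

Existing allocator options in the variable are preserved, and calling the helper twice does not duplicate the entry.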

Related errors

  • Ollama: model requires more system memory than is available
  • SGLang: RadixAttention KV cache overflow / out of memory
  • CUDA OOM that only happens at long context (KV cache blowup)
  • vLLM AsyncEngineDeadError after large batch / OOM
  • Process killed (OOM killer) when loading large model
