RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Troubleshooting Local AI

04. OOM Errors

Chapter 4 of 15 · 20 min
KEY INSIGHT

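Exit code 137 means the process was killed with SIGKILL (128 + 9). On Linux the most common sender is the kernel's OOM killer, but a manual `kill -9` or a container runtime's stop timeout produces the same code. It is not a Python error or a model error; the process was terminated from outside. Check `dmesg` to confirm an OOM kill and `free -h` to see how much headroom existed.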
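The arithmetic behind 137 can be sketched in a few lines. This is a minimal illustration, assuming the POSIX shell convention of reporting "killed by signal N" as 128 + N. The helper name `describe_exit_code` is hypothetical, not part of any tool in this chapter.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a process exit status into a readable description."""
    if code == 0:
        return "success"
    # Shells report "killed by signal N" as 128 + N.
    # Python's subprocess module reports the same event as -N.
    # Caveat: a program can also call exit() with a value above 128 on its own.
    if code > 128:
        signum = code - 128
    elif code < 0:
        signum = -code
    else:
        return f"exited with status {code}"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = f"signal {signum}"
    return f"killed by {name}"


print(describe_exit_code(137))
```

On Linux this reports SIGKILL for 137. It says the process was killed, not who killed it, which is why the `dmesg` check is still needed.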

Types of OOM Errors

Three different "out of memory" errors require different fixes:

  1. GPU OOM (CUDA out of memory): VRAM exhausted during inference or training
  2. CPU OOM (Killed in dmesg, exit code 137): System RAM exhausted
  3. Swap OOM: System using swap heavily, causing latency spikes
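To tell the third case apart from the second, it can help to read swap usage directly. The sketch below assumes a Linux host that exposes `/proc/meminfo`. The function names are hypothetical and any threshold you pick for "heavy" swap use is a judgment call.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_used_fraction(fields: dict[str, int]) -> float:
    """Fraction of swap in use; 0.0 when no swap is configured."""
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - fields.get("SwapFree", 0)) / total


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        print(f"MemAvailable: {info.get('MemAvailable', 0) / 1e6:.1f} GB")
        print(f"Swap in use: {swap_used_fraction(info):.0%}")
    except OSError:
        print("/proc/meminfo not available on this system")
```

A high swap fraction together with a low `MemAvailable` suggests the latency spikes come from paging and not from the model itself.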

Diagnosing GPU OOM

# Monitor GPU memory in real-time during inference
watch -n 0.5 nvidia-smi

Common causes:

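Model too large for VRAM: A 13B parameter model in FP16 requires ~26GB VRAM for the weights alone (2 bytes per parameter). A 70B model requires ~140GB. Quantization reduces this: 8-bit roughly halves the weight memory, and Q4_K_M, at roughly 4.5 to 5 bits per weight, cuts it to about 30% of FP16. The KV cache and activations need additional VRAM on top of the weights.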
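The weight-memory figures follow from parameters × bits per weight ÷ 8. Here is a minimal sketch of that arithmetic. The 4.85 bits per weight used for Q4_K_M is an assumed average, not a figure from this chapter, and the result excludes KV cache and activations.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for model weights alone, in decimal GB (1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


Q4_K_M_BITS = 4.85  # assumption: approximate average bits per weight

for name, params in [("13B", 13), ("70B", 70)]:
    fp16 = weight_memory_gb(params, 16)
    q8 = weight_memory_gb(params, 8)
    q4 = weight_memory_gb(params, Q4_K_M_BITS)
    print(f"{name}: FP16 ~{fp16:.0f} GB, 8-bit ~{q8:.0f} GB, Q4_K_M ~{q4:.1f} GB")
```

Compare the result against the total reported by `nvidia-smi`, and leave room for the context window before deciding a model fits.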

# Check how much VRAM PyTorch is using on the current GPU
import torch
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")  # live tensors
print(f"Reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")  # allocated plus PyTorch's cache

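Batch size too large: The weights take the same memory at any batch size, but activation and KV cache memory grow roughly linearly with batch size and sequence length. Cutting the batch from 8 to 2 cuts that portion to about a quarter.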
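To see how batch size affects memory, the KV cache can be estimated from the model shape. The sketch below assumes a Llama-2-7B-like configuration (32 layers, 32 KV heads, head dimension 128, FP16 cache) purely for illustration. The function name is hypothetical and real runtimes add activation and allocator overhead on top.

```python
def kv_cache_bytes(batch: int, seq_len: int, n_layers: int,
                   n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of the key/value cache for a full batch at a given sequence length."""
    # Leading 2 = one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Assumed shape for illustration only; substitute your model's config values.
SHAPE = dict(n_layers=32, n_kv_heads=32, head_dim=128)

for batch in (8, 2):
    gib = kv_cache_bytes(batch, seq_len=4096, **SHAPE) / 2**30
    print(f"batch={batch}: {gib:.1f} GiB of KV cache at 4096 tokens")
```

Under these assumptions, going from a batch of 8 to a batch of 2 frees about 12 GiB, while the weight memory is unchanged.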

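# Reduce batch size from 8 to 2: generate 2 rows at a time instead of all 8
# (split any attention_mask the same way)
for chunk in input_ids.split(2):
    outputs = model.generate(chunk, max_new_tokens=100, do_sample=True, num_beams=1)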

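KV cache not released: Some inference loops keep references to the KV cache between requests (for example, stored `past_key_values` or output tensors that are never dropped), so memory usage accumulates over time until an allocation fails.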
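One way to spot this pattern is to record allocated memory after every request and check whether it only ever goes up. The sketch below is a plain-Python heuristic. The function name and the 64 MiB threshold are assumptions. In a PyTorch loop the samples would come from `torch.cuda.memory_allocated()`.

```python
def looks_like_leak(samples: list[int], min_growth: int = 64 * 1024**2) -> bool:
    """Heuristic: True if memory never drops between requests and the
    total growth from first to last sample is at least min_growth bytes."""
    if len(samples) < 3:
        return False  # too few points to call it a trend
    never_drops = all(later >= earlier for earlier, later in zip(samples, samples[1:]))
    return never_drops and (samples[-1] - samples[0]) >= min_growth


MB = 1024**2
# Illustrative numbers, not measurements: bytes allocated after each request.
print(looks_like_leak([100 * MB, 180 * MB, 260 * MB, 340 * MB]))
print(looks_like_leak([100 * MB, 180 * MB, 100 * MB, 180 * MB]))
```

If the check fires, drop the references to per-request tensors, run `gc.collect()`, and then call `torch.cuda.empty_cache()`. The last call only returns blocks that nothing references any more, so it does not help while the references are still held.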

Diagnosing CPU OOM

# Check system memory usage
free -h
# Check which processes use most memory
ps aux --sort=-%mem | head -20
# Check dmesg for OOM killer
sudo dmesg | grep -i "out of memory"
sudo dmesg | grep -i "killed process"
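The `grep` output can be turned into structured records if you want to log which process was killed and how much memory it held. This is a sketch under an assumption about the kernel's log wording, which varies between kernel versions. The sample line is illustrative, not captured from a real machine, and `parse_oom_kills` is a hypothetical helper.

```python
import re

# Matches "Killed process <pid> (<name>)" and, when present, the anon-rss field.
KILL_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r"(?:.*?anon-rss:(?P<rss_kb>\d+)kB)?"
)


def parse_oom_kills(log_text: str) -> list[dict]:
    """Extract OOM-kill records from dmesg-style text."""
    kills = []
    for line in log_text.splitlines():
        m = KILL_RE.search(line)
        if m:
            rss = m.group("rss_kb")
            kills.append({
                "pid": int(m.group("pid")),
                "name": m.group("name"),
                # The kernel reports kB in units of 1024 bytes.
                "anon_rss_gib": int(rss) / 1024**2 if rss else None,
            })
    return kills


SAMPLE = ("[12345.678901] Out of memory: Killed process 4242 (python3) "
          "total-vm:52428800kB, anon-rss:31457280kB, file-rss:0kB, shmem-rss:0kB")
print(parse_oom_kills(SAMPLE))
```

To run it on a live system, capture the output of `sudo dmesg` to a file or pipe and pass the text to `parse_oom_kills`.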

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Load a model and run inference while monitoring memory with `watch -n 0.5 nvidia-smi`. Note memory usage before, during, and after inference. Check `free -h` on the host. This baseline tells you your available headroom before OOM occurs.
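For a written record of the before, during, and after readings, the same numbers can be sampled from a script. The sketch below assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output. The function names are hypothetical, and some devices report `[N/A]` for these fields, which this parser does not handle.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' pairs in MiB, one line per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(value) for value in line.split(","))
        rows.append((used, total))
    return rows


def sample_gpu_memory() -> list[tuple[int, int]]:
    """Run nvidia-smi once and return (used_mib, total_mib) for each GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Call `sample_gpu_memory()` before loading the model, during generation, and after the request finishes. The headroom for each GPU is `total - used`.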
