OOM Errors — Troubleshooting Local AI (Chapter 4)

Types of OOM Errors

Three different "out of memory" errors require different fixes:

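GPU OOM (CUDA out of memory): VRAM exhausted during inference or training
CPU OOM (Killed in dmesg, exit code 137): System RAM exhausted, so the kernel OOM killer terminates the process with SIGKILL
Swap OOM: RAM is nearly full and the system swaps heavily, causing latency spikes (a severe slowdown rather than a crash)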
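
The exit code 137 mentioned above follows the shell convention of 128 plus the signal number. A minimal sketch of that arithmetic, assuming a Linux host where signal 9 is SIGKILL (the helper name describe_exit_code is hypothetical):

```python
import signal

def describe_exit_code(code: int) -> str:
    """Explain a shell-style exit code; values above 128 mean 'killed by signal (code - 128)'."""
    if code > 128:
        try:
            sig = signal.Signals(code - 128)
        except ValueError:
            return f"exited with status {code} (no matching signal)"
        return f"killed by {sig.name}"
    return f"exited with status {code}"

# 137 - 128 = 9, the signal the kernel OOM killer sends
print(describe_exit_code(137))
```

Note that SIGKILL alone does not prove an OOM kill; confirming it still requires the dmesg check described later in this chapter. When a process is launched through Python's subprocess module, the same event is reported as a returncode of -9 rather than 137.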

Diagnosing GPU OOM

# Monitor GPU memory in real time during inference
watch -n 0.5 nvidia-smi

Common causes:

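Model too large for VRAM: In FP16 each parameter takes 2 bytes, so a 13B-parameter model needs ~26 GB of VRAM for the weights alone and a 70B model needs ~140 GB. Quantization reduces this: Q4_K_M stores weights at roughly 4.5 to 5 bits each, cutting weight memory to about 30% of the FP16 size (around 8 GB for a 13B model).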
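
The arithmetic behind these figures can be sketched as parameters times bits per weight. This covers weights only, not activations or the KV cache, and the 4.8 bits-per-weight figure for Q4_K_M is an approximate assumption rather than an exact value (the function name weight_memory_gb is hypothetical):

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory needed for model weights only, in decimal GB (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

print(f"13B FP16:   {weight_memory_gb(13e9, 16):.0f} GB")
print(f"70B FP16:   {weight_memory_gb(70e9, 16):.0f} GB")
# Assumption: ~4.8 bits per weight as a rough average for Q4_K_M
print(f"13B Q4_K_M: {weight_memory_gb(13e9, 4.8):.1f} GB")
```

Leave headroom on top of this estimate: a model whose weights barely fit in VRAM will usually still hit OOM once inference starts allocating working memory.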

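# Check how much VRAM PyTorch tensors occupy in this process (run after loading the model)
import torch
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")  # includes free blocks held by the caching allocator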

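Batch size too large: Weight memory is fixed, but activation and KV cache memory grow roughly linearly with batch size (and with sequence length), so a large batch can exhaust VRAM even when the model itself fits.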

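# Reduce batch size from 8 to 2: generate in chunks of 2 rows (input_ids assumed to have shape [8, seq_len])
for chunk in input_ids.split(2):
    output_ids = model.generate(chunk, max_new_tokens=100, do_sample=True, num_beams=1)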
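
To see why batch size matters, the KV cache for a standard transformer decoder can be estimated as two tensors (keys and values) per layer, each scaling with batch size and sequence length. The sketch below is an estimate under assumptions: the layer count, head count, and head dimension are a hypothetical 13B-class configuration, not values taken from a specific model card.

```python
def kv_cache_bytes(batch_size, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size; bytes_per_value=2 corresponds to FP16."""
    # Factor 2: one key tensor and one value tensor per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Hypothetical 13B-class config (assumed): 40 layers, 40 KV heads, head_dim 128
for bs in (8, 2):
    gb = kv_cache_bytes(bs, seq_len=2048, n_layers=40, n_kv_heads=40, head_dim=128) / 1e9
    print(f"batch={bs}: {gb:.2f} GB of KV cache")
```

Because the formula is linear in batch_size, dropping from 8 to 2 cuts this component to a quarter, while weight memory stays unchanged.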

KV cache not released: Some inference loops fail to release the KV cache between requests, accumulating memory usage over time.
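
One way to guard against this is to clean up after every request, whether it succeeds or fails. The following is a minimal sketch, not a definitive fix: run_request and generate_fn are hypothetical names standing in for your own inference loop, and the torch calls run only if PyTorch with CUDA is present.

```python
import gc

def run_request(generate_fn, prompt):
    """Run one request, then release memory that is no longer referenced."""
    try:
        return generate_fn(prompt)
    finally:
        # Collect unreachable Python objects that may still hold GPU tensors
        gc.collect()
        try:
            import torch
            if torch.cuda.is_available():
                # Return unused cached blocks to the driver; tensors that are
                # still referenced elsewhere are NOT freed by this call
                torch.cuda.empty_cache()
        except ImportError:
            pass

# Usage with a stand-in generate function
print(run_request(lambda p: p.upper(), "hello"))
```

If memory still climbs from request to request, something is holding a reference to the cache (for example a list of past outputs), and that reference has to be dropped before any cleanup call can help.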

Diagnosing CPU OOM

# Check system memory usage
free -h
# Check which processes use the most memory
ps aux --sort=-%mem | head -20
# Check dmesg for OOM killer
sudo dmesg | grep -i "out of memory"
sudo dmesg | grep -i "killed process"
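
The same check can be scripted so an inference service can warn before the OOM killer acts. This sketch reads /proc/meminfo (Linux only) and flags low available memory. The 10% threshold and the helper names parse_meminfo and memory_pressure are assumptions for illustration, not recommended values.

```python
import os

def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: integer value} (sizes are in kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info

def memory_pressure(info: dict, min_available_frac: float = 0.10) -> bool:
    """True when MemAvailable drops below the assumed fraction of MemTotal."""
    return info["MemAvailable"] < min_available_frac * info["MemTotal"]

if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print(f"Available: {info['MemAvailable'] / 1e6:.1f} GB, "
          f"swap free: {info.get('SwapFree', 0) / 1e6:.1f} GB, "
          f"under pressure: {memory_pressure(info)}")
```

Low MemAvailable combined with shrinking SwapFree is the pattern that typically precedes either heavy swapping or an OOM kill.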

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.