04. OOM Errors
Types of OOM Errors
Three different "out of memory" errors require different fixes:
- GPU OOM (CUDA out of memory): VRAM exhausted during inference or training
- CPU OOM (Killed in dmesg, exit code 137): System RAM exhausted
- Swap OOM: System using swap heavily, causing latency spikes
Diagnosing GPU OOM
# Monitor GPU memory in real-time during inference
watch -n 0.5 nvidia-smi
Common causes:
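Model too large for VRAM: In FP16 (2 bytes per parameter), a 13B-parameter model needs ~26GB of VRAM for its weights alone, and a 70B model needs ~140GB. Quantization reduces this: 8-bit roughly halves the weight footprint, and 4-bit formats such as Q4_K_M bring it down to roughly a third of the FP16 size.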
# Check how much VRAM a loaded model uses
import torch
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
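The model-size figures above follow from simple arithmetic: parameter count times bytes per parameter. The sketch below reproduces them; `estimate_weight_vram_gb` is a hypothetical helper, and it counts weights only, so the KV cache, activations, and framework overhead come on top of its result.

```python
def estimate_weight_vram_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Compare weight footprints at different precisions (weights only, no KV cache)
for name, params in [("13B", 13e9), ("70B", 70e9)]:
    fp16 = estimate_weight_vram_gb(params, 16)
    int8 = estimate_weight_vram_gb(params, 8)
    int4 = estimate_weight_vram_gb(params, 4)
    print(f"{name}: FP16 ~{fp16:.0f} GB, 8-bit ~{int8:.0f} GB, 4-bit ~{int4:.1f} GB")
```

Real quantized formats store scales and other metadata alongside the weights, so their effective bits per weight are usually somewhat higher than the nominal figure.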
Batch size too large: Activation and KV cache memory grow roughly in proportion to batch size (and sequence length), on top of the fixed cost of the model weights.
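# Reduce batch size from 8 to 2: generate in chunks of 2 sequences instead of all 8 at once
outputs = [model.generate(chunk, max_new_tokens=100, do_sample=True, num_beams=1)
           for chunk in input_ids.split(2)]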
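To see why a smaller batch helps, it can be useful to estimate the KV cache, which grows linearly with batch size and sequence length. The sketch below assumes a standard transformer that stores one key and one value tensor per layer in FP16. The `kv_cache_bytes` helper and the model shape are hypothetical values chosen for illustration, not measurements of any particular model.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Estimate KV cache size in bytes.

    Each layer stores 2 tensors (keys and values), each of shape
    [batch_size, num_kv_heads, seq_len, head_dim].
    """
    return (2 * num_layers * batch_size * num_kv_heads
            * seq_len * head_dim * bytes_per_value)


# Hypothetical model shape, for illustration only
shape = dict(num_layers=40, num_kv_heads=40, head_dim=128)

for batch_size in (8, 2):
    gb = kv_cache_bytes(batch_size, seq_len=2048, **shape) / 1e9
    print(f"batch_size={batch_size}: ~{gb:.1f} GB of KV cache")
```

Under these assumptions, going from a batch of 8 to a batch of 2 cuts the cache to a quarter, while the weight memory stays the same.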
KV cache not released: Some inference loops fail to release the KV cache between requests, so memory usage accumulates over time.
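One way to spot this kind of accumulation is to record memory after each request finishes and check whether the readings keep climbing. The following is a minimal sketch of that check. `grows_across_requests` and its threshold are hypothetical, and the readings are assumed to come from something like `torch.cuda.memory_allocated()` sampled once per completed request.

```python
def grows_across_requests(readings_bytes, min_growth_bytes=50 * 1024**2):
    """Return True if memory rises after every request (a possible cache leak).

    readings_bytes: memory measured after each request has finished.
    min_growth_bytes: arbitrary threshold so tiny fluctuations are ignored.
    """
    if len(readings_bytes) < 3:
        return False  # too few samples to call it a trend
    deltas = [b - a for a, b in zip(readings_bytes, readings_bytes[1:])]
    return all(d > 0 for d in deltas) and sum(deltas) >= min_growth_bytes


# Made-up readings: memory climbs by about 200 MB per request
leaky = [1.0e9, 1.2e9, 1.4e9, 1.6e9]
# Made-up readings: memory returns to the same level after each request
steady = [1.0e9, 1.0e9, 1.0e9, 1.0e9]

print(grows_across_requests(leaky))
print(grows_across_requests(steady))
```

A steady rise is only a hint. Confirming a leak still means finding the reference that keeps the cache alive between requests.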
Diagnosing CPU OOM
# Check system memory usage
free -h
# Check which processes use most memory
ps aux --sort=-%mem | head -20
# Check dmesg for OOM killer
sudo dmesg | grep -i "out of memory"
sudo dmesg | grep -i "killed process"
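The dmesg checks above confirm an OOM kill after the fact. A supervising script can also flag a likely kill from the exit status: exit code 137 is 128 + 9, meaning the process died from SIGKILL, the signal the OOM killer sends. The sketch below assumes a POSIX system, and `killed_by_sigkill` is a hypothetical helper. Python's `subprocess` reports a signal death as a negative return code, while a shell reports 128 plus the signal number, so the helper accepts both.

```python
import signal


def killed_by_sigkill(returncode: int) -> bool:
    """Return True if an exit status indicates death by SIGKILL.

    Accepts the shell convention (128 + signal number) and the
    subprocess convention (negative signal number).
    """
    sigkill = int(signal.SIGKILL)  # 9 on Linux
    return returncode in (-sigkill, 128 + sigkill)


print(killed_by_sigkill(137))  # shell-style exit code
print(killed_by_sigkill(-9))   # subprocess-style return code
print(killed_by_sigkill(1))    # ordinary failure
```

SIGKILL alone does not prove an OOM kill, since anything can send it, so a positive result is a cue to check dmesg rather than a diagnosis.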
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Load a model and run inference while monitoring memory with watch -n 0.5 nvidia-smi. Note memory usage before, during, and after inference. Check free -h on the host. This baseline tells you your available headroom before OOM occurs.