18. OOM Errors
Out-of-memory (OOM) errors occur when the model requires more VRAM or system RAM than is available. This chapter covers preventing and handling OOM situations.
Symptoms
CUDA out of memoryerror during model loadingError: resource exhaustedin API responses- Ollama crashes or restarts
- System becomes unresponsive
Understanding Memory Requirements
Each model has a VRAM requirement based on parameter count and quantization:
| Model | Parameters | Quantization | VRAM Required |
|---|---|---|---|
| llama3.2:1b | 1B | Q4_K_M | ~700 MB |
| llama3.2:3b | 3B | Q4_K_M | ~2 GB |
| llama3.2:8b | 8B | Q4_K_M | ~5 GB |
| codellama:13b | 13B | Q4_K_M | ~8 GB |
| llama3.2:70b | 70B | Q4_K_M | ~40 GB |
Multi-GB models often need 8-24 GB of VRAM. System RAM is used when VRAM is insufficient, but this dramatically slows inference.
Reducing Memory Usage
Use smaller or more quantized models:
ollama run llama3.2:1b # 700 MB VRAM
ollama run llama3.2:3b # 2 GB VRAM
Reduce context window size:
ollama run llama3.2:3b --param num_ctx 1024
Smaller context uses less memory for the KV cache.
Limit GPU layers:
ollama run llama3.2:8b --param num_gpu 24
Offloading layers to CPU reduces VRAM usage at the cost of speed.
System Memory Limits
Ollama has a maximum memory setting:
# Linux/macOS - limit to 8 GB system RAM
export OLLAMA_MAX_LOADED_MODELS=1
ollama serve
# Windows PowerShell
$env:OLLAMA_MAX_LOADED_MODELS = "1"
ollama serve
Handling OOM in API Requests
When generating with long contexts, the KV cache grows. If you see OOM errors during generation, reduce prompt size or clear conversation history.
from ollama import chat, Client
client = Client(host='http://localhost:11434')
try:
response = client.chat(model='llama3.2:3b', messages=[
{'role': 'user', 'content': 'Very long prompt...' * 100}
])
except Exception as e:
if 'out of memory' in str(e).lower():
print("Reduce prompt size or context window")
raise
Monitoring Memory
Watch memory usage while running:
# NVIDIA GPU memory
watch -n 1 nvidia-smi
# System RAM
watch -n 1 free -h
When approaching limits, stop unused models:
ollama stop llama3.2:8b
Docker Memory Limits
services:
ollama:
image: ollama/ollama:latest
deploy:
resources:
limits:
memory: 8G
# ...
Without proper limits, the container can consume all system memory and trigger OOM killer.
Start monitoring memory with watch -n 1 nvidia-smi (or free -h on CPU-only systems). Load a model and watch memory usage. Try loading a second model and observe what happens.