18. OOM Errors

Chapter 18 of 20 · 20 min

Out-of-memory (OOM) errors occur when the model requires more VRAM or system RAM than is available. This chapter covers preventing and handling OOM situations.

Symptoms

  • CUDA out of memory error during model loading
  • Error: resource exhausted in API responses
  • Ollama crashes or restarts
  • System becomes unresponsive

Understanding Memory Requirements

Each model has a VRAM requirement based on parameter count and quantization:

Model Parameters Quantization VRAM Required
llama3.2:1b 1B Q4_K_M ~700 MB
llama3.2:3b 3B Q4_K_M ~2 GB
llama3.2:8b 8B Q4_K_M ~5 GB
codellama:13b 13B Q4_K_M ~8 GB
llama3.2:70b 70B Q4_K_M ~40 GB

Multi-GB models often need 8-24 GB of VRAM. System RAM is used when VRAM is insufficient, but this dramatically slows inference.

Reducing Memory Usage

Use smaller or more quantized models:

ollama run llama3.2:1b  # 700 MB VRAM
ollama run llama3.2:3b  # 2 GB VRAM

Reduce context window size:

ollama run llama3.2:3b --param num_ctx 1024

Smaller context uses less memory for the KV cache.

Limit GPU layers:

ollama run llama3.2:8b --param num_gpu 24

Offloading layers to CPU reduces VRAM usage at the cost of speed.

System Memory Limits

Ollama has a maximum memory setting:

# Linux/macOS - limit to 8 GB system RAM
export OLLAMA_MAX_LOADED_MODELS=1
ollama serve

# Windows PowerShell
$env:OLLAMA_MAX_LOADED_MODELS = "1"
ollama serve

Handling OOM in API Requests

When generating with long contexts, the KV cache grows. If you see OOM errors during generation, reduce prompt size or clear conversation history.

from ollama import chat, Client

client = Client(host='http://localhost:11434')

try:
    response = client.chat(model='llama3.2:3b', messages=[
        {'role': 'user', 'content': 'Very long prompt...' * 100}
    ])
except Exception as e:
    if 'out of memory' in str(e).lower():
        print("Reduce prompt size or context window")
    raise

Monitoring Memory

Watch memory usage while running:

# NVIDIA GPU memory
watch -n 1 nvidia-smi

# System RAM
watch -n 1 free -h

When approaching limits, stop unused models:

ollama stop llama3.2:8b

Docker Memory Limits

services:
  ollama:
    image: ollama/ollama:latest
    deploy:
      resources:
        limits:
          memory: 8G
    # ...

Without proper limits, the container can consume all system memory and trigger OOM killer.

EXERCISE

Start monitoring memory with watch -n 1 nvidia-smi (or free -h on CPU-only systems). Load a model and watch memory usage. Try loading a second model and observe what happens.