OOM Errors — Ollama — Installation to Mastery (Chapter 18)

Out-of-memory (OOM) errors occur when the model requires more VRAM or system RAM than is available. This chapter covers preventing and handling OOM situations.

Symptoms

CUDA out of memory error during model loading
Error: resource exhausted in API responses
Ollama crashes or restarts
System becomes unresponsive

Understanding Memory Requirements

Each model has a VRAM requirement based on parameter count and quantization:

Model	Parameters	Quantization	VRAM Required
llama3.2:1b	1B	Q4_K_M	~700 MB
llama3.2:3b	3B	Q4_K_M	~2 GB
llama3.2:8b	8B	Q4_K_M	~5 GB
codellama:13b	13B	Q4_K_M	~8 GB
llama3.2:70b	70B	Q4_K_M	~40 GB

Multi-GB models often need 8-24 GB of VRAM. System RAM is used when VRAM is insufficient, but this dramatically slows inference.

Reducing Memory Usage

Use smaller or more quantized models:

ollama run llama3.2:1b  # 700 MB VRAM
ollama run llama3.2:3b  # 2 GB VRAM

Reduce context window size:

ollama run llama3.2:3b --param num_ctx 1024

Smaller context uses less memory for the KV cache.

Limit GPU layers:

ollama run llama3.2:8b --param num_gpu 24

Offloading layers to CPU reduces VRAM usage at the cost of speed.

System Memory Limits

Ollama has a maximum memory setting:

# Linux/macOS - limit to 8 GB system RAM
export OLLAMA_MAX_LOADED_MODELS=1
ollama serve

# Windows PowerShell
$env:OLLAMA_MAX_LOADED_MODELS = "1"
ollama serve

Handling OOM in API Requests

When generating with long contexts, the KV cache grows. If you see OOM errors during generation, reduce prompt size or clear conversation history.

from ollama import chat, Client

client = Client(host='http://localhost:11434')

try:
    response = client.chat(model='llama3.2:3b', messages=[
        {'role': 'user', 'content': 'Very long prompt...' * 100}
    ])
except Exception as e:
    if 'out of memory' in str(e).lower():
        print("Reduce prompt size or context window")
    raise

Monitoring Memory

Watch memory usage while running:

# NVIDIA GPU memory
watch -n 1 nvidia-smi

# System RAM
watch -n 1 free -h

When approaching limits, stop unused models:

ollama stop llama3.2:8b

Docker Memory Limits

services:
  ollama:
    image: ollama/ollama:latest
    deploy:
      resources:
        limits:
          memory: 8G
    # ...

Without proper limits, the container can consume all system memory and trigger OOM killer.