What this does

Reducing the context window shrinks the KV cache, which frees VRAM and lets larger models or higher-precision quantizations run on constrained hardware. This guide shows how to estimate the VRAM saved at a given context length and how to apply the reduced setting in Ollama.

Steps

Measure your current memory usage at default context.

nvidia-smi --query-gpu=memory.used,memory.total --format=csv
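If you want the reading in a script rather than by eye, the same query can be parsed in Python. This is a minimal sketch: the helper names `parse_gpu_memory` and `query_gpu_memory` are hypothetical, the sample line is illustrative rather than measured, and it assumes `nvidia-smi` is on the PATH and is called with `--format=csv,noheader,nounits` so that each line holds two plain integers in MiB.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' lines (MiB, one line per GPU) into dicts."""
    readings = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append(
            {"used_mib": used, "total_mib": total, "free_mib": total - used}
        )
    return readings


def query_gpu_memory():
    """Run nvidia-smi and return one reading per GPU (requires an NVIDIA driver)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    # Illustrative sample line, not a real measurement.
    print(parse_gpu_memory("6144, 8192\n"))
```

Call `query_gpu_memory()` once at default context and keep the result, so you can compare it with a second reading after the context is reduced.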

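Calculate the VRAM freed by reducing context. Each token of KV cache consumes approximately 2 * n_layers * n_kv_heads * d_head * bytes_per_element bytes. The factor 2 covers keys and values, and n_kv_heads is the number of key/value heads, which is smaller than the number of attention heads in models that use grouped-query attention (Llama-3-8B has 32 attention heads but only 8 KV heads).

# Approximate KV cache size per token (Llama-3-8B, FP16 cache)
n_layers = 32
d_head = 128
n_kv_heads = 8  # grouped-query attention: 8 KV heads shared by 32 query heads
bytes_per_element = 2  # FP16
# Factor 2: one key vector and one value vector per layer and KV head
kv_per_token_bytes = 2 * n_layers * n_kv_heads * d_head * bytes_per_element
kv_per_token_mb = kv_per_token_bytes / (1024 * 1024)
print(f"KV cache: {kv_per_token_mb:.3f} MB per token")
print(f"Saving: {(8192 - 2048) * kv_per_token_mb:.0f} MB by reducing from 8K to 2K context")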
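The same arithmetic can be inverted to answer the question this guide starts from: how many tokens of context fit in a given memory budget. The sketch below is an estimate, not a measurement. The function names are hypothetical, the model figures assume Llama-3-8B with 8 KV heads and an FP16 cache, and the budget is whatever VRAM remains after the model weights and runtime overhead are loaded.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_element):
    """Bytes of KV cache per token: keys and values for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_element


def max_context_for_budget(budget_mib, per_token_bytes):
    """Largest token count whose KV cache fits in budget_mib MiB."""
    return int(budget_mib * 1024 * 1024) // per_token_bytes


if __name__ == "__main__":
    # Assumed figures: Llama-3-8B, grouped-query attention, FP16 cache.
    per_token = kv_bytes_per_token(
        n_layers=32, n_kv_heads=8, d_head=128, bytes_per_element=2
    )
    for budget_mib in (256, 512, 1024):
        tokens = max_context_for_budget(budget_mib, per_token)
        print(f"{budget_mib} MiB budget -> about {tokens} tokens of context")
```

Treat the result as an upper bound and round down to a convenient value, since the runtime also allocates compute buffers that grow with context.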

Reduce context at runtime.

curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.2", "prompt": "Hello", "options": {"num_ctx": 2048}}'
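The same request can be sent from Python with only the standard library. This is a sketch under a few assumptions: the helper names are hypothetical, the server is the local Ollama instance from the curl example, and `"stream": False` is added so the reply arrives as a single JSON object instead of a stream of chunks.

```python
import json
import urllib.request


def build_generate_request(model, prompt, num_ctx, host="http://localhost:11434"):
    """Build a POST request for /api/generate with a per-request num_ctx."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON object instead of streamed chunks
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, num_ctx):
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(model, prompt, num_ctx)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

For example, `generate("llama3.2", "Hello", 2048)` mirrors the curl call. The option applies only to that request, so it is a convenient way to try several context lengths before persisting one.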

Persist reduced context via Modelfile.

FROM llama3.2
PARAMETER num_ctx 1024

ollama create lowmem-llama -f Modelfile
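If the context length comes out of a calculation, the Modelfile can be generated rather than written by hand. A minimal sketch, assuming a hypothetical helper `render_modelfile` and only the two directives used above:

```python
def render_modelfile(base_model, num_ctx):
    """Return Modelfile text that pins num_ctx for the given base model."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be a positive integer")
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"


if __name__ == "__main__":
    # Prints the same two lines as the hand-written Modelfile above.
    print(render_modelfile("llama3.2", 1024), end="")
```

Save the output as `Modelfile` and run the same `ollama create` command.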

Verify memory reduction.

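# Load the model at default context and note the reading,
# then load lowmem-llama and take a second reading.
nvidia-smi --query-gpu=memory.used --format=csv,noheader
# Expected: roughly 500 MB to 2 GB less VRAM usage, depending on the model and the original context length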

Verification

ollama run lowmem-llama
/show parameters
# Expected: "num_ctx" set to reduced value (e.g., 1024 or 2048)

Common failures

Context too short for the task: Summarizing a book with a 1024-token context will fail because the input does not fit in the window. Match context to your longest expected input plus the length of the expected output.
Over-reduction wastes capacity: If you only save 200 MB by going from 2048 to 1024 but lose half the context, the trade-off may not be worth it.
num_ctx ignored in some Ollama versions: Upgrade to 0.3+ and verify with /show parameters.
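The over-reduction trade-off is easier to judge with numbers, because each halving of the context saves half as much memory as the previous one. The sketch below tabulates this. The function name is hypothetical, and the per-token figure of 131072 bytes is an assumption (Llama-3-8B with 8 KV heads and an FP16 cache), so substitute the value for your own model.

```python
# Assumed per-token KV cache size: 2 * 32 layers * 8 KV heads * 128 dims * 2 bytes.
KV_BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2


def halving_savings(start_ctx, min_ctx, per_token_bytes=KV_BYTES_PER_TOKEN):
    """List (from_ctx, to_ctx, saved_mib) for each halving down to min_ctx."""
    steps = []
    ctx = start_ctx
    while ctx // 2 >= min_ctx:
        saved_mib = (ctx - ctx // 2) * per_token_bytes / (1024 * 1024)
        steps.append((ctx, ctx // 2, saved_mib))
        ctx //= 2
    return steps


if __name__ == "__main__":
    for from_ctx, to_ctx, saved_mib in halving_savings(8192, 1024):
        print(f"{from_ctx} -> {to_ctx}: saves about {saved_mib:.0f} MiB")
```

Under these assumptions the step from 8192 to 4096 saves 512 MiB, while the step from 2048 to 1024 saves only 128 MiB for the same halving of usable context.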

How to adjust context length when running on limited hardware
