HOW-TO · INF
How to adjust context length when running on limited hardware
PREREQUISITES
Ollama or vLLM installed, limited VRAM
What this does
Reducing the context window frees significant VRAM, allowing larger models or quantizations to run on constrained hardware. This guide calculates the optimal context length for your memory budget.
Steps
Measure your current memory usage at default context.
nvidia-smi --query-gpu=memory.used,memory.total --format=csvCalculate VRAM freed by reducing context. Each token of KV cache consumes approximately
2 * n_layers * d_head * n_heads * bytes_per_parambytes.# Approximate KV cache size per token (Llama-3-8B, FP16) n_layers = 32 d_head = 128 n_heads = 32 bytes_per_param = 2 # FP16 kv_per_token_bytes = 2 * n_layers * d_head * n_heads * bytes_per_param kv_per_token_mb = kv_per_token_bytes / (1024 * 1024) print(f"KV cache: {kv_per_token_mb:.1f} MB per token") print(f"Saving: {(8192 - 2048) * kv_per_token_mb:.0f} MB by reducing from 8K to 2K context")Reduce context at runtime.
curl -s http://localhost:11434/api/generate \ -d '{"model": "llama3.2", "prompt": "Hello", "options": {"num_ctx": 2048}}'Persist reduced context via Modelfile.
FROM llama3.2 PARAMETER num_ctx 1024ollama create lowmem-llama -f ModelfileVerify memory reduction.
nvidia-smi --query-gpu=memory.used --format=csv,noheader # Run with default context, note memory, then switch to lowmem-llama # Expected: 500 MB - 2 GB less VRAM usage
Verification
ollama run lowmem-llama
/show parameters
# Expected: "num_ctx" set to reduced value (e.g., 1024 or 2048)
Common failures
- Context too short for the task: Summarizing a book with 1024 context will fail. Match context to your longest expected input.
- Over-reduction wastes capacity: If you only save 200 MB by going from 2048 to 1024 but lose half the context, the trade-off may not be worth it.
- num_ctx ignored in some Ollama versions: Upgrade to 0.3+ and verify with
/show parameters.
Related guides
RELATED GUIDES