HOW-TO · INF

How to adjust context length when running on limited hardware

intermediate10 minBy Fredoline Eruo
PREREQUISITES

Ollama or vLLM installed, limited VRAM

What this does

Reducing the context window frees significant VRAM, allowing larger models or quantizations to run on constrained hardware. This guide calculates the optimal context length for your memory budget.

Steps

  1. Measure your current memory usage at default context.

    nvidia-smi --query-gpu=memory.used,memory.total --format=csv
    
  2. Calculate VRAM freed by reducing context. Each token of KV cache consumes approximately 2 * n_layers * d_head * n_heads * bytes_per_param bytes.

    # Approximate KV cache size per token (Llama-3-8B, FP16)
    n_layers = 32
    d_head = 128
    n_heads = 32
    bytes_per_param = 2  # FP16
    kv_per_token_bytes = 2 * n_layers * d_head * n_heads * bytes_per_param
    kv_per_token_mb = kv_per_token_bytes / (1024 * 1024)
    print(f"KV cache: {kv_per_token_mb:.1f} MB per token")
    print(f"Saving: {(8192 - 2048) * kv_per_token_mb:.0f} MB by reducing from 8K to 2K context")
    
  3. Reduce context at runtime.

    curl -s http://localhost:11434/api/generate \
      -d '{"model": "llama3.2", "prompt": "Hello", "options": {"num_ctx": 2048}}'
    
  4. Persist reduced context via Modelfile.

    FROM llama3.2
    PARAMETER num_ctx 1024
    
    ollama create lowmem-llama -f Modelfile
    
  5. Verify memory reduction.

    nvidia-smi --query-gpu=memory.used --format=csv,noheader
    # Run with default context, note memory, then switch to lowmem-llama
    # Expected: 500 MB - 2 GB less VRAM usage
    

Verification

ollama run lowmem-llama
/show parameters
# Expected: "num_ctx" set to reduced value (e.g., 1024 or 2048)

Common failures

  • Context too short for the task: Summarizing a book with 1024 context will fail. Match context to your longest expected input.
  • Over-reduction wastes capacity: If you only save 200 MB by going from 2048 to 1024 but lose half the context, the trade-off may not be worth it.
  • num_ctx ignored in some Ollama versions: Upgrade to 0.3+ and verify with /show parameters.

Related guides

RELATED GUIDES