How to configure context window size for long documents

Intermediate · 10 min · By Fredoline Eruo
PREREQUISITES

Ollama or vLLM installed

What this does

The context window is the maximum number of tokens, prompt plus generated output, that the model can attend to at once. Increasing it lets the model process an entire document in a single pass, but the KV cache grows with it and consumes more memory.
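To pick a window size before changing any settings, it can help to estimate how many tokens a document needs. The sketch below is only a rough guide: the 4-characters-per-token ratio is an assumed heuristic for English text, and `estimate_tokens` and `suggest_num_ctx` are hypothetical helper names, not part of Ollama or vLLM. The real count depends on the model's tokenizer.

```python
import math

# Assumption: about 4 characters per token for English text.
# The model's own tokenizer gives the exact number.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def suggest_num_ctx(text: str, reply_tokens: int = 1024, step: int = 4096) -> int:
    """Round the prompt estimate plus a reply budget up to a multiple of `step`."""
    needed = estimate_tokens(text) + reply_tokens
    return math.ceil(needed / step) * step


# Usage: a 50,000-character document plus a 1,024-token reply budget.
# suggest_num_ctx("word " * 10_000)
```

The reply budget is included because generated tokens share the same window as the prompt.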

Steps

  1. Set context window size at runtime in Ollama.

    curl -s http://localhost:11434/api/generate \
      -d '{"model": "llama3.2", "prompt": "Summarize this document...",
           "options": {"num_ctx": 16384}}'
    
  2. Create a Modelfile to persist the context window.

    FROM llama3.2
    PARAMETER num_ctx 32768
    

    Build the model:

    ollama create longctx-llama -f Modelfile
    ollama run longctx-llama
    
  3. For vLLM, set --max-model-len when starting the server.

    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.2-3B \
        --max-model-len 65536 \
        --gpu-memory-utilization 0.90
    
  4. Verify the effective context length.

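    # "context length" is the model's maximum; "num_ctx" is the value set in the Modelfile.
    ollama show longctx-llama | grep -iE "context|num_ctx"        # Linux/macOS
    ollama show longctx-llama | findstr /i "context num_ctx"      # Windows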
    

    Or in code:

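    import re
    import ollama
    info = ollama.show("longctx-llama")
    # Find the "PARAMETER num_ctx <n>" line; it is absent if the Modelfile never set it.
    match = re.search(r"num_ctx\s+(\d+)", info["modelfile"])
    print(f"Context: {match.group(1) if match else 'not set (runtime default applies)'}")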
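
The per-request override from step 1 can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the same local endpoint as the curl example. `build_payload` and `generate` are hypothetical helper names, and `"stream": False` is added here so that the reply arrives as a single JSON object.

```python
import json
import urllib.request

# Same endpoint as the curl call in step 1.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str, num_ctx: int) -> bytes:
    """Encode a generate request that overrides num_ctx for this call only."""
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
    return json.dumps(body).encode("utf-8")


def generate(model: str, prompt: str, num_ctx: int) -> dict:
    """POST the request to a running Ollama server and return the parsed reply."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt, num_ctx),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Usage (requires a running server):
# reply = generate("llama3.2", "Summarize this document...", 16384)
# print(reply["response"])
```

Because `json.dumps` handles the escaping, the prompt can contain quotes and newlines without breaking the request body.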
    

Verification

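# Send a document longer than Ollama's default context window
# (2048 tokens in older releases; newer releases may use a larger default).
# jq builds the JSON body so quotes and newlines in long_doc.txt are escaped.
jq -n --rawfile doc long_doc.txt \
  '{model: "longctx-llama", prompt: ($doc + "\n\nSummarize:"), stream: false}' \
  | curl -s http://localhost:11434/api/generate -d @- \
  | jq -r '.response'
# Expected: the summary covers the whole document, including its final sections.
# Ollama typically truncates an oversized prompt silently instead of returning an error.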
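
A summary that merely looks plausible does not prove the whole document was read, so it can be useful to compare the token counts Ollama reports with the configured window. The sketch below assumes a non-streaming `/api/generate` reply, which carries `prompt_eval_count` and `eval_count` fields. `context_report` is a hypothetical helper, and the "at limit" flag is a heuristic, not an official truncation signal. Prompt caching may also lower the reported prompt count on repeated requests.

```python
def context_report(reply: dict, num_ctx: int) -> dict:
    """Summarize how much of the context window a generate reply used."""
    prompt_tokens = reply.get("prompt_eval_count", 0)
    reply_tokens = reply.get("eval_count", 0)
    used = prompt_tokens + reply_tokens
    return {
        "prompt_tokens": prompt_tokens,
        "reply_tokens": reply_tokens,
        "used": used,
        "headroom": num_ctx - used,
        # Heuristic: a completely full window suggests the prompt may have been cut.
        "at_limit": used >= num_ctx,
    }


# Usage with a parsed reply from /api/generate:
# print(context_report(reply, num_ctx=32768))
```

If `at_limit` is true, or `prompt_tokens` is far below what the document should need, raise `num_ctx` and run the check again.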

Common failures

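  • Out of memory with long context: The KV cache grows linearly with context length. Each token needs about 2 × layers × KV heads × head dimension × bytes per value, which ranges from roughly 0.1 MB for small grouped-query-attention models to a few MB for large models without it. Reduce num_ctx to 8192 if VRAM is limited.
  • Model not supporting the requested length: Each model has a maximum context (e.g., Llama 3.2 supports up to 128K). Check the model card.
  • Slower inference at long context: Attention computation is O(n^2) in context length. Flash Attention reduces the memory traffic and speeds this up: in Ollama, set OLLAMA_FLASH_ATTENTION=1 before starting the server; in llama.cpp, pass --flash-attn; vLLM normally selects a FlashAttention backend automatically on supported GPUs.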
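
To judge whether a given window fits in VRAM, the KV cache size can be estimated from the model's architecture. This is a back-of-the-envelope sketch for a single sequence that ignores weights, activations, and runtime overhead. The Llama 3.2 3B figures (28 layers, 8 KV heads, head dimension 128, 16-bit cache) are assumptions that should be verified against the model's config.json, and the function names are hypothetical.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(num_ctx: int, **config) -> float:
    """KV cache size in GiB for one sequence filling the whole window."""
    return num_ctx * kv_cache_bytes_per_token(**config) / 2**30


# Assumed architecture values for Llama 3.2 3B; verify against config.json.
LLAMA_32_3B = {"layers": 28, "kv_heads": 8, "head_dim": 128}

# Usage:
# kv_cache_bytes_per_token(**LLAMA_32_3B)   # bytes per token
# kv_cache_gib(32768, **LLAMA_32_3B)        # GiB at num_ctx = 32768
```

Under these assumptions the cache costs about 0.11 MB per token, or 3.5 GiB at a 32768-token window, on top of the model weights.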

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
