What this does

The context window is the maximum number of tokens (prompt plus generated output) the model can attend to at once. Raising it lets a long document fit in a single pass, but the KV cache grows with it, so memory use rises.

Steps

Set the context window size per request in Ollama by passing num_ctx in options.

curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.2", "prompt": "Summarize this document...",
       "options": {"num_ctx": 16384}}'

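The same per-request setting can be sent from Python. This is a minimal sketch using only the standard library; it assumes the default local endpoint from the curl example, and the helper names build_generate_payload and generate are illustrative, not part of any Ollama client.

```python
import json
import urllib.request

# Default local endpoint, taken from the curl example above.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_payload(model: str, prompt: str, num_ctx: int) -> dict:
    """Build the JSON body for /api/generate with a per-request context size."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be a positive number of tokens")
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of streamed chunks
        "options": {"num_ctx": num_ctx},
    }


def generate(payload: dict, url: str = OLLAMA_URL) -> dict:
    """POST the payload to a running Ollama server and return the decoded reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, generate(build_generate_payload("llama3.2", "Summarize this document...", 16384))["response"] returns the generated text. The setting applies to that request only; the next step makes it persistent.
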
Create a Modelfile to persist the context window.

FROM llama3.2
PARAMETER num_ctx 32768

Build the model:

ollama create longctx-llama -f Modelfile
ollama run longctx-llama
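
When several context sizes need to be compared, the Modelfile can be generated rather than edited by hand. A small sketch, assuming only the two directives shown above (FROM and PARAMETER num_ctx); render_modelfile and write_modelfile are hypothetical helper names.

```python
from pathlib import Path


def render_modelfile(base_model: str, num_ctx: int) -> str:
    """Return Modelfile text that pins num_ctx on top of a base model."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be a positive number of tokens")
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"


def write_modelfile(path: Path, base_model: str, num_ctx: int) -> Path:
    """Write the rendered Modelfile to disk and return its path."""
    path.write_text(render_modelfile(base_model, num_ctx), encoding="utf-8")
    return path
```

After write_modelfile(Path("Modelfile"), "llama3.2", 32768), the ollama create command above builds the model as before.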

For vLLM, set --max-model-len when starting the server.

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.90
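
Before starting the server, it can help to compare the requested length with the limit the model was trained for. The sketch below reads max_position_embeddings from a Hugging Face style config.json; the helper names are hypothetical, and this is only a rough pre-flight check, since vLLM performs its own validation at startup and may account for settings this sketch ignores.

```python
import json
from pathlib import Path


def load_config(path: Path) -> dict:
    """Load a model's config.json from a local checkout of the model."""
    return json.loads(path.read_text(encoding="utf-8"))


def max_supported_context(config: dict) -> int:
    """Read the position limit from a Hugging Face style config dict."""
    return int(config["max_position_embeddings"])


def check_max_model_len(requested: int, config: dict) -> None:
    """Raise ValueError if the requested length exceeds the model limit."""
    limit = max_supported_context(config)
    if requested > limit:
        raise ValueError(
            f"--max-model-len {requested} exceeds the model limit of {limit} tokens"
        )
```

For example, check_max_model_len(65536, {"max_position_embeddings": 131072}) passes silently, while a request for 262144 tokens against the same config raises ValueError. The 131072 value here is illustrative; read the real one from the model's own config.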

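Verify the effective context length. In the output, "context length" is the model's trained maximum and num_ctx is the value set in the Modelfile, so match both. On Windows, use findstr /i "context num_ctx" in place of grep.

ollama show longctx-llama | grep -iE "context|num_ctx"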

Or in code:

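import ollama

info = ollama.show("longctx-llama")
# "parameters" is a newline-separated string of "name value" pairs from the Modelfile
params = dict(line.split(None, 1) for line in (info["parameters"] or "").splitlines())
print(f"Context: {params.get('num_ctx', 'not set (runtime default applies)')}")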

Verification

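# Send a document longer than the default context window
# (2048 tokens in older Ollama releases; check the default for your version).
# jq builds the request body so quotes and newlines in the file are JSON-escaped.
jq -n --rawfile doc long_doc.txt \
  '{model: "longctx-llama", prompt: ($doc + "\n\nSummarize:"), stream: false}' |
  curl -s http://localhost:11434/api/generate -d @- | jq -r '.response'
# Expected: the summary covers the end of the document as well as the beginning.
# Ollama does not return an error to the client when it truncates an over-long prompt.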
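
Because the jq filter above keeps only the response text, the token counters in the reply are discarded. The sketch below shows one way to use them instead: it assumes the non-streaming reply carries prompt_eval_count and eval_count, and the helper name context_headroom is hypothetical. Treat the result as a hint, since the counters may be missing or reduced when the server reuses a cached prompt.

```python
def context_headroom(response: dict, num_ctx: int) -> int:
    """Tokens left in the window after the prompt and the generated reply."""
    used = response.get("prompt_eval_count", 0) + response.get("eval_count", 0)
    return num_ctx - used


# Illustrative numbers, not real measurements.
sample_reply = {"prompt_eval_count": 30000, "eval_count": 500}
print(context_headroom(sample_reply, num_ctx=32768))  # 2268
```

Headroom at or near zero suggests the prompt filled the window and may have been cut, in which case a larger num_ctx or a shorter document is worth trying.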

Common failures

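Out of memory with long context: The KV cache grows linearly with num_ctx. The per-token cost is roughly 2 x layers x KV heads x head dimension x bytes per element, which ranges from about 0.1 MB for small grouped-query-attention models to a few MB for large models without it. Reduce the context to 8192 if VRAM is limited.
Model not supporting the requested length: Each model has a maximum context (e.g., Llama 3.2 supports up to 128K tokens). Check the model card.
Slower inference at long context: Standard attention is O(n^2) in context length. Flash Attention reduces the memory traffic: for Ollama, set OLLAMA_FLASH_ATTENTION=1 before starting the server; llama.cpp's llama-server exposes it as --flash-attn; vLLM typically selects a FlashAttention backend on supported GPUs without extra flags.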
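
The memory cost of a given context length can be estimated before loading the model. This sketch applies the standard KV cache size formula (keys plus values, per layer, per KV head); the model shape in the example (28 layers, 8 KV heads, head dimension 128, 16-bit cache) is an assumption for a 3B-class grouped-query model and should be replaced with the values from the actual model config.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_element: int = 2,  # 2 bytes for an fp16/bf16 cache
) -> int:
    """Estimate KV cache size in bytes for one sequence of num_tokens."""
    # Factor of 2 covers the separate key and value tensors in each layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * num_tokens


# Assumed shape for illustration; read the real values from the model config.
per_token = kv_cache_bytes(1, num_layers=28, num_kv_heads=8, head_dim=128)
total = kv_cache_bytes(32768, num_layers=28, num_kv_heads=8, head_dim=128)
print(per_token)        # 114688 bytes, about 112 KiB per token
print(total / 2**30)    # 3.5 GiB at a 32768-token context
```

The estimate covers the cache only; model weights, activations, and runtime overhead come on top of it.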

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
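
The checkpoint can be captured as a small JSON record so it travels with the guide. A minimal sketch; the field names and the checkpoint_record helper are hypothetical, and the runtime, model, and verification values are supplied by the operator rather than detected.

```python
import json
import platform
from datetime import datetime, timezone


def checkpoint_record(runtime: str, model: str, num_ctx: int, verification: str) -> dict:
    """Collect the operator checkpoint fields into a JSON-serializable dict."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "runtime": runtime,              # e.g. the Ollama or vLLM version string
        "model": model,
        "num_ctx": num_ctx,
        "host": platform.platform(),     # OS and architecture of this machine
        "python": platform.python_version(),
        "verification": verification,    # pasted output of the verification step
    }


record = checkpoint_record("ollama (version noted by operator)", "longctx-llama", 32768, "summary covered full document")
print(json.dumps(record, indent=2))
```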
