How to configure context window size for long documents
Prerequisites: Ollama or vLLM installed; curl and jq for the verification commands.
What this does
The context window determines how much text the model can process at once. Increasing it allows processing entire documents in a single pass, but consumes more memory.
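As a rough illustration of that trade-off, the sketch below estimates how many tokens a document needs and picks a context size to match. The helper names `estimate_tokens` and `pick_num_ctx` are hypothetical, the figure of about 4 characters per token is only a common heuristic for English text (a real tokenizer will differ), and the 131072 cap is an assumed model maximum.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. ~4 chars/token is a heuristic, not a tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def pick_num_ctx(prompt_tokens: int, reply_tokens: int = 1024,
                 model_max: int = 131072) -> int:
    """Smallest power-of-two context (starting at 2048) that fits the prompt
    plus room for the reply, capped at the assumed model maximum."""
    needed = prompt_tokens + reply_tokens
    size = 2048
    while size < needed and size < model_max:
        size *= 2
    return min(size, model_max)
```

Choosing the smallest size that fits, rather than always requesting the maximum, keeps memory use proportional to the document actually being processed.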
Steps
1. Set the context window size at runtime in Ollama.

    curl -s http://localhost:11434/api/generate \
      -d '{"model": "llama3.2", "prompt": "Summarize this document...", "options": {"num_ctx": 16384}}'

2. Create a Modelfile to persist the context window.

    FROM llama3.2
    PARAMETER num_ctx 32768

3. Build and run the model:

    ollama create longctx-llama -f Modelfile
    ollama run longctx-llama

4. For vLLM, set --max-model-len at server start.

    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.2-3B \
      --max-model-len 65536 \
      --gpu-memory-utilization 0.90

5. Verify the effective context length. In the output, "context length" is the model's maximum and num_ctx is the value configured in the Modelfile. On Windows, use findstr /i "context num_ctx" in place of grep.

    ollama show longctx-llama | grep -iE "context|num_ctx"

   Or in code:

    import ollama

    info = ollama.show("longctx-llama")
    # Read the num_ctx value back out of the stored Modelfile text.
    print(f"Context: {info['modelfile'].split('num_ctx')[1].split()[0]}")
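The runtime request from the Ollama steps can also be sent from Python. The following is a minimal sketch using only the standard library; `build_request` and `summarize` are hypothetical helper names, and the endpoint and field names are the ones used in the curl example above.

```python
import json
import urllib.request

# Endpoint taken from the curl example; adjust if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, document: str, num_ctx: int) -> dict:
    """Build an /api/generate payload with a per-request context size."""
    return {
        "model": model,
        "prompt": f"{document}\n\nSummarize:",
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def summarize(model: str, document: str, num_ctx: int) -> dict:
    """POST the payload to a running Ollama server and return the JSON reply."""
    body = json.dumps(build_request(model, document, num_ctx)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With a server running, `summarize("longctx-llama", text, 16384)["response"]` would return the summary text. Serializing with `json.dumps` avoids the quoting problems that arise when a document is pasted into a JSON string by hand.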
Verification
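# Send a document longer than Ollama's default context window (2048 tokens in older releases).
# jq builds the JSON body so that quotes and newlines in the document are escaped correctly.
jq -n --rawfile doc long_doc.txt \
  '{model: "longctx-llama", prompt: ($doc + "\n\nSummarize:"), stream: false}' \
  | curl -s http://localhost:11434/api/generate -d @- \
  | jq -r '.response'
# Expected: the summary covers the whole document, including its final sections.
# Ollama may truncate an over-long prompt without returning an error, so a clean exit alone is not proof.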
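One way to make this check less subjective is to compare the number of prompt tokens the server reports against what the document should need. The sketch below assumes the non-streaming response includes a `prompt_eval_count` field; `looks_truncated` is a hypothetical helper, and the 25% tolerance is an arbitrary allowance for error in the token estimate. The count can also be lower when the server reuses a cached prompt prefix, so treat a positive result as a hint rather than proof.

```python
def looks_truncated(response: dict, expected_prompt_tokens: int,
                    tolerance: float = 0.25) -> bool:
    """Return True when far fewer prompt tokens were evaluated than expected."""
    evaluated = response.get("prompt_eval_count")
    if evaluated is None:
        return False  # field missing: nothing can be concluded
    return evaluated < expected_prompt_tokens * (1 - tolerance)
```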
Common failures
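- Out of memory with long context: The KV cache grows linearly with context length. The per-token cost depends on the model's layer count, number of KV heads, head dimension, and cache precision, ranging from roughly 0.1 MB for small grouped-query-attention models to more than 1 MB for large models without it. Reduce num_ctx to 8192 if VRAM is limited.
- Model not supporting the requested length: Each model has a maximum context (e.g., Llama 3.2 supports up to 128K). Check the model card.
- Slower inference at long context: Attention computation is O(n^2) in context length. Enable Flash Attention: in Ollama, set the environment variable OLLAMA_FLASH_ATTENTION=1 before starting the server (--flash-attn is the llama.cpp flag). vLLM typically selects a FlashAttention backend automatically on supported GPUs.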
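To judge whether a given context length will fit in memory, the KV cache size can be estimated from the model's architecture. This is a sketch under stated assumptions: the function names are hypothetical, the cache is taken to be 16-bit (2 bytes per value), and the example figures of 28 layers, 8 KV heads, and head dimension 128 are assumed for Llama 3.2 3B and should be checked against the model's config.json. Model weights and activations come on top of this number.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(num_ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                 bytes_per_value: int = 2) -> float:
    """Total KV cache in GiB for a single sequence of num_ctx tokens."""
    per_token = kv_cache_bytes_per_token(
        n_layers, n_kv_heads, head_dim, bytes_per_value
    )
    return num_ctx * per_token / 2**30


# Assumed Llama 3.2 3B shape: 28 layers, 8 KV heads, head_dim 128, fp16 cache.
print(kv_cache_bytes_per_token(28, 8, 128))   # 114688 bytes, about 112 KiB
print(kv_cache_gib(32768, 28, 8, 128))        # 3.5
```

Under these assumptions a 32768-token context needs about 3.5 GiB for the cache alone, and halving num_ctx halves that figure.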
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.