07. Context Length Tradeoffs
Context length, the maximum number of tokens a model can process in one forward pass, is a spec that directly determines which tasks you can run. Understanding the tradeoffs helps you choose models based on actual needs.
Why context length matters:
Short context forces you to chunk long documents, losing cross-chunk relationships:
Task: Summarize a 100-page technical document
8K context: Must chunk into ~20 sections, lose inter-section dependencies
32K context: Process in 2-3 chunks, maintain more coherence
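The chunking step can be sketched as follows. This is a minimal illustration, not a prescribed method: the function name `chunk_tokens` and the `overlap` parameter are assumptions introduced here, and the input is assumed to be a list of token IDs already produced by whatever tokenizer your model uses. Overlapping chunks are one common way to keep some cross-chunk context, at the cost of processing some tokens twice.

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a token list into windows of at most chunk_size tokens.

    Consecutive windows share `overlap` tokens so that text near a
    boundary appears in both neighbouring chunks. (Hypothetical helper
    for illustration; not part of any library.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Example with stand-in token IDs: 10 tokens, windows of 4, overlap of 1.
example = chunk_tokens(list(range(10)), chunk_size=4, overlap=1)
print(example)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

In practice `chunk_size` has to leave room for the prompt and the generated output, so it is smaller than the model's nominal context length.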
What determines context length:
Three factors limit effective context:
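Position encoding limits: The original transformer used fixed sinusoidal (sin/cos) position encodings, which degrade at ranges longer than those seen in training. RoPE (Rotary Position Embedding) handles long ranges better, but extending it beyond the trained length still needs careful tuning.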
KV cache memory: The key/value cache grows linearly with the number of tokens in context. At 4,096 tokens it is manageable; at 128K it is 32 times larger, so even with optimized attention the memory cost is substantial.
Training data composition: Models trained on short contexts may not generalize well to longer ones even with position encoding extensions.
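The KV cache factor above can be estimated with simple arithmetic. The sketch below uses the usual accounting for a transformer decoder: one key tensor and one value tensor per layer, each holding `n_kv_heads * head_dim` values per token. The function name `kv_cache_bytes` and the example dimensions (32 layers, 8 KV heads, head dimension 128, 16-bit values) are assumptions chosen for illustration; they are not taken from this section and should be replaced with the values from your model's config.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Estimate KV cache size in bytes for one forward pass.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes an unquantized 16-bit cache.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


GIB = 2 ** 30

# Assumed, illustrative dimensions (not the specs of a particular model).
dims = dict(n_layers=32, n_kv_heads=8, head_dim=128)

small = kv_cache_bytes(seq_len=4096, **dims)
large = kv_cache_bytes(seq_len=131072, **dims)

print(f"{small / GIB:.1f} GiB")  # 0.5 GiB
print(f"{large / GIB:.1f} GiB")  # 16.0 GiB
```

Under these assumed dimensions the cache alone grows from 0.5 GiB at 4K tokens to 16 GiB at 128K tokens. A quantized cache (smaller `bytes_per_value`) or fewer KV heads shrinks the figure proportionally, which is one reason reported VRAM numbers vary so much between setups.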
Real context length comparison:
| Model | Max context (tokens) | VRAM at 4K context | VRAM at max context |
|---|---|---|---|
| Llama 3.1 8B | 128K | ~6GB | ~12GB |
| Mistral 7B | 32K | ~6GB | ~8GB |
| Phi-3-mini | 128K | ~4GB | ~8GB |
| Gemma 2 9B | 8K | ~9GB | ~9GB |
The "lost in the middle" problem:
Studies show that models retrieve information placed near the beginning or end of a long context more reliably than information placed in the middle. This is not only a matter of context length; it reflects attention patterns and retrieval capability.
How to verify effective context:
Run a test where you place a unique fact in different positions within a long context:
Context: [padding tokens] You have a pet unicorn named Zephyr. [padding tokens, about 100K in total]
User: What is my pet's name?
A model with true 128K capability should retrieve "Zephyr" regardless of position.
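The position test can be scripted along these lines. This is a sketch under stated assumptions: `build_needle_prompt` and `run_position_sweep` are hypothetical helpers written for this example, and `ask_model` stands for whatever function you use to send a prompt to your model and get text back (it is not a real library call). Padding is measured in repeated filler sentences rather than tokens, so you would need to adjust `n_filler` to reach the token count you want.

```python
def build_needle_prompt(needle, filler_sentence, n_filler, depth):
    """Place `needle` inside n_filler copies of a filler sentence.

    depth is a fraction in [0, 1]: 0.0 puts the needle at the start,
    1.0 at the end, 0.5 roughly in the middle.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    sentences = [filler_sentence] * n_filler
    sentences.insert(round(depth * n_filler), needle)
    return " ".join(sentences)


def run_position_sweep(ask_model, needle, answer, question,
                       filler_sentence, n_filler,
                       depths=(0.0, 0.5, 1.0)):
    """Return {depth: True/False} for whether the reply contains `answer`."""
    results = {}
    for depth in depths:
        context = build_needle_prompt(needle, filler_sentence, n_filler, depth)
        reply = ask_model(context + "\n\n" + question)
        results[depth] = answer.lower() in reply.lower()
    return results


# Stand-in for a real model call, used only to show the control flow.
def fake_model(prompt):
    return "Zephyr" if "Zephyr" in prompt else "I don't know."


print(run_position_sweep(
    fake_model,
    needle="You have a pet unicorn named Zephyr.",
    answer="Zephyr",
    question="What is my pet's name?",
    filler_sentence="The sky was grey that morning.",
    n_filler=1000,
))
```

With a real model behind `ask_model`, a `False` at some depth indicates a position where retrieval fails. Substring matching is a crude check, so inspect the raw replies before drawing conclusions.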
Find a long-context model and test its retrieval at different positions (beginning, middle, end) using a unique string. Document whether retrieval degrades at any position.