Context Length Tradeoffs — Understanding AI Models (Chapter 7)

Context length, the maximum number of tokens a model can process in one forward pass, is a specification that directly determines which tasks you can run. Understanding the tradeoffs helps you choose a model based on actual needs.

Why context length matters:

Short context forces you to chunk long documents, losing cross-chunk relationships:

Task: Summarize a 100-page technical document (roughly 60-70K tokens)
8K context: Must chunk into ~10 sections, lose inter-section dependencies
32K context: Process in 2-3 chunks, maintain more coherence

What determines context length:

Three factors limit effective context:

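Position encoding limits: The original transformer used fixed sinusoidal (sin/cos) position encodings, which generalize poorly to sequences longer than those seen in training. RoPE (Rotary Position Embedding) extends the usable range, but stretching it beyond the training length requires careful tuning of its scaling.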
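KV cache memory: The key-value cache grows linearly with context length, so a 128K context needs 32 times the cache memory of a 4K context. At 4,096 tokens the cache is manageable. At 128K, even with optimized attention, it is substantial.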
Training data composition: Models trained on short contexts may not generalize well to longer ones even with position encoding extensions.
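
The KV cache factor can be estimated directly: each layer stores one key and one value vector per KV head for every token. The sketch below applies that formula. The configuration (32 layers, 8 KV heads, head dimension 128, 16-bit cache values) is an assumed example of an 8B-class model with grouped-query attention, and kv_cache_bytes is a hypothetical helper name.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2, batch_size: int = 1) -> int:
    """Approximate KV cache size in bytes.

    The leading 2 accounts for storing both a key and a value tensor per layer.
    bytes_per_value=2 assumes a 16-bit cache; a quantized cache would use less.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch_size


# Assumed configuration for illustration only.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)
for seq_len in (4_096, 131_072):
    gib = kv_cache_bytes(seq_len, **cfg) / 2**30
    print(f"{seq_len:>7} tokens: {gib:.1f} GiB")
```

Under these assumptions the cache takes about 0.5 GiB at 4K tokens and 16 GiB at 128K tokens, on top of the model weights. Models without grouped-query attention store more KV heads per layer and need proportionally more.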

Real context length comparison:

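Model	Max context (tokens)	Approx. VRAM at 4K context	Approx. VRAM at max context
Llama 3.1 8B	128K	~6GB	~12GB
Mistral 7B	32K	~6GB	~8GB
Phi-3-mini	128K	~4GB	~8GB
Gemma 2 9B	8K	~9GB	~9GB

Treat these VRAM figures as rough: actual usage depends heavily on weight quantization, KV cache precision, and the inference runtime. With a 16-bit KV cache, the cache alone at 128K tokens can exceed the max-context totals shown, so measure on your own setup before relying on them.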

The "lost in the middle" problem:

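Studies such as Liu et al. (2023), "Lost in the Middle," show that models retrieve information most reliably from the beginning and end of a long context and struggle with information placed in the middle. The limit is therefore not just the nominal context length but also attention patterns and retrieval capability.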

How to verify effective context:

Run a test where you place a unique fact in different positions within a long context:

System: [padding tokens] You have a pet unicorn named Zephyr. [remaining padding tokens, ~100K in total]
User: What is my pet's name?

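A model with true 128K capability should retrieve "Zephyr" regardless of where the fact is placed. Repeat the test at several depths (start, middle, end) and at several context lengths. Failures that appear only at certain depths or lengths show where the effective context falls short of the advertised one.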
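
A position sweep like this can be automated. The sketch below builds one prompt per depth and checks each answer for the expected name. The names build_prompt and run_position_sweep, the filler sentence, and the generate callable (standing in for whatever function sends a prompt to your model and returns its reply) are all assumptions for illustration. Padding is counted in sentences here, so use your model's tokenizer if you need an exact token count.

```python
from typing import Callable, Dict, Sequence

NEEDLE = "You have a pet unicorn named Zephyr."
QUESTION = "What is my pet's name?"
FILLER = "The quick brown fox jumps over the lazy dog."


def build_prompt(depth: float, n_filler: int, needle: str = NEEDLE,
                 question: str = QUESTION, filler: str = FILLER) -> str:
    """Place the needle at a relative depth (0.0 = start, 1.0 = end) in the padding."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * n_filler
    sentences.insert(round(depth * n_filler), needle)
    # The question always comes last, after all of the padding.
    return " ".join(sentences) + "\n\n" + question


def run_position_sweep(generate: Callable[[str], str],
                       depths: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
                       n_filler: int = 1000,
                       expected: str = "Zephyr") -> Dict[float, bool]:
    """Return, for each depth, whether the reply contains the expected fact."""
    return {
        d: expected.lower() in generate(build_prompt(d, n_filler)).lower()
        for d in depths
    }


def toy_generate(prompt: str) -> str:
    """Stand-in for a real model call: it only 'reads' the first 2,000 characters."""
    return "Zephyr" if "Zephyr" in prompt[:2000] else "I don't know."


print(run_position_sweep(toy_generate))
```

Replace toy_generate with a call to the model under test and raise n_filler until the prompt approaches the advertised context length. A depth that returns False marks a position the model cannot reliably read at that length.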