14. Context Windows

Chapter 14 of 20 · 18 min

What is a Context Window?

The context window is how much text the model can "see" at once. It's measured in tokens.

Practical context windows:

  • Small models: 2K-4K tokens (enough for ~1,500 words)
  • Medium models: 8K-32K tokens (enough for ~5,000-20,000 words)
  • Large models: 128K+ tokens (enough for entire books)

How it works:

When you send a prompt, the entire conversation history fits within the context window. Once you exceed it, you have two options:

  1. Truncation: Old messages are removed to make room for new ones
  2. Manual summarization: You summarize old conversation and start fresh

Context Window in Ollama

# Set context size when running
ollama run llama3.2:7b --param num_ctx 4096

# Default varies by model
ollama show llama3.2:7b | grep context

Context limits in Modelfile:

PARAMETER num_ctx 8192

Why Context Matters

Factual consistency: The model can only "remember" what's in context. If you discuss something 50 messages ago and it falls out of context, the model loses awareness of it.

Code understanding: Large codebases exceed context windows. You can't paste an entire project and expect analysis—you need to paste relevant sections.

Document analysis: A 10-page document might be 3,000 tokens. That fits in most models. A 100-page document might be 30,000 tokens—fits in some, not others.

Practical Implications

Conversation length:

With 8K context and average conversation:

  • System prompt: ~500 tokens
  • Each exchange: ~100 tokens
  • Usable exchanges: ~70 before needing to truncate

File upload:

If you paste a large document and then ask questions about it, that document consumes context. With 4K context, you might have only 2K left for the actual conversation.

Strategies for long context:

  1. Chunking: Analyze document in sections
  2. Summarization: Summarize large documents before deep analysis
  3. RAG (Retrieval Augmented Generation): Use external tools to retrieve relevant chunks

Context Window vs. Model Quality

This is a tradeoff. Bigger context windows require more memory (RAM/VRAM). The same model running with 128K context needs more resources than with 8K context.

Some models are specifically trained for long context (e.g., Llama 3.1 70B has 128K context). Others are optimized for smaller context but better quality in that range.

Rule of thumb: If you regularly need to analyze documents >5,000 words, look for models with 32K+ context. Otherwise, 8K is usually sufficient.

EXERCISE
  1. Calculate how many tokens your typical conversations use: count words in a recent long conversation, divide by 0.75 (rough words-to-tokens ratio).
  2. Check your current model's context limit: ollama show <model> | grep context
  3. Calculate: how many of your typical conversations fit in context? When you hit the limit, what's lost?