10. Context Length Errors
Understanding Context Windows
Every model has a maximum context length—Llama 2 has 4096 tokens, Mistral 7B has 8192, many fine-tuned variants extend this further. The context includes your input prompt, generated output, and any system message.
Total tokens = prompt tokens + generated tokens + history tokens (in chat templates)
Diagnosing Context Errors
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
# Count tokens in your input
prompt = "Your long prompt here..."
tokens = tokenizer.encode(prompt, add_special_tokens=True)
print(f"Prompt tokens: {len(tokens)}")
print(f"Max context: {tokenizer.model_max_length}")
print(f"Remaining for generation: {tokenizer.model_max_length - len(tokens)}")
Common Fixes
Prompt too long: Truncate the input or use a model with a larger context window.
Chat history accumulation: In multi-turn conversations, accumulated history can exceed context. Implement sliding window context that keeps only the most recent N tokens.
# Sliding window for chat
MAX_CONTEXT = 4096
MAX_HISTORY_TOKENS = 3072
def truncate_history(messages):
total = sum(len(tokenizer.encode(m["content"])) for m in messages)
while total > MAX_HISTORY_TOKENS and len(messages) > 2:
removed = messages.pop(1) # Keep system message
total -= len(tokenizer.encode(removed["content"]))
return messages
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Count the token usage for a typical prompt in your application. Add the typical number of generated tokens. Check if the sum exceeds your model's context window. If it does, calculate the truncation required.