Context Length Errors — Troubleshooting Local AI (Chapter 10)

Understanding Context Windows

Every model has a maximum context length—Llama 2 has 4096 tokens, Mistral 7B has 8192, many fine-tuned variants extend this further. The context includes your input prompt, generated output, and any system message.

Total tokens = prompt tokens + generated tokens + history tokens (in chat templates)

Diagnosing Context Errors

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Count tokens in your input
prompt = "Your long prompt here..."
tokens = tokenizer.encode(prompt, add_special_tokens=True)
print(f"Prompt tokens: {len(tokens)}")
print(f"Max context: {tokenizer.model_max_length}")
print(f"Remaining for generation: {tokenizer.model_max_length - len(tokens)}")

Common Fixes

Prompt too long: Truncate the input or use a model with a larger context window.

Chat history accumulation: In multi-turn conversations, accumulated history can exceed context. Implement sliding window context that keeps only the most recent N tokens.

# Sliding window for chat
MAX_CONTEXT = 4096
MAX_HISTORY_TOKENS = 3072

def truncate_history(messages):
    total = sum(len(tokenizer.encode(m["content"])) for m in messages)
    while total > MAX_HISTORY_TOKENS and len(messages) > 2:
        removed = messages.pop(1)  # Keep system message
        total -= len(tokenizer.encode(removed["content"]))
    return messages

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.