HOW-TO · INF
How to manage context overflow for very long conversations
PREREQUISITES
Ollama or similar chat runtime
What this does
When a conversation exceeds the model's context window, older messages are lost. This guide covers sliding window, summarization, and selective truncation strategies.
Steps
Implement sliding window truncation. Keep the system prompt and most recent messages, dropping the oldest turns.
    import tiktoken

    def trim_to_window(messages, max_tokens=4096, model="llama3.2"):
        # cl100k_base is an OpenAI encoding, so counts for Llama models are
        # approximate; `model` is currently unused.
        enc = tiktoken.get_encoding("cl100k_base")
        token_counts = [len(enc.encode(m.get("content", ""))) for m in messages]
        total = sum(token_counts)
        # Drop the oldest non-system message, one at a time, until the history
        # fits. Index 0 (the system prompt) is never removed. The list is
        # modified in place.
        while total > max_tokens and len(messages) > 2:
            messages.pop(1)
            total -= token_counts.pop(1)
        return messages

Use summarization as an alternative to dropping. When the window is full, summarize the oldest messages and replace them with a single summary entry.
    import requests

    def summarize_old_messages(messages, model="llama3.2"):
        # Summarize everything except the system prompt and the last 4 messages.
        old_content = "\n".join(m["content"] for m in messages[:-4] if m["role"] != "system")
        if len(old_content) < 200:
            return messages  # too little old history to be worth summarizing
        # Ask the model to summarize
        resp = requests.post("http://localhost:11434/api/generate",
                             json={"model": model, "prompt": f"Summarize:\n{old_content}", "stream": False},
                             timeout=120)
        resp.raise_for_status()
        summary_text = resp.json()["response"]
        # Replace old messages with a single summary entry
        summary_msg = {"role": "user", "content": f"[Previous context summary]: {summary_text}"}
        return [messages[0], summary_msg] + messages[-4:]

Set num_ctx to the largest value that both the model and your available memory support. A larger window uses more memory, so raise it only as far as you need.

    curl -s http://localhost:11434/api/chat \
      -d '{"model": "llama3.2", "messages": [...], "options": {"num_ctx": 32768}}'

Monitor token usage per turn.
    def count_tokens(text, model="llama3.2"):
        # Approximate count: cl100k_base is not the Llama tokenizer, and
        # `model` is currently unused.
        enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
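The overview mentions selective truncation, which the steps do not show. The following is a minimal sketch of one way to do it, not a definitive implementation. It assumes each message may carry a hypothetical `pinned` flag (not part of the Ollama message format, so strip it before sending) and uses a crude estimate of about four characters per token instead of a real tokenizer.

```python
from typing import Callable, Dict, List


def approx_tokens(text: str) -> int:
    """Rough estimate (assumption): about 4 characters per token."""
    return (len(text) + 3) // 4


def selective_trim(
    messages: List[Dict],
    max_tokens: int,
    keep_recent: int = 4,
    count: Callable[[str], int] = approx_tokens,
) -> List[Dict]:
    """Drop the oldest unpinned messages until the history fits max_tokens.

    Never dropped: the system prompt at index 0, any message with the
    hypothetical "pinned" flag, and the last `keep_recent` messages.
    Returns a new list; the input is left unchanged.
    """
    counts = [count(m.get("content", "")) for m in messages]
    total = sum(counts)
    # Messages at or after this index are protected as "recent".
    protected_from = max(1, len(messages) - keep_recent)
    drop = set()
    for i in range(1, protected_from):
        if total <= max_tokens:
            break
        if messages[i].get("pinned"):
            continue  # keep pinned messages regardless of age
        drop.add(i)
        total -= counts[i]
    return [m for i, m in enumerate(messages) if i not in drop]
```

If the protected messages alone exceed `max_tokens`, the result is still over budget, so a caller may want to fall back to summarization in that case.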
Verification
# Run a 50-turn conversation; verify no errors and the model recalls the last 5 turns
# Expected: Model remembers recent context without crashing or producing irrelevant output
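The check described above can be rehearsed offline before pointing it at a real model. This sketch makes several assumptions: `echo_reply` is a hypothetical stand-in for the model, token counts use a rough four-characters-per-token estimate, and `trim` is a simplified copy-based variant of the sliding window from the steps.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def approx_tokens(text: str) -> int:
    """Rough estimate (assumption): about 4 characters per token."""
    return (len(text) + 3) // 4


def trim(messages: List[Message], max_tokens: int) -> List[Message]:
    """Sliding window: keep index 0, drop the oldest other messages until it fits."""
    kept = list(messages)
    total = sum(approx_tokens(m["content"]) for m in kept)
    while total > max_tokens and len(kept) > 2:
        total -= approx_tokens(kept.pop(1)["content"])
    return kept


def echo_reply(messages: List[Message]) -> str:
    """Hypothetical stand-in for the model: echoes the last user message."""
    return "ack: " + messages[-1]["content"]


def run_conversation(
    turns: int,
    max_tokens: int,
    reply: Callable[[List[Message]], str] = echo_reply,
) -> List[Message]:
    """Simulate a conversation, trimming the history before every reply."""
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    for n in range(1, turns + 1):
        messages.append({"role": "user", "content": f"turn {n}: remember the number {n}"})
        messages = trim(messages, max_tokens)
        messages.append({"role": "assistant", "content": reply(messages)})
    return messages


history = run_conversation(turns=50, max_tokens=400)
joined = " ".join(m["content"] for m in history)

# The system prompt survives and the last 5 turns are still in the window.
assert history[0]["role"] == "system"
assert all(f"turn {n}:" in joined for n in range(46, 51))
```

To test against a running server, pass a `reply` function that sends `messages` to your chat endpoint and returns the assistant text.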
Common failures
- Dropping the system prompt: Always preserve messages[0] (the system message) during truncation.
- Summarization changes meaning: The summary may lose nuance. Keep the original last 3-4 turns intact.
- tiktoken model mismatch: Use the correct encoding for your model. Llama models often use a custom tokenizer; install transformers and use AutoTokenizer.
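For the tokenizer mismatch, counting with the model's own tokenizer could look like the sketch below. The Hugging Face repository name is an assumption (it is gated, so it needs an accepted license and a login), and you should confirm it matches the model your runtime actually serves. `transformers` is imported lazily so the counting helpers work with any object that exposes a compatible `encode` method.

```python
from typing import Dict, List


def load_tokenizer(name: str = "meta-llama/Llama-3.2-1B-Instruct"):
    """Load a Hugging Face tokenizer.

    The default repo name is an assumption; replace it with the tokenizer
    that matches your model. Requires `pip install transformers`.
    """
    from transformers import AutoTokenizer  # lazy import: optional dependency

    return AutoTokenizer.from_pretrained(name)


def count_tokens(text: str, tokenizer) -> int:
    """Count tokens in raw text, without BOS/EOS or other special tokens."""
    return len(tokenizer.encode(text, add_special_tokens=False))


def count_conversation(messages: List[Dict[str, str]], tokenizer) -> int:
    """Sum the content tokens of every message in the history."""
    return sum(count_tokens(m.get("content", ""), tokenizer) for m in messages)
```

Chat templates add role markers and separators around each message, so treat the total as a lower bound and leave some headroom below num_ctx.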