What this does

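When a conversation grows beyond the model's context window, the oldest messages no longer fit and are lost, so the model can forget earlier instructions and facts. This guide covers three strategies for staying within the window: sliding-window truncation, summarization, and selective truncation.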

Steps

Implement sliding window truncation. Keep the system prompt and most recent messages, dropping the oldest turns.

import tiktoken

def trim_to_window(messages, max_tokens=4096, model="llama3.2"):
    # model is currently unused. cl100k_base is an OpenAI encoding, so for
    # Llama models the counts are approximate; leave headroom in max_tokens.
    enc = tiktoken.get_encoding("cl100k_base")
    token_counts = [len(enc.encode(m.get("content", ""))) for m in messages]
    total = sum(token_counts)
    # Note: this modifies the caller's list in place.
    while total > max_tokens and len(messages) > 2:
        # Remove the oldest non-system message (keep system prompt at index 0)
        messages.pop(1)
        total -= token_counts.pop(1)
    return messages
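
A variant sketch of the same sliding-window idea, for cases where user and assistant messages should be dropped together so the trimmed history does not start with an orphaned assistant reply. The name `trim_pairs` and the `count` parameter are illustrative assumptions, not part of any library; any function that maps text to a token count will do.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

def trim_pairs(messages: List[Message], max_tokens: int,
               count: Callable[[str], int]) -> List[Message]:
    """Return a trimmed copy; assumes messages[0] is the system prompt."""
    system, history = messages[0], list(messages[1:])
    total = count(system["content"]) + sum(count(m["content"]) for m in history)
    # Drop the two oldest messages at a time (one user/assistant exchange),
    # always leaving at least one recent message after the system prompt.
    while total > max_tokens and len(history) > 2:
        for dropped in history[:2]:
            total -= count(dropped["content"])
        history = history[2:]
    return [system] + history
```

Because the function returns a new list, the full transcript can still be kept elsewhere, for example for logging or later summarization.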

Use summarization as an alternative to dropping. When the window is full, summarize the oldest messages and replace them.

import requests

def summarize_old_messages(messages, model="llama3.2"):
    # Everything between the system prompt (index 0) and the last 4 messages
    old_content = "\n".join(m["content"] for m in messages[1:-4] if m["role"] != "system")
    if len(old_content) < 200:
        return messages  # too little old history to be worth summarizing
    # Ask the model to summarize
    resp = requests.post("http://localhost:11434/api/generate",
        json={"model": model, "prompt": f"Summarize:\n{old_content}", "stream": False},
        timeout=120)
    resp.raise_for_status()  # fail loudly instead of parsing an error body
    summary_text = resp.json()["response"]
    # Replace old messages with a single summary entry
    messages = [messages[0]] + [{"role": "user", "content": f"[Previous context summary]: {summary_text}"}] + messages[-4:]
    return messages
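
The step says to summarize "when the window is full", which implies a budget check before the model is called. A minimal sketch of that trigger, assuming a hypothetical `summarize` callable (for example, a wrapper around the request above) and a `count` function for token counting; `compact_if_full` and `keep_last` are illustrative names.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

def compact_if_full(messages: List[Message], max_tokens: int,
                    count: Callable[[str], int],
                    summarize: Callable[[str], str],
                    keep_last: int = 4) -> List[Message]:
    """Summarize older history only when the token budget is exceeded."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    total = sum(count(m["content"]) for m in messages)
    # Messages between the system prompt and the recent tail
    old = messages[1:-keep_last]
    if total <= max_tokens or not old:
        return messages  # still fits, or nothing old enough to summarize
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in old)
    summary = {"role": "user",
               "content": f"[Previous context summary]: {summarize(transcript)}"}
    return [messages[0], summary, *messages[-keep_last:]]
```

Passing the summarizer in as a function keeps the budget logic testable without a running model.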

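Set num_ctx explicitly rather than relying on the default. Ollama's default context length is usually much smaller than the model's maximum, and input beyond num_ctx is truncated. Pick the largest value that the model supports and that fits in memory; larger windows use more RAM or VRAM.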

curl -s http://localhost:11434/api/chat \
  -d '{"model": "llama3.2", "messages": [...], "options": {"num_ctx": 32768}}'
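
Whatever value is chosen for `num_ctx`, the prompt and the generated reply share that window, so the history budget passed to a trimming function should be smaller than `num_ctx`. A small arithmetic sketch; the function name and the reserve sizes are illustrative assumptions, not recommended values.

```python
def history_budget(num_ctx: int, reply_reserve: int = 1024,
                   safety_margin: int = 256) -> int:
    """Tokens available for the system prompt plus conversation history."""
    budget = num_ctx - reply_reserve - safety_margin
    if budget <= 0:
        raise ValueError("num_ctx is too small for the requested reserves")
    return budget

# With the num_ctx from the request above: 32768 - 1024 - 256
print(history_budget(32768))  # 31488
```

The safety margin covers tokenizer approximation error and the extra tokens a chat template adds around each message.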

Monitor token usage per turn.

def count_tokens(text, model="llama3.2"):
    # model is currently unused: cl100k_base only approximates Llama token counts
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))
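
Client-side counting is an estimate. Ollama's non-streaming `/api/chat` and `/api/generate` responses also report server-side counts in the `prompt_eval_count` and `eval_count` fields, which can be logged per turn. A sketch, assuming the response has already been parsed into a dict and that either field may be missing; `turn_usage` is an illustrative helper name.

```python
from typing import Any, Dict

def turn_usage(response: Dict[str, Any], num_ctx: int) -> Dict[str, float]:
    """Extract server-reported token counts from one Ollama response."""
    # .get() with a default because a field can be missing from a response
    prompt_tokens = int(response.get("prompt_eval_count", 0))
    reply_tokens = int(response.get("eval_count", 0))
    used = prompt_tokens + reply_tokens
    return {
        "prompt_tokens": prompt_tokens,
        "reply_tokens": reply_tokens,
        "used": used,
        "window_fraction": used / num_ctx,  # share of the context window consumed
    }
```

A `window_fraction` that approaches 1.0 is a signal to trim or summarize before the next turn.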

Verification

# Run a 50-turn conversation; verify no errors and the model recalls the last 5 turns
# Expected: Model remembers recent context without crashing or producing irrelevant output
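
Part of this check can be automated without a running model. The sketch below feeds 50 synthetic turns through any trimming function and asserts the structural invariants the guide relies on: the system prompt survives, the result fits the budget, and the last 5 turns are intact. `check_trim` and its parameters are hypothetical names; whether the model actually recalls those turns still needs a manual or model-in-the-loop check.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

def check_trim(trim: Callable[[List[Message]], List[Message]],
               count: Callable[[str], int], max_tokens: int,
               turns: int = 50, recent_turns: int = 5) -> None:
    """Raise AssertionError if trim() breaks an invariant during a long chat."""
    messages: List[Message] = [{"role": "system", "content": "You are helpful."}]
    for i in range(turns):
        messages.append({"role": "user", "content": f"question number {i}"})
        messages.append({"role": "assistant", "content": f"answer number {i}"})
        messages = trim(messages)
        assert messages[0]["role"] == "system", "system prompt was dropped"
        total = sum(count(m["content"]) for m in messages)
        assert total <= max_tokens, f"turn {i}: {total} tokens exceeds budget"
    # The final history must end with the last recent_turns exchanges, in order
    expected: List[str] = []
    for i in range(turns - recent_turns, turns):
        expected += [f"question number {i}", f"answer number {i}"]
    tail = [m["content"] for m in messages[-2 * recent_turns:]]
    assert tail == expected, "recent turns were not preserved"
```

The budget passed in must be large enough to hold the system prompt plus the recent turns, otherwise the check fails by construction.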

Common failures

Dropping the system prompt: Always preserve messages[0] (system message) during truncation.
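Summarization changes meaning: The summary may lose nuance. Keep the most recent messages verbatim (the example above keeps the last 4) and summarize only what is older.
tiktoken model mismatch: cl100k_base is an OpenAI encoding, so its counts are only approximate for other models. Llama models use their own tokenizer; for exact counts, install transformers and load it with AutoTokenizer.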
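
For the tokenizer mismatch, one option is to hide the tokenizer behind a single counting function so the rest of the code does not depend on which library produced it. A sketch under assumptions: `make_counter` and `hf_counter` are illustrative names, `model_id` is whichever Hugging Face repository matches the served model (some Llama repositories are gated and require access approval), and `transformers` is imported lazily so the code still loads without it.

```python
from typing import Callable, List

def make_counter(encode: Callable[[str], List]) -> Callable[[str], int]:
    """Wrap any encode(text) -> token list function as a token counter."""
    def count(text: str) -> int:
        return len(encode(text))
    return count

def hf_counter(model_id: str) -> Callable[[str], int]:
    """Counter backed by the model's own tokenizer (downloaded on first use)."""
    from transformers import AutoTokenizer  # lazy import: optional dependency
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # add_special_tokens=False counts only the text, not BOS/EOS markers
    return make_counter(
        lambda text: tokenizer.encode(text, add_special_tokens=False))

# Dependency-free stand-in for quick tests: whitespace-separated "tokens"
rough_count = make_counter(str.split)
```

Even with the right tokenizer, the chat template adds a few tokens around each message, so per-message counts slightly underestimate what the server sees.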

How to manage context overflow for very long conversations
