What this does

Context window management ensures the agent stays within the LLM's token limit by truncating, summarizing, or compressing conversation history and tool outputs before they exceed the maximum.

Steps

Count tokens before sending. Use a tokenizer to estimate usage.

import tiktoken

def count_tokens(text: str, model: str = "cl100k_base") -> int:
    encoder = tiktoken.get_encoding(model)
    return len(encoder.encode(text))

def estimate_message_tokens(messages: list[dict]) -> int:
    total = 0
    for msg in messages:
        total += count_tokens(msg.get("content", ""))
        total += 4  # overhead per message
    return total

Implement a sliding context window. Keep only the most recent messages within the limit.

from collections import deque

class ContextManager:
    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens
        self.messages = deque()
        self.system_prompt = ""

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._trim()

    def _trim(self):
        while estimate_message_tokens(list(self.messages)) > self.max_tokens:
            self.messages.popleft()  # Remove oldest

Summarize old messages instead of dropping them. Preserve information.

def summarize_history(messages: list[dict], llm) -> str:
    text = "\n".join(f"{m['role']}: {m['content']}" for m in messages[:-5])
    summary = llm.invoke(f"Summarize this conversation concisely:\n{text}")
    return summary.content

# Replace old messages with summary
old_msgs = list(context.messages)[:-5]
summary = summarize_history(old_msgs, llm)
context.messages.clear()
context.add_message("system", f"Previous conversation summary: {summary}")
for m in recent_msgs:
    context.add_message(m["role"], m["content"])

Truncate tool outputs. Large tool results are the main context consumer.

MAX_TOOL_OUTPUT_CHARS = 2000

def truncate_tool_output(output: str, max_chars: int = MAX_TOOL_OUTPUT_CHARS) -> str:
    if len(output) <= max_chars:
        return output
    return output[:max_chars] + f"\n... [truncated, {len(output)} total chars]"

Monitor token usage in real-time. Log warnings when approaching the limit.

def check_context_usage(messages: list, limit: int, warn_at: float = 0.8):
    used = estimate_message_tokens(messages)
    ratio = used / limit
    if ratio > warn_at:
        logger.warning(f"Context at {ratio:.0%} of limit ({used}/{limit})")
    return used

Verification

python -c "
import tiktoken
enc = tiktoken.get_encoding('cl100k_base')
tokens = enc.encode('Hello, world!')
print(len(tokens))
# Expected: 4 (approximately)
"

Common failures

Tokenizer mismatch. Using cl100k_base for a model that uses r50k_base gives inaccurate counts. Use the correct tokenizer for your model.
System prompt too large. A verbose system prompt (e.g., 2K tokens) leaves little room for conversation. Keep system prompts under 500 tokens.
Sliding window drops the system prompt. Popping the oldest message may remove the system message. Reserve the system prompt position.
Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.

Related guides

How to Implement Agent Memory (Short and Long Term)
How to Implement Agent Reflection and Self-Correction