HOW-TO · RAG
How to Manage Agent Context Window Limits
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES
Agent deployed, context length awareness, Python 3.10+
What this does
Context window management ensures the agent stays within the LLM's token limit by truncating, summarizing, or compressing conversation history and tool outputs before they exceed the maximum.
Steps
- Count tokens before sending. Use a tokenizer to estimate usage.
import tiktoken
def count_tokens(text: str, model: str = "cl100k_base") -> int:
encoder = tiktoken.get_encoding(model)
return len(encoder.encode(text))
def estimate_message_tokens(messages: list[dict]) -> int:
total = 0
for msg in messages:
total += count_tokens(msg.get("content", ""))
total += 4 # overhead per message
return total
- Implement a sliding context window. Keep only the most recent messages within the limit.
from collections import deque
class ContextManager:
def __init__(self, max_tokens: int = 4000):
self.max_tokens = max_tokens
self.messages = deque()
self.system_prompt = ""
def add_message(self, role: str, content: str):
self.messages.append({"role": role, "content": content})
self._trim()
def _trim(self):
while estimate_message_tokens(list(self.messages)) > self.max_tokens:
self.messages.popleft() # Remove oldest
- Summarize old messages instead of dropping them. Preserve information.
def summarize_history(messages: list[dict], llm) -> str:
text = "\n".join(f"{m['role']}: {m['content']}" for m in messages[:-5])
summary = llm.invoke(f"Summarize this conversation concisely:\n{text}")
return summary.content
# Replace old messages with summary
old_msgs = list(context.messages)[:-5]
summary = summarize_history(old_msgs, llm)
context.messages.clear()
context.add_message("system", f"Previous conversation summary: {summary}")
for m in recent_msgs:
context.add_message(m["role"], m["content"])
- Truncate tool outputs. Large tool results are the main context consumer.
MAX_TOOL_OUTPUT_CHARS = 2000
def truncate_tool_output(output: str, max_chars: int = MAX_TOOL_OUTPUT_CHARS) -> str:
if len(output) <= max_chars:
return output
return output[:max_chars] + f"\n... [truncated, {len(output)} total chars]"
- Monitor token usage in real-time. Log warnings when approaching the limit.
def check_context_usage(messages: list, limit: int, warn_at: float = 0.8):
used = estimate_message_tokens(messages)
ratio = used / limit
if ratio > warn_at:
logger.warning(f"Context at {ratio:.0%} of limit ({used}/{limit})")
return used
Verification
python -c "
import tiktoken
enc = tiktoken.get_encoding('cl100k_base')
tokens = enc.encode('Hello, world!')
print(len(tokens))
# Expected: 4 (approximately)
"
Common failures
- Tokenizer mismatch. Using
cl100k_basefor a model that usesr50k_basegives inaccurate counts. Use the correct tokenizer for your model. - System prompt too large. A verbose system prompt (e.g., 2K tokens) leaves little room for conversation. Keep system prompts under 500 tokens.
- Sliding window drops the system prompt. Popping the oldest message may remove the system message. Reserve the system prompt position.
- Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
- Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
Related guides
- How to Implement Agent Memory (Short and Long Term)
- How to Implement Agent Reflection and Self-Correction