RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /How-to
  5. /How to Manage Agent Context Window Limits
HOW-TO · RAG

How to Manage Agent Context Window Limits

intermediate·15 min·By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Agent deployed, context length awareness, Python 3.10+

What this does

Context window management ensures the agent stays within the LLM's token limit by truncating, summarizing, or compressing conversation history and tool outputs before they exceed the maximum.

Steps

  • Count tokens before sending. Use a tokenizer to estimate usage.
import tiktoken

def count_tokens(text: str, model: str = "cl100k_base") -> int:
    encoder = tiktoken.get_encoding(model)
    return len(encoder.encode(text))

def estimate_message_tokens(messages: list[dict]) -> int:
    total = 0
    for msg in messages:
        total += count_tokens(msg.get("content", ""))
        total += 4  # overhead per message
    return total
  • Implement a sliding context window. Keep only the most recent messages within the limit.
from collections import deque

class ContextManager:
    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens
        self.messages = deque()
        self.system_prompt = ""

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._trim()

    def _trim(self):
        while estimate_message_tokens(list(self.messages)) > self.max_tokens:
            self.messages.popleft()  # Remove oldest
  • Summarize old messages instead of dropping them. Preserve information.
def summarize_history(messages: list[dict], llm) -> str:
    text = "\n".join(f"{m['role']}: {m['content']}" for m in messages[:-5])
    summary = llm.invoke(f"Summarize this conversation concisely:\n{text}")
    return summary.content

# Replace old messages with summary
old_msgs = list(context.messages)[:-5]
summary = summarize_history(old_msgs, llm)
context.messages.clear()
context.add_message("system", f"Previous conversation summary: {summary}")
for m in recent_msgs:
    context.add_message(m["role"], m["content"])
  • Truncate tool outputs. Large tool results are the main context consumer.
MAX_TOOL_OUTPUT_CHARS = 2000

def truncate_tool_output(output: str, max_chars: int = MAX_TOOL_OUTPUT_CHARS) -> str:
    if len(output) <= max_chars:
        return output
    return output[:max_chars] + f"\n... [truncated, {len(output)} total chars]"
  • Monitor token usage in real-time. Log warnings when approaching the limit.
def check_context_usage(messages: list, limit: int, warn_at: float = 0.8):
    used = estimate_message_tokens(messages)
    ratio = used / limit
    if ratio > warn_at:
        logger.warning(f"Context at {ratio:.0%} of limit ({used}/{limit})")
    return used

Verification

python -c "
import tiktoken
enc = tiktoken.get_encoding('cl100k_base')
tokens = enc.encode('Hello, world!')
print(len(tokens))
# Expected: 4 (approximately)
"

Common failures

  • Tokenizer mismatch. Using cl100k_base for a model that uses r50k_base gives inaccurate counts. Use the correct tokenizer for your model.
  • System prompt too large. A verbose system prompt (e.g., 2K tokens) leaves little room for conversation. Keep system prompts under 500 tokens.
  • Sliding window drops the system prompt. Popping the oldest message may remove the system message. Reserve the system prompt position.
  • Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
  • Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.

Related guides

  • How to Implement Agent Memory (Short and Long Term)
  • How to Implement Agent Reflection and Self-Correction
← All how-to guidesCourses →