RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Introduction to AI Agents

11. Conversation History

Chapter 11 of 16 · 20 min
KEY INSIGHT

Conversation history is a resource with a hard limit. Use hierarchical memory to manage long sessions while preserving the most relevant information for each reasoning step.

Conversation history is the raw material that feeds the agent's next reasoning step. How you format, store, and retrieve it directly affects how well the agent maintains context and avoids repeating itself.

Message role conventions

Standard roles:

  • system: Global instructions and persona (keep this when trimming history; drop older turns first)
  • user: Human input
  • assistant: Model responses, including tool calls
  • tool: Results from tool invocations
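
The role conventions above can be checked mechanically before a history is sent to the model. The following is a minimal sketch, not part of any library: `validate_history` and `ALLOWED_ROLES` are hypothetical names, and the rule that a tool message must follow an assistant message carrying `tool_calls` is an assumption about how the history is laid out.

```python
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}

def validate_history(messages: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the history looks consistent."""
    problems = []
    pending_tool_calls = False  # True after an assistant message that requested tools
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
            continue
        if role == "tool" and not pending_tool_calls:
            problems.append(f"message {i}: tool result without a preceding tool call")
        if role == "assistant":
            # Tool results are only expected if this message requested tools
            pending_tool_calls = bool(msg.get("tool_calls"))
        elif role != "tool":
            # A user or system message ends any open tool exchange
            pending_tool_calls = False
    return problems
```

Running a check like this in development tends to catch the most common history bugs early, such as a tool result appended without the assistant message that requested it.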

Preserving tool call context

When a tool is called, include both the call metadata and the result in the history. This allows the model to understand what happened:

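import ollama

# Minimal schema for a hypothetical calculator tool (tool_schemas must be defined before use)
tool_schemas = [{
    "type": "function",
    "function": {
        "name": "calculator",
        "description": "Evaluate a simple arithmetic expression",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

messages = [
    {"role": "system", "content": "You are a helpful research assistant."},
    {"role": "user", "content": "What is 15% of 200?"}
]

# Turn 1: model calls calculator
response = ollama.chat(model="llama3.2", messages=messages, tools=tool_schemas)
tool_calls = response.message.tool_calls or []  # None when the model answers in plain text
messages.append({
    "role": "assistant",
    "content": response.message.content or "",  # Usually empty when the model only calls a tool
    "tool_calls": tool_calls
})

# Turn 2: tool result returned (hard-coded here; normally the output of running the tool)
# Assumption: recent ollama Python clients accept tool_name on tool messages.
# OpenAI-style APIs link the result with a tool_call_id instead.
for call in tool_calls:
    messages.append({
        "role": "tool",
        "tool_name": call.function.name,
        "content": "30.0"
    })

# Turn 3: follow-up or final response
response = ollama.chat(model="llama3.2", messages=messages, tools=tool_schemas)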
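
The tool result in the walkthrough above is hard-coded. In practice something has to execute each requested call and turn its output into a tool message. The following is a minimal sketch of that step using plain dictionaries so it runs without a model: `calculator`, `TOOL_REGISTRY`, and `run_tool_calls` are hypothetical names, and the `tool_name` key used to link a result to its call is an assumption that varies by API.

```python
import json

def calculator(expression: str) -> str:
    """Hypothetical tool: handles only expressions of the form 'X% of Y'."""
    percent, _, base = expression.partition("% of ")
    return str(float(percent) * float(base) / 100)

# Maps tool names (as declared in the tool schema) to Python callables
TOOL_REGISTRY = {"calculator": calculator}

def run_tool_calls(tool_calls: list[dict]) -> list[dict]:
    """Execute each requested call and return tool messages ready to append to history."""
    results = []
    for call in tool_calls:
        name = call["function"]["name"]
        args = call["function"]["arguments"]
        if isinstance(args, str):  # some APIs send arguments as a JSON string
            args = json.loads(args)
        tool = TOOL_REGISTRY.get(name)
        if tool is None:
            content = f"Error: unknown tool {name!r}"
        else:
            try:
                content = tool(**args)
            except Exception as exc:
                # Report failures to the model instead of crashing the loop
                content = f"Error: {exc}"
        # Assumed linking key; OpenAI-style APIs use tool_call_id instead
        results.append({"role": "tool", "tool_name": name, "content": content})
    return results
```

Returning errors as tool messages, rather than raising, keeps the failure in the history so the model can see it and retry or explain.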

Context window management

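Every model has a context window limit. Llama 3.1 8B supports 128K tokens, but local hardware usually constrains it further: the KV cache grows with context length, and runtimes such as Ollama apply a much smaller default window unless you raise it. Monitor usage before each request: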

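def count_tokens(messages: list, tokenizer) -> int:
    """Approximate token count; ignores chat-template and tool-call overhead."""
    total = 0
    for msg in messages:
        # content can be empty or missing on tool-call messages
        total += len(tokenizer.encode(msg.get("content") or ""))
    return total

# tokenizer is any object with an encode() method, e.g. a Hugging Face tokenizer
WARN_AT = 120_000  # Tokens; set this below the context size you actually run with

# Before making a request
if count_tokens(messages, tokenizer) > WARN_AT:
    print("Warning: approaching context limit")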
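
A warning alone does not free up space. One simple response is to drop the oldest turns until the history fits. The following is a minimal sketch under stated assumptions: `estimate_tokens` is a crude stand-in for a real tokenizer (roughly four characters per token), and `trim_to_budget` is a hypothetical helper, not part of any library.

```python
def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: about 4 characters per token."""
    return max(1, len(text) // 4) if text else 0

def trim_to_budget(messages: list[dict], budget: int) -> list[dict]:
    """Drop the oldest non-system messages until the estimate fits the budget."""
    def cost(msgs: list[dict]) -> int:
        return sum(estimate_tokens(m.get("content") or "") for m in msgs)

    # System messages are kept and moved to the front of the result
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and cost(system + rest) > budget:
        rest.pop(0)
        # Do not leave tool results whose originating call was just dropped
        while rest and rest[0]["role"] == "tool":
            rest.pop(0)
    return system + rest
```

Dropping turns outright loses information, so this works best as a fallback; summarizing older turns preserves more of the session.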

Hierarchical history

For very long sessions, use a two-tier history:

  1. Recent turns stored verbatim in a rolling buffer
  2. Older turns summarized and stored as a compressed context block
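
The second tier depends on a summarization step that folds older turns into a running summary. The following is a minimal sketch of one way to do it: `build_summary_prompt` and `summarize_turns` are hypothetical names, the prompt wording is only an example, and `generate` stands for any callable that maps a prompt string to model text (for instance a thin wrapper around a chat call).

```python
from typing import Callable

def build_summary_prompt(previous_summary: str, turns: list[dict]) -> str:
    """Build a prompt asking the model to merge new turns into the existing summary."""
    transcript = "\n".join(
        f"User: {t['user']}\nAssistant: {t['assistant']}" for t in turns
    )
    return (
        "Update the running summary of a conversation.\n\n"
        f"Current summary:\n{previous_summary}\n\n"
        f"New turns to fold in:\n{transcript}\n\n"
        "Write a concise updated summary that keeps names, decisions, and open questions."
    )

def summarize_turns(previous_summary: str, turns: list[dict],
                    generate: Callable[[str], str]) -> str:
    """Return an updated summary, or the old one unchanged when there is nothing new."""
    if not turns:
        return previous_summary
    return generate(build_summary_prompt(previous_summary, turns)).strip()
```

Passing the previous summary back in means each update is incremental, so the cost of summarizing stays bounded by the number of new turns rather than the length of the whole session.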
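
from typing import Callable

class HierarchicalMemory:
    # summarize is an assumed callable (previous_summary, turns) -> new summary,
    # for example a function that prompts the model with the evicted turns.
    def __init__(self, summarize: Callable[[str, list], str],
                 window_size: int = 10, summary_threshold: int = 30):
        self.recent = []   # Rolling buffer of verbatim turns
        self.pending = []  # Evicted turns waiting to be summarized
        self.summary = "No prior context."
        self.summarize = summarize
        self.window_size = window_size
        self.summary_threshold = summary_threshold
        self.turn_count = 0

    def add(self, user_msg: str, assistant_msg: str):
        self.recent.append({"user": user_msg, "assistant": assistant_msg})
        self.turn_count += 1

        # Keep evicted turns for the next summary instead of discarding them
        if len(self.recent) > self.window_size:
            self.pending.append(self.recent.pop(0))

        # Summarize every summary_threshold turns, not only the first time
        if self.turn_count % self.summary_threshold == 0:
            self._generate_summary()

    def _generate_summary(self):
        # Fold the evicted turns into the running summary
        if self.pending:
            self.summary = self.summarize(self.summary, self.pending)
            self.pending = []

    def get_context(self) -> str:
        # Turns still in self.pending appear in neither part until the next summary runs
        return f"Summary of earlier conversation: {self.summary}\n\nRecent turns:\n" + \
               "\n".join(f"User: {m['user']}\nAssistant: {m['assistant']}" for m in self.recent)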
EXERCISE

Build a token counter that tracks message history size. Feed it a synthetic conversation of 50+ turns and verify that the counter flags usage at 80%, 90%, and 100% of a target context window.
