Conversation History — Introduction to AI Agents (Chapter 11)

Conversation history is the raw material that feeds the agent's next reasoning step. How you format, store, and retrieve it directly affects how well the agent maintains context and avoids repeating itself.

Message role conventions

Standard roles:

system: Global instructions and persona (never dropped unless context is full)
user: Human input
assistant: Model responses, including tool calls
tool: Results from tool invocations
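
The role list above can be enforced when history entries are built. This is a minimal sketch; `VALID_ROLES` and `make_message` are hypothetical names introduced for illustration and are not part of any library.

```python
VALID_ROLES = {"system", "user", "assistant", "tool"}  # the four roles listed above

def make_message(role: str, content: str) -> dict:
    """Build a history entry, rejecting roles outside the standard set."""
    if role not in VALID_ROLES:
        raise ValueError(f"Unknown role: {role!r}")
    return {"role": role, "content": content}

history = [
    make_message("system", "You are a helpful research assistant."),
    make_message("user", "What is 15% of 200?"),
]
```

Rejecting unknown roles early catches typos such as "assitant" before they reach the model.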

Preserving tool call context

When the model calls a tool, append both the assistant message that carries the call and the tool's result to the history. Without both, the model cannot tell on its next step which tool it requested or what came back:

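import ollama

# tool_schemas is assumed to be defined elsewhere as a list of tool definitions.
messages = [
    {"role": "system", "content": "You are a helpful research assistant."},
    {"role": "user", "content": "What is 15% of 200?"}
]

# Step 1: the model responds with a calculator tool call
response = ollama.chat(model="llama3.2", messages=messages, tools=tool_schemas)
messages.append({
    "role": "assistant",
    "content": response.message.content or "",  # Usually empty when the model only calls a tool
    "tool_calls": response.message.tool_calls
})

# Step 2: append the tool result
# The ollama client's tool calls carry a function name and arguments rather than an id,
# so the result is linked by tool name (assumes a client version that accepts "tool_name").
messages.append({
    "role": "tool",
    "tool_name": response.message.tool_calls[0].function.name,
    "content": "30.0"  # Result of running the calculator on the model's arguments
})

# Step 3: the model produces a follow-up call or the final answer
response = ollama.chat(model="llama3.2", messages=messages, tools=tool_schemas)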
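
The hard-coded "30.0" in the example stands in for actually running the tool. Below is a minimal sketch of that step, with no model call involved. The `percent_of` tool, the `TOOLS` registry, and `run_tool_call` are hypothetical names, and the `tool_name` key is an assumption about how the result is linked to the call; adapt it to whatever your client expects.

```python
def percent_of(percent: float, value: float) -> float:
    """Hypothetical calculator tool: returns percent% of value."""
    return percent * value / 100

TOOLS = {"percent_of": percent_of}  # hypothetical registry: tool name -> callable

def run_tool_call(name: str, arguments: dict) -> dict:
    """Execute one tool call and package the result as a history entry."""
    if name not in TOOLS:
        # Report the failure to the model instead of raising
        content = f"Error: unknown tool {name!r}"
    else:
        content = str(TOOLS[name](**arguments))
    return {"role": "tool", "tool_name": name, "content": content}

# Stand-in for the name and arguments the model would return in its tool call
tool_message = run_tool_call("percent_of", {"percent": 15, "value": 200})
print(tool_message["content"])  # 30.0
```

Returning an error string as the tool result, rather than raising, keeps the failure visible in the history so the model can react to it.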

Context window management

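Every model has a context window limit. Llama 3.1 8B supports up to 128K tokens, but on local hardware the usable window is often much smaller, because the memory needed for the key-value cache grows with context length. Runtimes such as Ollama also default to a smaller window than the model's maximum unless configured otherwise. Monitor usage before each request: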

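def count_tokens(messages: list, tokenizer) -> int:
    """Approximate the token count of the message contents.

    Ignores chat-template overhead and tool-call arguments, so it undercounts slightly.
    """
    total = 0
    for msg in messages:
        # Tool-call messages may have empty or missing content
        total += len(tokenizer.encode(msg.get("content") or ""))
    return total

# Before making a request (tokenizer is assumed to match the model being served)
# The 120000 threshold assumes the full 128K window is actually configured.
if count_tokens(messages, tokenizer) > 120000:
    print("Warning: approaching context limit")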
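
A warning alone does not bring the history back under the limit. One common response, sketched here under stated assumptions, is to drop the oldest non-system messages until the history fits. `trim_history` and `rough_count` are hypothetical helpers, and `rough_count` counts words only as a stand-in for a real tokenizer.

```python
def trim_history(messages: list, max_tokens: int, count_fn) -> list:
    """Drop the oldest non-system messages until the history fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and count_fn(system + rest) > max_tokens:
        rest.pop(0)
        # Avoid leaving a tool result whose originating call was just dropped
        while rest and rest[0]["role"] == "tool":
            rest.pop(0)
    return system + rest

def rough_count(messages: list) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return sum(len((m.get("content") or "").split()) for m in messages)

history = [
    {"role": "system", "content": "You are a helpful research assistant."},
    {"role": "user", "content": "one two three four"},
    {"role": "assistant", "content": "five six seven"},
    {"role": "user", "content": "eight nine"},
]
trimmed = trim_history(history, max_tokens=12, count_fn=rough_count)
print([m["role"] for m in trimmed])  # ['system', 'assistant', 'user']
```

The system message is always kept, and the input list is left unmodified so the full history can still be stored or summarized elsewhere.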

Hierarchical history

For very long sessions, use a two-tier history:

Recent turns stored verbatim in a rolling buffer
Older turns summarized and stored as a compressed context block

class HierarchicalMemory:
    def __init__(self, window_size: int = 10, summary_threshold: int = 30):
        self.recent = []  # Rolling buffer
        self.summary = "No prior context."
        self.window_size = window_size
        self.summary_threshold = summary_threshold
        self.turn_count = 0
    
    def add(self, user_msg: str, assistant_msg: str):
        self.recent.append({"user": user_msg, "assistant": assistant_msg})
        self.turn_count += 1
        
        if len(self.recent) > self.window_size:
            self.recent.pop(0)
        
        # Refresh the summary every summary_threshold turns, not only the first time
        if self.turn_count % self.summary_threshold == 0:
            self._generate_summary()
    
    def get_context(self) -> str:
        return f"Summary of earlier conversation: {self.summary}\n\nRecent turns:\n" + \
               "\n".join(f"User: {m['user']}\nAssistant: {m['assistant']}" for m in self.recent)
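
As written above, a turn popped from `recent` is no longer available by the time the summary is generated. The variant below is a minimal sketch of one way to avoid that: evicted turns are queued and folded into the summary in batches. `SummarizingMemory`, `batch_size`, and the `summarize` callable are hypothetical; in practice `summarize` would wrap a model call, while `toy_summarize` is a stand-in so the example runs without one.

```python
from typing import Callable

class SummarizingMemory:
    """Two-tier history: evicted turns are folded into the summary, not discarded."""

    def __init__(self, summarize: Callable[[str, list], str],
                 window_size: int = 10, batch_size: int = 5):
        self.summarize = summarize  # (old_summary, evicted_turns) -> new summary
        self.window_size = window_size
        self.batch_size = batch_size
        self.recent = []   # verbatim rolling buffer
        self.pending = []  # evicted turns awaiting summarization
        self.summary = "No prior context."

    def add(self, user_msg: str, assistant_msg: str) -> None:
        self.recent.append({"user": user_msg, "assistant": assistant_msg})
        if len(self.recent) > self.window_size:
            # Move the oldest turn to the pending queue instead of dropping it
            self.pending.append(self.recent.pop(0))
        if len(self.pending) >= self.batch_size:
            self.summary = self.summarize(self.summary, self.pending)
            self.pending = []

def toy_summarize(old_summary: str, turns: list) -> str:
    """Stand-in for an LLM call: records only what the user asked."""
    topics = "; ".join(t["user"] for t in turns)
    return topics if old_summary == "No prior context." else f"{old_summary}; {topics}"

memory = SummarizingMemory(toy_summarize, window_size=2, batch_size=2)
for i in range(1, 5):
    memory.add(f"question {i}", f"answer {i}")
print(memory.summary)      # question 1; question 2
print(len(memory.recent))  # 2
```

Batching the evicted turns keeps the number of summarization calls low while ensuring every turn ends up either in the buffer, in the pending queue, or in the summary.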