11. Conversation History
Chapter 11 of 16 · 20 min
Conversation history is the raw material that feeds the agent's next reasoning step. How you format, store, and retrieve it directly affects how well the agent maintains context and avoids repeating itself.
Message role conventions
Standard roles:
- system: Global instructions and persona (never dropped unless context is full)
- user: Human input
- assistant: Model responses, including tool calls
- tool: Results from tool invocations
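One way to enforce these conventions is to validate each message as it is created, so a misspelled role fails early instead of silently confusing the model. The sketch below is a minimal illustration; `make_message` and `VALID_ROLES` are hypothetical names, not part of any library.

```python
# Hypothetical helper (not from any library): accepts only the four roles listed above.
VALID_ROLES = {"system", "user", "assistant", "tool"}


def make_message(role: str, content: str) -> dict:
    """Build a history entry, failing fast on a misspelled or unknown role."""
    if role not in VALID_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    return {"role": role, "content": content}


history = [
    make_message("system", "You are a helpful research assistant."),
    make_message("user", "What is 15% of 200?"),
]
```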
Preserving tool call context
When a tool is called, include both the call metadata and the result in the history. This allows the model to understand what happened:
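import ollama

# tool_schemas is assumed to be the list of tool definitions built earlier
messages = [
    {"role": "system", "content": "You are a helpful research assistant."},
    {"role": "user", "content": "What is 15% of 200?"}
]

# Turn 1: model calls calculator
response = ollama.chat(model="llama3.2", messages=messages, tools=tool_schemas)
tool_calls = response.message.tool_calls or []  # None if the model answered directly
messages.append({
    "role": "assistant",
    "content": response.message.content or "",  # Usually empty when a tool is called
    "tool_calls": tool_calls
})

# Turn 2: tool result returned
if tool_calls:
    messages.append({
        "role": "tool",
        # Ollama's native tool calls carry no call ID, so the result names the tool
        # (assumes a recent client version; OpenAI-style APIs expect "tool_call_id")
        "tool_name": tool_calls[0].function.name,
        "content": "30.0"  # Hard-coded here; normally the tool's real output
    })

# Turn 3: follow-up or final response
response = ollama.chat(model="llama3.2", messages=messages, tools=tool_schemas)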
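The literal "30.0" above stands in for actually running the tool. A minimal sketch of that dispatch step follows, assuming tool calls arrive as plain dicts with a `function` entry holding `name` and `arguments`. The `TOOLS` registry, `percent_of`, and `run_tool_calls` are hypothetical names for illustration, and the result key (`tool_name` here) depends on the API you target.

```python
import json


# Hypothetical tool: a plain Python function the model is allowed to call.
def percent_of(percent: float, value: float) -> float:
    return percent * value / 100


# Hypothetical registry mapping tool names to implementations.
TOOLS = {"percent_of": percent_of}


def run_tool_calls(tool_calls: list[dict]) -> list[dict]:
    """Execute each requested call and return one tool-role message per call."""
    results = []
    for call in tool_calls:
        name = call["function"]["name"]
        args = call["function"]["arguments"]
        if isinstance(args, str):  # some APIs send arguments as a JSON string
            args = json.loads(args)
        func = TOOLS.get(name)
        if func is None:
            # Report the failure to the model instead of raising
            content = f"error: unknown tool {name}"
        else:
            content = str(func(**args))
        results.append({"role": "tool", "tool_name": name, "content": content})
    return results


calls = [{"function": {"name": "percent_of", "arguments": {"percent": 15, "value": 200}}}]
tool_messages = run_tool_calls(calls)
```

Returning an error string for an unknown tool keeps the failure in the history, so the model can see it and correct itself on the next turn.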
Context window management
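Every model has a context window limit. Llama 3.1 8B supports 128K tokens, but local hardware usually constrains this: memory use grows with context length, and runtimes such as Ollama default to a much smaller window unless you raise it (the num_ctx option). Monitor usage before each request: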
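def count_tokens(messages: list, tokenizer) -> int:
    # tokenizer: any object with encode(), e.g. a Hugging Face tokenizer for the model
    total = 0
    for msg in messages:
        # content can be empty or None on tool-call messages
        total += len(tokenizer.encode(msg.get("content") or ""))
    # Undercounts slightly: chat-template tokens and tool-call arguments are not included
    return total

# Before making a request
if count_tokens(messages, tokenizer) > 120000:
    print("Warning: approaching context limit")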
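A warning only helps if something acts on it. One simple response is to drop the oldest non-system messages until the history fits a token budget. The sketch below assumes that policy. `WhitespaceTokenizer` is a stand-in so the example runs without a model, and `trim_to_budget` is a hypothetical helper. It also moves system messages to the front, which matches the usual layout.

```python
class WhitespaceTokenizer:
    """Stand-in tokenizer for illustration; real tokenizers yield different counts."""

    def encode(self, text: str) -> list[str]:
        return text.split()


def trim_to_budget(messages: list[dict], tokenizer, budget: int) -> list[dict]:
    """Drop the oldest non-system messages until the history fits the budget."""

    def size(msgs: list[dict]) -> int:
        return sum(len(tokenizer.encode(m.get("content") or "")) for m in msgs)

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and size(system + rest) > budget:
        rest.pop(0)
        # Never leave a tool result without the assistant call that produced it
        while rest and rest[0]["role"] == "tool":
            rest.pop(0)
    return system + rest


history = [
    {"role": "system", "content": "You are terse."},
    {"role": "user", "content": "first question here"},
    {"role": "assistant", "content": "first answer"},
    {"role": "user", "content": "second question"},
]
trimmed = trim_to_budget(history, WhitespaceTokenizer(), budget=6)
```

With the stand-in tokenizer the full history is 10 tokens, so the first user and assistant turns are dropped and only the system message and the latest user turn remain.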
Hierarchical history
For very long sessions, use a two-tier history:
- Recent turns stored verbatim in a rolling buffer
- Older turns summarized and stored as a compressed context block
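class HierarchicalMemory:
    def __init__(self, window_size: int = 10, summary_threshold: int = 30):
        self.recent = []   # Rolling buffer of verbatim turns
        self.pending = []  # Evicted turns not yet folded into the summary
        self.summary = "No prior context."
        self.window_size = window_size
        self.summary_threshold = summary_threshold
        self.turn_count = 0

    def add(self, user_msg: str, assistant_msg: str):
        self.recent.append({"user": user_msg, "assistant": assistant_msg})
        self.turn_count += 1
        if len(self.recent) > self.window_size:
            # Keep evicted turns so they can be summarized instead of lost
            self.pending.append(self.recent.pop(0))
        # Re-summarize every summary_threshold turns, not only the first time
        if self.turn_count % self.summary_threshold == 0:
            self._generate_summary()

    def _generate_summary(self):
        # Placeholder (an assumption): replace with an LLM summarization call
        if not self.pending:
            return
        topics = "; ".join(m["user"] for m in self.pending)
        prior = "" if self.summary == "No prior context." else self.summary + " "
        self.summary = f"{prior}Earlier user messages: {topics}"
        self.pending.clear()

    def get_context(self) -> str:
        return f"Summary of earlier conversation: {self.summary}\n\nRecent turns:\n" + \
            "\n".join(f"User: {m['user']}\nAssistant: {m['assistant']}" for m in self.recent)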
EXERCISE
Build a token counter that tracks message history size. Insert a random conversation of 50+ turns and verify the counter correctly flags approaching limits at 80%, 90%, and 100% of a target context window.