10. Memory: Summary
Conversation history grows without bound unless you compress it. Summary memory puts a condensed representation of past messages into the prompt, trading perfect recall for a prompt that stays roughly bounded in size. This matters when running models with 4K-8K context windows, where you cannot afford to feed 200 previous turns into every inference call.
LangChain provides ConversationSummaryMemory, which updates a running summary after each conversation turn. On every save_context() call it sends the previous summary plus the newest messages to an LLM and stores the returned summary string. Only that summary, not the raw transcript, is handed back for the next prompt.
from langchain.memory import ConversationSummaryMemory
from langchain_ollama import ChatOllama

# Assumes a local Ollama server with the llama3.2 model already pulled
llm = ChatOllama(model="llama3.2", base_url="http://localhost:11434")
memory = ConversationSummaryMemory(llm=llm)

# Each save_context() call makes one LLM call that updates the summary
memory.save_context({"input": "I'm working on a RAG pipeline"}, {"output": "Good choice. What chunk size?"})
memory.save_context({"input": "512 tokens"}, {"output": "Solid for retrieval balance."})

# Check what actually got stored
print(memory.chat_memory.messages)  # Raw messages are still present here
print(memory.buffer)  # Current summary string
print(memory.load_memory_variables({}))  # {"history": <summary string>}
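The distinction between chat_memory.messages and load_memory_variables({}) trips people up. Messages stay raw in the chat history and are never pruned by ConversationSummaryMemory, so the in-process transcript still grows even though the prompt does not. The summary lives in memory.buffer and is updated inside save_context(), which is where the LLM call happens. Calling load_memory_variables({}) does not rebuild anything; it returns the current summary under the "history" key.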
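The update cycle can be sketched without LangChain or a model server. The following is a minimal illustration, not LangChain's implementation: RunningSummary and toy_summarize are hypothetical names, and toy_summarize is a stand-in that clips and concatenates text where a real setup would call an LLM.

```python
from typing import Callable, Dict, List, Tuple


def toy_summarize(previous_summary: str, new_lines: List[str]) -> str:
    """Hypothetical stand-in for an LLM call: clip the new lines and append them."""
    clipped = "; ".join(line[:40] for line in new_lines)
    return clipped if not previous_summary else f"{previous_summary} | {clipped}"


class RunningSummary:
    """Keeps the raw transcript and a summary that is updated on every save."""

    def __init__(self, summarize: Callable[[str, List[str]], str]) -> None:
        self.summarize = summarize
        self.messages: List[Tuple[str, str]] = []  # raw transcript, never pruned
        self.buffer = ""  # current summary string

    def save_context(self, user_text: str, ai_text: str) -> None:
        self.messages.append(("human", user_text))
        self.messages.append(("ai", ai_text))
        # Only the previous summary and the two newest messages are summarized.
        new_lines = [f"{role}: {text}" for role, text in self.messages[-2:]]
        self.buffer = self.summarize(self.buffer, new_lines)

    def load_memory_variables(self) -> Dict[str, str]:
        # Reading does no work; it returns what save_context already computed.
        return {"history": self.buffer}


memory = RunningSummary(toy_summarize)
memory.save_context("I'm working on a RAG pipeline", "Good choice. What chunk size?")
memory.save_context("512 tokens", "Solid for retrieval balance.")

print(len(memory.messages))
print(memory.load_memory_variables()["history"])
```

The point of the sketch is where the work happens: the summary changes at save time, and the raw message list keeps growing alongside it.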
For production systems, use ConversationSummaryBufferMemory when you need both recency and compression. It maintains a rolling window of raw messages and summarizes older content.
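from langchain.memory import ConversationSummaryBufferMemory

memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=1000,  # Summarize once the raw buffer exceeds this many tokens
    return_messages=True,
)

# Pad each turn so that 15 exchanges clearly exceed the 1000-token limit;
# very short messages would never trigger summarization.
# Token counting goes through the LLM wrapper and may need an extra tokenizer package.
filler = "details about the retrieval pipeline " * 20
for i in range(15):
    memory.save_context({"input": f"Turn {i}: {filler}"}, {"output": f"Response {i}: {filler}"})

memory_vars = memory.load_memory_variables({})
print(len(memory_vars["history"]))  # Fewer than 15 * 2 once older turns are summarized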
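The rolling-window behavior can be sketched in plain Python. This is an illustration under stated assumptions, not LangChain's code: SummaryBuffer, count_tokens, and toy_summarize are hypothetical names, tokens are approximated by whitespace-separated words, and the summarizer is a stub in place of an LLM call.

```python
from typing import List


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    return len(text.split())


def toy_summarize(previous_summary: str, dropped: List[str]) -> str:
    # Hypothetical stand-in for an LLM summarization call.
    note = f"[{len(dropped)} older message(s) condensed]"
    return f"{previous_summary} {note}".strip()


class SummaryBuffer:
    """Raw recent messages under a token budget, plus a summary of older ones."""

    def __init__(self, max_token_limit: int) -> None:
        self.max_token_limit = max_token_limit
        self.summary = ""
        self.buffer: List[str] = []

    def save_context(self, user_text: str, ai_text: str) -> None:
        self.buffer += [user_text, ai_text]
        self._prune()

    def _prune(self) -> None:
        dropped: List[str] = []
        # Move the oldest messages out until the raw buffer fits the budget.
        while self.buffer and sum(count_tokens(m) for m in self.buffer) > self.max_token_limit:
            dropped.append(self.buffer.pop(0))
        if dropped:
            self.summary = toy_summarize(self.summary, dropped)

    def load_history(self) -> List[str]:
        head = [f"summary: {self.summary}"] if self.summary else []
        return head + self.buffer


memory = SummaryBuffer(max_token_limit=20)
for i in range(15):
    memory.save_context(f"Turn {i} question text", f"Response {i} answer text")

history = memory.load_history()
raw_tokens = sum(count_tokens(m) for m in memory.buffer)
print(len(history), raw_tokens)
```

Note that in this sketch the budget applies only to the raw buffer. The summary entry sits on top of it, so the full history can be larger than the limit.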
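The max_token_limit parameter caps the raw-message buffer. After each save_context() call, the oldest messages are folded into the summary until the remaining raw messages fit under the limit. As a starting point, set it to roughly 20-30% of your model's context window, which leaves room for the summary, the system prompt, any retrieved context, and the model's reply, then tune from there.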
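The 20-30% rule of thumb translates into concrete numbers as follows. The helper name raw_buffer_budget is hypothetical and exists only to make the arithmetic explicit.

```python
def raw_buffer_budget(context_window: int, fraction: float = 0.25) -> int:
    """Return a candidate max_token_limit as a fraction of the context window."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return int(context_window * fraction)


for window in (4096, 8192):
    low = raw_buffer_budget(window, 0.20)
    high = raw_buffer_budget(window, 0.30)
    print(f"{window}-token window: max_token_limit between {low} and {high}")
```

For a 4K window this gives a range of roughly 800 to 1200 tokens, and for an 8K window roughly 1600 to 2450.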
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
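Run a loop that saves 20 message pairs to ConversationSummaryBufferMemory with max_token_limit=500, using messages long enough that the total clearly exceeds 500 tokens. Print the token count of the raw buffer with llm.get_num_tokens_from_messages(memory.chat_memory.messages) and verify it stays at or below 500. Then inspect memory.load_memory_variables({})["history"]: it also contains the running summary, which is not counted against max_token_limit, so the full history can be somewhat larger than the raw buffer.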