HOW-TO · RAG
How to Implement Agent Memory (Short and Long Term)
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES
Agent framework, vector store for long-term memory, Python 3.10+
What this does
Short-term memory holds the current conversation within the context window. Long-term memory persists important facts across sessions using a vector store, enabling the agent to recall user preferences and past interactions.
Steps
- Implement short-term memory as a sliding window. Keep the most recent N messages.
from collections import deque
class ShortTermMemory:
def __init__(self, max_messages: int = 10):
self.messages = deque(maxlen=max_messages)
def add(self, role: str, content: str):
self.messages.append({"role": role, "content": content})
def get_context(self) -> list[dict]:
return list(self.messages)
def summarize(self, llm) -> str:
text = "\n".join(m["content"] for m in self.messages)
summary = llm.invoke(f"Summarize this conversation:\n{text}")
return summary.content
- Build long-term memory with a vector store. Store facts as embeddings.
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings
from langchain.schema import Document
class LongTermMemory:
def __init__(self, collection_name: str = "agent_memory"):
self.embeddings = OllamaEmbeddings(model="nomic-embed-text")
self.store = Chroma(
embedding_function=self.embeddings,
collection_name=collection_name
)
def remember(self, fact: str, metadata: dict = None):
doc = Document(page_content=fact, metadata=metadata or {})
self.store.add_documents([doc])
def recall(self, query: str, k: int = 5) -> list[str]:
docs = self.store.similarity_search(query, k=k)
return [d.page_content for d in docs]
- Decide what to store long-term. Use the LLM to extract salient facts.
def extract_facts(conversation: str, llm) -> list[str]:
prompt = f"""Extract important facts from this conversation that should be remembered long-term.
Return each fact on a new line.
Conversation:
{conversation}"""
response = llm.invoke(prompt)
return [f.strip() for f in response.content.split("\n") if f.strip()]
- Integrate both memory types into the agent loop.
class MemoryAugmentedAgent:
def __init__(self, llm, tools, stm: ShortTermMemory, ltm: LongTermMemory):
self.llm = llm
self.tools = tools
self.stm = stm
self.ltm = ltm
def run(self, user_input: str) -> str:
# Retrieve relevant long-term memories
memories = self.ltm.recall(user_input)
memory_context = "\n".join(f"Past fact: {m}" for m in memories)
# Build prompt with both memory types
prompt = f"{memory_context}\n\nConversation:\n"
for msg in self.stm.get_context():
prompt += f"{msg['role']}: {msg['content']}\n"
prompt += f"user: {user_input}"
response = self.llm.invoke(prompt)
# Update short-term memory
self.stm.add("user", user_input)
self.stm.add("assistant", response.content)
# Periodically extract long-term facts
if len(self.stm.messages) >= 5:
facts = extract_facts(prompt, self.llm)
for fact in facts:
self.ltm.remember(fact)
return response.content
Verification
python -c "
from collections import deque
stm = deque(maxlen=3)
stm.append('msg1')
stm.append('msg2')
stm.append('msg3')
stm.append('msg4')
print(len(stm))
# Expected: 3 (sliding window)
"
Common failures
- Memory staleness. Long-term memory stores facts that become outdated. Add an expiration timestamp or allow the user to delete memories.
- Context window overflow from short-term memory. Even a sliding window of 20 messages can overflow a small context. Store summaries instead of raw messages.
- Fact extraction captures trivial details. The LLM stores "User said 'OK'" as a fact. Use a stricter extraction prompt that requires informational value.
- Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
- Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
Related guides
- How to Use Vector Store as Agent Memory
- How to Manage Agent Context Window Limits