How to set up agent memory with vector databases
Vector database (ChromaDB/Faiss), agent framework
What this does
Backing agent memory with a vector database gives AI agents semantic memory that persists across sessions. Instead of relying solely on the conversation window, whose context length is limited, the agent stores past interactions, facts, and user preferences as vector embeddings in a database. When a new query arrives, the agent embeds it, retrieves the most semantically similar past memories, and includes them in the context. This lets the agent recall previous conversations, maintain user-specific knowledge, and build a growing knowledge base.
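To make "semantically similar" concrete, here is a minimal sketch of similarity-based retrieval in plain Python. The three-dimensional vectors are hand-made stand-ins for real embeddings, and the helper names (cosine_similarity, top_k, MEMORIES) are hypothetical; a vector database performs the same ranking with an approximate index rather than a full sort.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, memories, k=2):
    """Return the texts of the k memories most similar to query_vec.

    memories is a list of (text, vector) pairs.
    """
    ranked = sorted(
        memories,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy vectors standing in for real embeddings (assumption for illustration).
MEMORIES = [
    ("User prefers Rust for systems work", [0.9, 0.1, 0.0]),
    ("User lives in Lisbon", [0.0, 0.2, 0.9]),
    ("User is learning Go", [0.8, 0.3, 0.1]),
]

query = [1.0, 0.2, 0.0]  # pretend embedding of a programming-language question
print(top_k(query, MEMORIES))
```

The two programming-related memories rank above the unrelated one because their vectors point in nearly the same direction as the query vector.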
Steps
Install dependencies: pip install chromadb langchain-ollama.
Initialize the vector database client: import chromadb; client = chromadb.PersistentClient(path="./agent_memory").
Create a collection: collection = client.get_or_create_collection(name="agent_memories", metadata={"hnsw:space": "cosine"}).
Set up the embedding function: from langchain_ollama import OllamaEmbeddings; embeddings = OllamaEmbeddings(model="nomic-embed-text"). This requires a running Ollama server with the nomic-embed-text model pulled.
Implement a memory manager with two operations, shown here as plain functions; wrap them in a class to call them as memory_manager.store_memory(...). Import the helpers they need first: from uuid import uuid4; from datetime import datetime.
Store: def store_memory(session_id: str, content: str, metadata: dict): embedding = embeddings.embed_query(content); collection.add(documents=[content], embeddings=[embedding], metadatas=[{**metadata, "session_id": session_id}], ids=[f"{session_id}_{uuid4()}"]). The session_id is written into the metadata so that the retrieval filter has a field to match.
Retrieve: def retrieve_memories(session_id: str, query: str, k: int = 5): embedding = embeddings.embed_query(query); results = collection.query(query_embeddings=[embedding], n_results=k, where={"session_id": session_id}); return results["documents"][0].
Integrate into the agent's processing pipeline. Before calling the LLM, retrieve relevant memories: past_context = memory_manager.retrieve_memories(session_id, user_query).
Format memories as a prefix to the system prompt: system_prompt = f"Past relevant context:\n{chr(10).join(past_context)}\n\nCurrent conversation:\n".
After the agent generates a response, store the interaction: memory_manager.store_memory(session_id, f"User: {user_query}\nAssistant: {response}", {"timestamp": datetime.now().isoformat(), "type": "conversation"}).
For user facts and preferences, store them separately with {"type": "user_fact"} metadata and always include them regardless of query similarity, for example by fetching them with collection.get(where={"type": "user_fact"}) instead of a similarity query.
Implement memory cleanup: periodically delete memories older than 90 days or trim the collection when it exceeds 10,000 entries.
Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.
Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.
Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.
Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
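The store, retrieve, and prompt-formatting steps above can be sketched as a small class. This is one possible arrangement, not the only one: the class takes the collection and an embedding callable as constructor arguments so it can be exercised without a running database, and it writes session_id into each memory's metadata so that the where filter has something to match. The names MemoryManager and build_system_prompt are illustrative.

```python
from datetime import datetime
from uuid import uuid4


class MemoryManager:
    """Stores and retrieves memories through a Chroma-style collection.

    collection must provide add() and query(); embed maps a string to a
    vector (for example, the embed_query method of OllamaEmbeddings).
    """

    def __init__(self, collection, embed):
        self.collection = collection
        self.embed = embed

    def store_memory(self, session_id, content, metadata=None):
        # Default timestamp first, caller metadata next, session_id last so
        # the filter field cannot be overwritten by caller metadata.
        meta = {
            "timestamp": datetime.now().isoformat(),
            **(metadata or {}),
            "session_id": session_id,
        }
        memory_id = f"{session_id}_{uuid4()}"
        self.collection.add(
            ids=[memory_id],
            documents=[content],
            embeddings=[self.embed(content)],
            metadatas=[meta],
        )
        return memory_id

    def retrieve_memories(self, session_id, query, k=5):
        results = self.collection.query(
            query_embeddings=[self.embed(query)],
            n_results=k,
            where={"session_id": session_id},
        )
        # query() returns one list of documents per query embedding.
        return results["documents"][0]


def build_system_prompt(past_context):
    """Format retrieved memories as a prefix to the system prompt."""
    joined = "\n".join(past_context)
    return f"Past relevant context:\n{joined}\n\nCurrent conversation:\n"
```

With the objects created in the steps, this would be wired up as memory_manager = MemoryManager(collection, embeddings.embed_query). Passing the dependencies in also makes it straightforward to substitute a different embedding model or an in-memory stand-in for tests.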
Verification
Send a query containing specific information: "My favorite programming language is Rust." In a new session, send "What is my favorite programming language?" and verify the agent retrieves and references the stored fact. Because retrieval filters on session_id, this cross-session check only passes if the fact is stored as a user_fact that is fetched without the session filter, or if both sessions share the same session_id. Check the vector database directly: collection.count() should return a positive integer. Inspect the collection with collection.peek(limit=5) and verify that the stored documents contain the correct content and metadata. Test retrieval relevance: store 10 diverse memories, then query with a specific topic; the top results should be semantically related to it.
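The direct database checks can be scripted. The sketch below assumes a Chroma-style collection exposing count() and peek(limit=...), and assumes memories carry the timestamp and type metadata keys used in the steps; check_collection is a hypothetical helper name. Note that peek() only samples a few entries, so a missing document in the sample does not prove it is missing from the collection.

```python
def check_collection(collection, expected_text,
                     required_keys=("timestamp", "type"), sample_size=5):
    """Return a list of problems found; an empty list means all checks passed."""
    if collection.count() <= 0:
        return ["collection is empty"]

    problems = []
    sample = collection.peek(limit=sample_size)
    documents = sample.get("documents") or []
    metadatas = sample.get("metadatas") or []

    # Content check: at least one sampled document mentions the stored fact.
    if not any(expected_text in (doc or "") for doc in documents):
        problems.append(f"no sampled document contains {expected_text!r}")

    # Metadata check: every sampled entry carries the expected keys.
    for memory_id, meta in zip(sample["ids"], metadatas):
        missing = [key for key in required_keys if key not in (meta or {})]
        if missing:
            problems.append(f"{memory_id} is missing metadata keys: {missing}")

    return problems
```

Calling check_collection(collection, "Rust") after the first test message gives a quick pass or fail signal before moving on to the cross-session test.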
Common failures
- Embedding dimension mismatch - The embedding model produces vectors of a fixed dimension (e.g., 768 for nomic-embed-text), and the collection's dimension is set by the first vectors added to it; if you switch embedding models, write to a new collection or re-embed the existing memories.
- Stale memory pollution - Old or incorrect memories degrade retrieval quality; implement recency weighting or a maximum-age filter in the retrieval query.
- Empty retrieval results - Check that the session_id in the where filter matches the value stored in each memory's metadata; for global memories, remove the filter temporarily.
- Unbounded storage growth - Implement TTL-based cleanup with a cron job that runs collection.delete(ids=expired_ids).
- Embedding call latency slows down conversations - Move memory storage off the response path so it runs after the response is sent; use asyncio.create_task() for non-blocking writes, wrapping blocking calls with asyncio.to_thread().
- Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
- Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
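Two of the mitigations above, TTL-based cleanup and non-blocking writes, can be sketched as follows. The sketch assumes each memory's metadata holds an ISO-format timestamp string written with datetime.now().isoformat(), as in the steps, and a Chroma-style collection with get(include=...) and delete(ids=...). Ages are compared in Python because the timestamps are strings rather than numbers. The function names are hypothetical, and asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio
from datetime import datetime, timedelta


def find_expired_ids(ids, metadatas, max_age_days=90, now=None):
    """Return the ids whose ISO timestamp is older than max_age_days."""
    cutoff = (now or datetime.now()) - timedelta(days=max_age_days)
    expired = []
    for memory_id, meta in zip(ids, metadatas):
        stamp = (meta or {}).get("timestamp")
        if stamp and datetime.fromisoformat(stamp) < cutoff:
            expired.append(memory_id)
    return expired


def cleanup_expired(collection, max_age_days=90, now=None):
    """Delete expired memories and return how many were removed."""
    rows = collection.get(include=["metadatas"])  # fetches every entry
    expired_ids = find_expired_ids(rows["ids"], rows["metadatas"], max_age_days, now)
    if expired_ids:  # skip the delete call when nothing has expired
        collection.delete(ids=expired_ids)
    return len(expired_ids)


def schedule_store(store_fn, *args):
    """Run a blocking store call in a worker thread without blocking the caller.

    Must be called from inside a running event loop. Keep a reference to the
    returned task so it is not garbage-collected before it finishes.
    """
    return asyncio.create_task(asyncio.to_thread(store_fn, *args))
```

Fetching every entry is acceptable at the 10,000-entry scale mentioned in the steps; for much larger collections, storing a numeric epoch timestamp and filtering on the database side would likely be a better fit.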
Related guides
- build-langgraph-agent-scratch
- build-rag-evaluation-pipeline
- build-code-generation-agent-local-models