How to Build Conversational RAG with Memory
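Prerequisites: LangChain installed (along with the langchain-ollama, langchain-community, and chromadb packages that the code below imports) and Ollama running locally with the llama3 model available.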
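One way to confirm that Ollama is running before starting is to send a request to its HTTP API. The sketch below assumes the default address http://localhost:11434 used later in this guide and the /api/tags model-listing endpoint; the helper name ollama_reachable is hypothetical and exists only for this illustration.

```python
import http.client
import urllib.request


def ollama_reachable(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to <base_url>/api/tags succeeds with status 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, HTTP error, or malformed response.
        return False


print("Ollama reachable:", ollama_reachable())
```

If this prints False, start the Ollama server (or adjust the address) before running the steps.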
What this does
Standard RAG answers a single question at a time without remembering earlier exchanges. Conversational RAG with memory extends the pipeline by storing chat history and injecting it into every new retrieval. This enables multi-turn interactions where the system understands "the previous article" or "what you mentioned about embeddings" without requiring the user to repeat context.
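To make the idea of injecting history into retrieval concrete, here is a minimal sketch in plain Python. It only builds the text of a "rewrite the follow-up as a standalone question" prompt; in a real pipeline an LLM would answer that prompt and the result would be sent to the retriever. The template wording and the names CONDENSE_TEMPLATE, format_history, and build_condense_prompt are assumptions made for illustration, not LangChain's actual prompt or API.

```python
from typing import List, Tuple

# Hypothetical template for illustration; not LangChain's actual prompt text.
CONDENSE_TEMPLATE = (
    "Given the conversation below, rewrite the follow-up as a standalone question.\n\n"
    "Conversation:\n{history}\n\n"
    "Follow-up: {question}\n"
    "Standalone question:"
)


def format_history(turns: List[Tuple[str, str]]) -> str:
    """Render (user, assistant) turns as alternating plain-text lines."""
    lines = []
    for user_msg, ai_msg in turns:
        lines.append(f"Human: {user_msg}")
        lines.append(f"Assistant: {ai_msg}")
    return "\n".join(lines)


def build_condense_prompt(turns: List[Tuple[str, str]], question: str) -> str:
    """Combine stored history with the new question into one prompt string."""
    return CONDENSE_TEMPLATE.format(history=format_history(turns), question=question)


# The assistant text here is a placeholder, not real model output.
turns = [("What features does the pro tier include?", "(assistant answer here)")]
prompt = build_condense_prompt(turns, "Are they billed monthly or annually?")
print(prompt)
```

The point of the rewrite step is that a follow-up such as "Are they billed monthly or annually?" carries no searchable subject on its own; the history supplies it.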
Steps
Import memory and chain components. Use LangChain's built-in conversation abstractions.

import os

# Address of the local Ollama server
os.environ["OLLAMA_BASE_URL"] = "http://localhost:11434"

from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferWindowMemory

Index documents into a vector store. The corpus provides the knowledge base.

from langchain_community.document_loaders import TextLoader

# Load the source text, split it into chunks, and embed the chunks into Chroma
docs = TextLoader("context/product_docs.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(docs)
embeddings = OllamaEmbeddings(model="llama3")
db = Chroma.from_documents(chunks, embeddings)

Create a windowed memory. ConversationBufferWindowMemory keeps only the last N turns, preventing unbounded growth.

memory = ConversationBufferWindowMemory(
    k=5,                        # number of recent exchanges to keep
    memory_key="chat_history",  # input key the chain reads history from
    output_key="answer",        # chain output to store as the AI turn
    return_messages=True,
)

Build the conversational retrieval chain. Combine the retriever, LLM, and memory.

llm = ChatOllama(model="llama3")
chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=db.as_retriever(),
    memory=memory,
    max_tokens_limit=4096,
)

Run a multi-turn conversation. Memory persists across turns automatically.

print(chain.invoke({"question": "What features does the pro tier include?"})["answer"])
print(chain.invoke({"question": "Are they billed monthly or annually?"})["answer"])
print(chain.invoke({"question": "What about the enterprise plan?"})["answer"])

Expected output: each answer is informed by the recent conversation history (up to the last 5 turns, per k=5) and the relevant retrieved context.
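The windowing behaviour described in the memory step can be illustrated without LangChain. The sketch below is a stand-in built on collections.deque, not LangChain's implementation (which may store and trim messages differently); the class name WindowedHistory and its methods are hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple


class WindowedHistory:
    """Keep only the last k (user, assistant) exchanges."""

    def __init__(self, k: int) -> None:
        # A deque with maxlen drops the oldest item once it is full.
        self.turns: Deque[Tuple[str, str]] = deque(maxlen=k)

    def add_turn(self, user_msg: str, ai_msg: str) -> None:
        self.turns.append((user_msg, ai_msg))

    def as_list(self) -> List[Tuple[str, str]]:
        return list(self.turns)


history = WindowedHistory(k=2)
for i in range(1, 5):
    history.add_turn(f"question {i}", f"answer {i}")

print(history.as_list())
# [('question 3', 'answer 3'), ('question 4', 'answer 4')]
```

With k=2, the first two exchanges are gone after the fourth turn, which is the trade-off a window makes: bounded prompt size in exchange for losing older references.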
Verification
python -c "
from langchain.memory import ConversationBufferWindowMemory
mem = ConversationBufferWindowMemory(k=3)
mem.chat_memory.add_user_message('Hello')
mem.chat_memory.add_ai_message('Hi there')
print(len(mem.chat_memory.messages))
# Expected: 2
"
Common failures
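- Context window overflow. Set max_tokens_limit to a value lower than the model's context size. This setting trims the retrieved documents to fit; the window size k is what bounds the chat history, so very long conversations need both kept in check.
- Memory not influencing retrieval. Ensure memory_key matches the input key the chain expects (chat_history for ConversationalRetrievalChain); on a mismatch the history never reaches the chain, which often surfaces as a missing input key error.
- Old turns being forgotten. With k=3, for example, only the last 3 exchanges are kept; adjust the window size based on conversation complexity.
- Stale retrieved context after memory update. The chain re-retrieves on each call using the updated chat history, so retrieved context should not lag behind memory; if it still looks stale, check that the vector store was rebuilt from the current documents.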
- Version mismatch. The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
- Local environment drift. Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
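For the last two items, a short script can print which interpreter is active and which package versions it sees. This is a minimal sketch using only the standard library; the package names listed are assumptions based on the imports in this guide, and the helper package_version is hypothetical.

```python
import sys
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a distribution, or a marker if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


# The interpreter path shows which virtual environment is actually in use.
print("python:", sys.executable)

# Distribution names assumed from the imports used in this guide.
for pkg in ("langchain", "langchain-community", "langchain-ollama", "chromadb"):
    print(f"{pkg}: {package_version(pkg)}")
```

Run it with the same python command used for the verification step, so the output reflects the environment the guide's code will actually execute in.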