What this does

Standard RAG answers one question at a time and does not remember earlier exchanges. Conversational RAG with memory extends the pipeline by storing chat history and using it on every new turn: the chain rewrites each follow-up against the history into a standalone question before it retrieves. This enables multi-turn interactions where the system understands "the previous article" or "what you mentioned about embeddings" without requiring the user to repeat context.

Steps

Import memory and chain components. Use LangChain's built-in conversation abstractions.

import os
os.environ["OLLAMA_BASE_URL"] = "http://localhost:11434"

from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferWindowMemory

Index documents into a vector store. The corpus provides the knowledge base.

from langchain_community.document_loaders import TextLoader

docs = TextLoader("context/product_docs.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(docs)
embeddings = OllamaEmbeddings(model="llama3")
db = Chroma.from_documents(chunks, embeddings)

Create a windowed memory. ConversationBufferWindowMemory passes only the last k exchanges (a user message plus the reply) to the chain, which keeps the prompt from growing without bound.

memory = ConversationBufferWindowMemory(
    k=5,
    memory_key="chat_history",
    output_key="answer",
    return_messages=True,
)
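
The windowing idea can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: WindowedHistory and its methods are hypothetical names, and the sketch discards old exchanges outright, whereas the real class may keep the full message list and only slice it when loading.

```python
from collections import deque


class WindowedHistory:
    """Hypothetical stand-in for a windowed chat memory (not a LangChain class)."""

    def __init__(self, k: int) -> None:
        # Each entry is one exchange: (user message, AI reply).
        self._turns = deque(maxlen=k)

    def save_turn(self, user: str, ai: str) -> None:
        # Appending to a full deque drops the oldest exchange.
        self._turns.append((user, ai))

    def load(self) -> list[tuple[str, str]]:
        # Return the retained exchanges, oldest first.
        return list(self._turns)


history = WindowedHistory(k=2)
history.save_turn("What does the pro tier include?", "Feature list ...")
history.save_turn("Is it billed monthly?", "Billing answer ...")
history.save_turn("What about enterprise?", "Enterprise answer ...")
print(history.load())
```

With k=2, the first exchange is gone by the time the third is saved, which is the same trade-off the k=5 setting makes at a larger scale.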

Build the conversational retrieval chain. Combine the retriever, LLM, and memory.

llm = ChatOllama(model="llama3")
chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=db.as_retriever(),
    memory=memory,
    max_tokens_limit=4096,
)
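
In this chain, max_tokens_limit applies to the retrieved documents rather than to the chat history. A minimal sketch of that kind of trimming follows, assuming documents arrive ordered most relevant first. The names count_tokens and trim_to_token_limit are hypothetical, and the whitespace-based counter is a stand-in for the model's real token counter.

```python
def count_tokens(text: str) -> int:
    # Hypothetical counter: whitespace-separated words, not real model tokens.
    return len(text.split())


def trim_to_token_limit(docs: list[str], limit: int) -> list[str]:
    # Assumes docs are ordered most relevant first, so trimming drops from the end.
    counts = [count_tokens(doc) for doc in docs]
    keep = len(docs)
    total = sum(counts)
    while keep > 0 and total > limit:
        keep -= 1
        total -= counts[keep]
    return docs[:keep]


retrieved = [
    "pro tier includes sso and audit logs",
    "billing is monthly or annual",
    "enterprise adds a dedicated support contact",
]
print(trim_to_token_limit(retrieved, limit=12))
```

Because only whole documents are dropped, a limit smaller than the first chunk leaves the LLM with no retrieved context at all, so the limit should comfortably exceed the chunk size.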

Run a multi-turn conversation. Memory persists across turns automatically.

print(chain.invoke({"question": "What features does the pro tier include?"})["answer"])
print(chain.invoke({"question": "Are they billed monthly or annually?"})["answer"])
print(chain.invoke({"question": "What about the enterprise plan?"})["answer"])

Expected output: each answer is informed by the chat history kept in the window (the last 5 exchanges with k=5) and by the context retrieved for that turn.
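
The second question above only makes sense with history ("they" refers to the pro tier features). A rough sketch of how a follow-up and the history might be combined into a prompt that asks the LLM for a standalone question is shown below. The function names and the prompt wording are illustrative assumptions, not the chain's exact template.

```python
def format_history(turns: list[tuple[str, str]]) -> str:
    # Render each exchange as a labelled pair of lines.
    return "\n".join(f"Human: {user}\nAssistant: {ai}" for user, ai in turns)


def build_condense_prompt(turns: list[tuple[str, str]], follow_up: str) -> str:
    # Hypothetical wording; the real chain ships its own prompt template.
    return (
        "Given the following conversation and a follow up question, "
        "rephrase the follow up question to be a standalone question.\n\n"
        f"Chat History:\n{format_history(turns)}\n"
        f"Follow Up Input: {follow_up}\n"
        "Standalone question:"
    )


turns = [("What features does the pro tier include?", "Feature list ...")]
print(build_condense_prompt(turns, "Are they billed monthly or annually?"))
```

The LLM's reply to such a prompt, rather than the raw follow-up, is what gets embedded and sent to the retriever.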

Verification

python -c "
from langchain.memory import ConversationBufferWindowMemory
mem = ConversationBufferWindowMemory(k=3)
mem.chat_memory.add_user_message('Hello')
mem.chat_memory.add_ai_message('Hi there')
print(len(mem.chat_memory.messages))
# Expected: 2
"

Common failures

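Context window overflow. max_tokens_limit caps the tokens of the retrieved documents passed to the LLM, so set it below the model's context size and leave room for the question and chat history. The length of the history itself is bounded by the window size k, not by this setting.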
Memory not influencing retrieval. Ensure memory_key is "chat_history", the input name this chain expects; with any other key the history never reaches the chain, and the call may fail with a missing-input error.
Old turns being forgotten. With k=5 as configured in the steps, only the last 5 exchanges are kept; raise k for conversations that refer back further, keeping the model's context size in mind.
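Stale retrieved context after memory update. The chain retrieves again on every call, using the new question rewritten against the current chat history, so the context should track the conversation. If answers still look stale, check whether the rewritten question dropped the detail you care about.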
Version mismatch. The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
Local environment drift. Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
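
For the last two items, a small standard-library script can print the active interpreter and the installed package versions. The package list is an assumption based on the imports in this guide (chromadb is assumed to be the distribution behind the Chroma vector store); adjust it to match your environment.

```python
import sys
from importlib import metadata

# Assumed distribution names, inferred from the imports used in the steps.
PACKAGES = ("langchain", "langchain-community", "langchain-ollama", "chromadb")


def report_environment(packages: tuple[str, ...] = PACKAGES) -> dict[str, str]:
    # Start with the interpreter path so a wrong virtual environment is obvious.
    report = {"python": sys.executable}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


for key, value in report_environment().items():
    print(f"{key}: {value}")
```

Run it with the same interpreter you use for the guide; a "not installed" entry or an unexpected python path usually points to the wrong virtual environment.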

How to Build Conversational RAG with Memory
