HOW-TO · RAG

How to Build Conversational RAG with Memory

intermediate · 30 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

LangChain installed, Ollama running

What this does

Standard RAG answers one question at a time and remembers nothing about earlier exchanges. Conversational RAG with memory extends the pipeline by storing chat history and using it on every turn: the follow-up question is rewritten against the history before retrieval, so the retriever receives a self-contained query. This enables multi-turn interactions where the system resolves references such as "the previous article" or "what you mentioned about embeddings" without the user repeating context.
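To make the idea concrete, here is a minimal standard-library sketch of how a follow-up question and the chat history might be combined into one prompt that asks the model for a standalone question. The function names, the prompt wording, and the sample answer are illustrative assumptions, not LangChain's actual template or output.

```python
from typing import List, Tuple

# One turn = (user question, assistant answer).
Turn = Tuple[str, str]


def format_history(history: List[Turn]) -> str:
    """Render past turns as alternating Human/Assistant lines."""
    lines = []
    for question, answer in history:
        lines.append(f"Human: {question}")
        lines.append(f"Assistant: {answer}")
    return "\n".join(lines)


def build_condense_prompt(history: List[Turn], follow_up: str) -> str:
    """Build a prompt asking the LLM to rewrite the follow-up as a standalone question.

    With no history there is nothing to resolve, so the question is returned as-is.
    """
    if not history:
        return follow_up
    return (
        "Given the conversation below, rephrase the follow-up question "
        "as a standalone question.\n\n"
        f"Chat history:\n{format_history(history)}\n\n"
        f"Follow-up question: {follow_up}\n"
        "Standalone question:"
    )


# Placeholder history; the answer text is invented for illustration only.
history = [
    ("What features does the pro tier include?", "(placeholder answer about the pro tier)"),
]
prompt = build_condense_prompt(history, "Are they billed monthly or annually?")
print(prompt)
```

The model's reply to this prompt, rather than the raw follow-up, is what would be sent to the retriever, which is why "they" can still retrieve documents about the pro tier.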

Steps

  1. Import memory and chain components. Use LangChain's built-in conversation abstractions.

    import os
    os.environ["OLLAMA_BASE_URL"] = "http://localhost:11434"
    
    from langchain_ollama import ChatOllama, OllamaEmbeddings
    from langchain_community.vectorstores import Chroma
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.chains import ConversationalRetrievalChain
    from langchain.memory import ConversationBufferWindowMemory
    
  2. Index documents into a vector store. The corpus provides the knowledge base.

    from langchain_community.document_loaders import TextLoader
    
    docs = TextLoader("context/product_docs.txt").load()
    chunks = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(docs)
    embeddings = OllamaEmbeddings(model="llama3")
    db = Chroma.from_documents(chunks, embeddings)
    
  3. Create a windowed memory. ConversationBufferWindowMemory passes only the last N turns to the chain, which keeps the prompt from growing without bound.

    memory = ConversationBufferWindowMemory(
        k=5,
        memory_key="chat_history",
        output_key="answer",
        return_messages=True,
    )
    
  4. Build the conversational retrieval chain. Combine the retriever, LLM, and memory.

    llm = ChatOllama(model="llama3")
    chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=db.as_retriever(),
        memory=memory,
        max_tokens_limit=4096,
    )
    
  5. Run a multi-turn conversation. Memory persists across turns automatically.

    print(chain.invoke({"question": "What features does the pro tier include?"})["answer"])
    print(chain.invoke({"question": "Are they billed monthly or annually?"})["answer"])
    print(chain.invoke({"question": "What about the enterprise plan?"})["answer"])
    

    Expected output: each answer draws on the retrieved context and on the recent conversation history (up to the last 5 exchanges, since k=5).
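The windowing behaviour from step 3 can be sketched without LangChain. The class below is a simplified stand-in (the name WindowMemory and its methods are hypothetical): it discards older turns outright, which may differ from how the library stores messages internally, but it shows what the chain sees once the window is full.

```python
from collections import deque
from typing import Deque, List, Tuple

Turn = Tuple[str, str]  # (user question, assistant answer)


class WindowMemory:
    """Hypothetical sliding-window memory: remembers only the last k turns."""

    def __init__(self, k: int) -> None:
        if k < 1:
            raise ValueError("k must be at least 1")
        # A deque with maxlen drops the oldest item automatically on append.
        self._turns: Deque[Turn] = deque(maxlen=k)

    def save_turn(self, question: str, answer: str) -> None:
        """Record one completed exchange."""
        self._turns.append((question, answer))

    def load(self) -> List[Turn]:
        """Return the turns currently inside the window, oldest first."""
        return list(self._turns)


memory = WindowMemory(k=2)
for i in range(1, 5):
    memory.save_turn(f"question {i}", f"answer {i}")

print(memory.load())
# [('question 3', 'answer 3'), ('question 4', 'answer 4')]
```

After four exchanges with k=2, the first two are no longer visible, which is the same trade-off the guide's k=5 makes at a larger scale.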

Verification

python -c "
from langchain.memory import ConversationBufferWindowMemory
mem = ConversationBufferWindowMemory(k=3)
mem.chat_memory.add_user_message('Hello')
mem.chat_memory.add_ai_message('Hi there')
print(len(mem.chat_memory.messages))
# Expected: 2
"
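The check above only confirms that the memory class imports and stores messages. A complementary check, sketched here with the standard library, is to ask the Ollama server which models it has pulled. It assumes the default base URL from step 1 and that the /api/tags endpoint returns JSON with a "models" list whose entries carry a "name" field; the helper names are hypothetical.

```python
import json
import urllib.request
from typing import List


def parse_model_names(payload: dict) -> List[str]:
    """Extract model names from a /api/tags style response (assumed shape)."""
    return [m["name"] for m in payload.get("models", []) if "name" in m]


def list_ollama_models(base_url: str = "http://localhost:11434",
                       timeout: float = 2.0) -> List[str]:
    """Return the model names reported by Ollama, or [] if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts;
        # ValueError covers a body that is not valid JSON.
        return []


if __name__ == "__main__":
    models = list_ollama_models()
    print("Ollama models:", models or "none found (is the server running?)")
```

If the list is empty or does not include the model named in steps 2 and 4, the chain will fail at the first call rather than at import time.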

Common failures

  • Context window overflow. max_tokens_limit caps the tokens taken up by retrieved documents, not the chat history; set it well below the model's context size so the question, history, and answer still fit. The history itself is bounded by the memory window (k).
  • Memory not influencing retrieval. Ensure memory_key is "chat_history", the input key the chain expects; with any other key the history never reaches the chain.
  • Old turns being forgotten. With k=5 as set in step 3, only the last 5 exchanges are passed to the chain; widen the window if conversations refer further back.
  • Stale retrieved context after memory update. This should not occur: the chain re-retrieves on every call using the updated chat history, so the retrieved context tracks the current turn.
  • Version mismatch. The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
  • Local environment drift. Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
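The context window overflow item is easier to reason about with a small sketch of budget trimming: retrieved chunks are dropped from the end of the ranked list until the total fits. This is an illustration under stated assumptions, not the library's code: token counts are approximated by whitespace splitting, the function names are hypothetical, and the chunk texts are placeholders.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Rough token estimate (assumption): one token per whitespace-separated word."""
    return len(text.split())


def trim_to_budget(chunks: List[str], max_tokens: int) -> List[str]:
    """Drop the lowest-ranked chunks (end of the list) until the total fits the budget."""
    counts = [estimate_tokens(chunk) for chunk in chunks]
    keep = len(chunks)
    total = sum(counts)
    while keep > 0 and total > max_tokens:
        keep -= 1
        total -= counts[keep]
    return chunks[:keep]


# Placeholder chunks, ordered from most to least relevant.
chunks = [
    "pro tier includes priority support",   # 5 words
    "billing is monthly or annual",         # 5 words
    "enterprise plans are custom",          # 4 words
]

print(trim_to_budget(chunks, max_tokens=10))
# ['pro tier includes priority support', 'billing is monthly or annual']
```

A budget that is too small silently removes context (here, the enterprise chunk), so answers that ignore obviously relevant documents are a hint to raise the limit or shrink chunk_size.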
