How to Build Context-Aware RAG with Follow-Up Questions
RAG pipeline with conversational memory
What this does
Without conversation history, a RAG pipeline treats every question in isolation, so it cannot resolve references to earlier turns such as "the second point" or "that company." Context-aware RAG passes prior exchanges along with the current query so the system can resolve pronouns and understand follow-up intent. This produces more natural, coherent conversations over your documents.
Steps
Import conversation components. LangChain's ConversationalRetrievalChain manages history automatically.

    import os

    # Point at the local Ollama server
    os.environ["OLLAMA_BASE_URL"] = "http://localhost:11434"

    from langchain_ollama import ChatOllama, OllamaEmbeddings
    from langchain_community.vectorstores import Chroma
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.chains import ConversationalRetrievalChain
    from langchain.memory import ConversationBufferMemory

Index documents as usual. Build the vector store for retrieval.

    from langchain_community.document_loaders import TextLoader

    # Load the source file, split it into chunks, and embed the chunks
    docs = TextLoader("context/meeting_notes.txt").load()
    chunks = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(docs)
    embeddings = OllamaEmbeddings(model="llama3")
    db = Chroma.from_documents(chunks, embeddings)

Create a conversation memory buffer. The buffer stores message history between turns.

    memory = ConversationBufferMemory(
        memory_key="chat_history",  # input name the chain reads history from
        output_key="answer",        # which chain output to store as the AI turn
        return_messages=True,
    )

Build the conversational retrieval chain. This chain merges history and the current query.

    llm = ChatOllama(model="llama3")
    chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=db.as_retriever(),
        memory=memory,
    )

Ask a follow-up question. The chain interprets pronouns such as "it," "that," or "them" using prior context.

    first_response = chain.invoke({"question": "What were the main decisions?"})
    print(first_response["answer"])

    # "them" refers back to the decisions from the first turn
    follow_up_response = chain.invoke({"question": "Who approved them?"})
    print(follow_up_response["answer"])

Expected output: the second answer references the previous conversation without re-explaining context.
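To make the "merges history and the current query" step more concrete, here is a minimal sketch of how a follow-up can be folded into a standalone question before retrieval. It uses only the standard library. The template wording, `format_history`, and `build_condense_prompt` are hypothetical names for illustration, and the answer text is a placeholder; the prompt LangChain actually sends may differ.

```python
from typing import List, Tuple

# Hypothetical template, loosely modeled on a "condense question" prompt.
# This is an assumption for illustration, not the library's exact wording.
CONDENSE_TEMPLATE = (
    "Given the following conversation and a follow-up question, "
    "rephrase the follow-up question to be a standalone question.\n\n"
    "Chat history:\n{chat_history}\n"
    "Follow-up question: {question}\n"
    "Standalone question:"
)


def format_history(turns: List[Tuple[str, str]]) -> str:
    """Render (human, ai) pairs as alternating labelled lines."""
    lines = []
    for human, ai in turns:
        lines.append(f"Human: {human}")
        lines.append(f"Assistant: {ai}")
    return "\n".join(lines)


def build_condense_prompt(turns: List[Tuple[str, str]], question: str) -> str:
    """Combine prior turns and the new question into one rewrite prompt."""
    return CONDENSE_TEMPLATE.format(
        chat_history=format_history(turns),
        question=question,
    )


# Placeholder answer text; a real run would use the chain's first answer.
turns = [("What were the main decisions?", "(first answer from the chain)")]
prompt = build_condense_prompt(turns, "Who approved them?")
print(prompt)
```

The model's rewrite of this prompt (for example, a question that names the decisions explicitly) is what would then be sent to the retriever, which is why "them" can still match the right chunks.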
Verification
python -c "
from langchain.memory import ConversationBufferMemory
m = ConversationBufferMemory()
m.chat_memory.add_user_message('What is RAG?')
m.chat_memory.add_ai_message('RAG is retrieval-augmented generation.')
print(m.chat_memory.messages[0].content)
# Expected: What is RAG?
"
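If the check above fails on import, it can help to confirm which interpreter is running and which package versions it sees. The sketch below uses only the standard library; the distribution names in `PACKAGES` are an assumption inferred from the imports in this guide, so adjust them to match your setup.

```python
import sys
from importlib import metadata

# Assumed distribution names, inferred from the imports used in this guide.
PACKAGES = ["langchain", "langchain-community", "langchain-ollama", "chromadb"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The interpreter path shows which virtual environment is actually active.
print("Interpreter:", sys.executable)
for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```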
Common failures
- Memory growing unbounded. ConversationBufferMemory keeps every turn and has no size cap; switch to a token-limited memory class such as ConversationTokenBufferMemory and set its max_token_limit to prevent context window overflow.
- Retriever ignoring chat history. Verify memory_key="chat_history" matches the chain's expected parameter name.
- Conflicting pronouns. When a follow-up could refer to several prior topics, the chain may pick the wrong one; add explicit disambiguation in the prompt.
- Embedding context not refreshed. If documents change, rebuild the vector store; stale embeddings cause irrelevant retrieval.
- Version mismatch. The installed package or runtime differs from the one this guide assumes; check the version first, then rerun the smallest verification command.
- Local environment drift. Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
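For the unbounded-memory failure, the idea behind a token cap can be sketched in plain Python: keep only the most recent messages that fit a budget. `approx_tokens` and `trim_history` are hypothetical helpers, and the word-count estimate is a crude stand-in for a real tokenizer, so treat this as an illustration of the idea, not as what a LangChain memory class does internally.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, content)


def approx_tokens(text: str) -> int:
    """Crude estimate (assumption): one token per whitespace-separated word."""
    return len(text.split())


def trim_history(messages: List[Message], max_tokens: int) -> List[Message]:
    """Keep the most recent messages whose combined estimate fits the budget."""
    kept: List[Message] = []
    used = 0
    # Walk from newest to oldest and stop at the first message that overflows.
    for role, content in reversed(messages):
        cost = approx_tokens(content)
        if used + cost > max_tokens:
            break
        kept.append((role, content))
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    ("human", "What is RAG?"),
    ("ai", "RAG is retrieval-augmented generation."),
    ("human", "Who approved them?"),
]
print(trim_history(history, max_tokens=7))
```

Dropping the oldest turns first keeps the most recent context, which is usually what a follow-up refers to; the trade-off is that references to much earlier turns can no longer be resolved.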