What this does

Without conversation history, a RAG pipeline treats every question in isolation, so it cannot resolve references to earlier turns such as "the second point" or "that company." Context-aware RAG passes the prior exchanges along with the current query so the system can work out what a follow-up or a pronoun refers to. The result is a more natural, coherent conversation over your documents.

Steps

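Import the conversation components. LangChain's ConversationalRetrievalChain, combined with a memory object, carries the chat history from one turn to the next. Both classes belong to LangChain's legacy chain API and are marked deprecated in recent releases, so expect deprecation warnings.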

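import os

# Default address of a local Ollama server. If your server runs elsewhere,
# ChatOllama and OllamaEmbeddings also accept base_url=... directly.
os.environ["OLLAMA_BASE_URL"] = "http://localhost:11434"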

from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

Index documents as usual. Build the vector store for retrieval.

from langchain_community.document_loaders import TextLoader

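# Load the source file as a list of Document objects.
docs = TextLoader("context/meeting_notes.txt").load()
# Split into chunks of at most 500 characters so each embedding covers one topic.
chunks = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(docs)
# llama3 can produce embeddings; a dedicated embedding model is often a better fit.
embeddings = OllamaEmbeddings(model="llama3")
# Embed every chunk and store the vectors in an in-memory Chroma collection.
db = Chroma.from_documents(chunks, embeddings)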
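
To make the effect of chunk_size concrete, here is a minimal standard-library sketch of fixed-size chunking with overlap. It is not how RecursiveCharacterTextSplitter works internally (that class prefers to split on separators such as paragraph breaks). The helper name split_fixed and the overlap value of 50 are assumptions chosen for illustration.

```python
def split_fixed(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Cut text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each window advances
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Synthetic 1200-character input, used only to show the window boundaries.
sample = "".join(str(i % 10) for i in range(1200))
pieces = split_fixed(sample, chunk_size=500, overlap=50)
print([len(p) for p in pieces])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, which tends to help retrieval on follow-up questions.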

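Create a conversation memory buffer. The buffer stores the message history between turns. memory_key names the variable the chain reads the history from, and output_key="answer" tells the memory which chain output to store, which matters once the chain returns more than one key (for example with return_source_documents=True).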

memory = ConversationBufferMemory(
    memory_key="chat_history",
    output_key="answer",
    return_messages=True,
)
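
As a rough mental model of what the buffer holds after each turn, the following standard-library sketch stores one human message and one AI message per exchange and hands them back under the chat_history key. ToyBufferMemory, Message, and save_turn are hypothetical names, not LangChain classes, and the sample answer text is a made-up placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    role: str      # "human" or "ai"
    content: str


@dataclass
class ToyBufferMemory:
    """Hypothetical stand-in for a conversation buffer; keeps every message."""
    memory_key: str = "chat_history"
    messages: list[Message] = field(default_factory=list)

    def save_turn(self, question: str, answer: str) -> None:
        # Each exchange adds two messages, in the order they were spoken.
        self.messages.append(Message("human", question))
        self.messages.append(Message("ai", answer))

    def load(self) -> dict[str, list[Message]]:
        # Return a copy so callers cannot mutate the stored history.
        return {self.memory_key: list(self.messages)}


toy = ToyBufferMemory()
toy.save_turn("What were the main decisions?", "(placeholder first answer)")
stored = toy.load()["chat_history"]
print([(m.role, m.content) for m in stored])
```

Because nothing is ever dropped, a buffer like this grows by two messages per turn, which is why long sessions need some form of trimming.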

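Build the conversational retrieval chain. On each turn the chain uses the stored history to rewrite the follow-up as a standalone question, retrieves documents with that rewritten question, and then generates the answer.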

llm = ChatOllama(model="llama3")
chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=db.as_retriever(),
    memory=memory,
)
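
The flow inside the chain can be sketched without any LLM as three steps: condense, retrieve, generate. The version below is a simplified illustration under stated assumptions, not LangChain's code: build_condense_prompt, answer_with_history, and the fake_* stubs are hypothetical, and the prompt wording is invented for the example.

```python
from typing import Callable

Turn = tuple[str, str]  # (user question, assistant answer)


def build_condense_prompt(history: list[Turn], follow_up: str) -> str:
    """Format prior turns plus the follow-up into a rewrite instruction."""
    lines = []
    for question, answer in history:
        lines.append(f"Human: {question}")
        lines.append(f"Assistant: {answer}")
    transcript = "\n".join(lines)
    return (
        "Rewrite the follow-up as a standalone question.\n\n"
        f"Chat history:\n{transcript}\n\n"
        f"Follow-up: {follow_up}\n"
        "Standalone question:"
    )


def answer_with_history(
    question: str,
    history: list[Turn],
    rewrite: Callable[[str], str],
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
) -> str:
    # On the first turn there is nothing to resolve, so the question is used as-is.
    if history:
        standalone = rewrite(build_condense_prompt(history, question))
    else:
        standalone = question
    docs = retrieve(standalone)        # search with the rewritten question
    return generate(standalone, docs)  # answer from the retrieved chunks


# Stubs standing in for the LLM and the vector store (assumptions for the demo).
seen_queries: list[str] = []

def fake_rewrite(prompt: str) -> str:
    return "Who approved the main decisions?"

def fake_retrieve(query: str) -> list[str]:
    seen_queries.append(query)
    return ["(retrieved chunk)"]

def fake_generate(query: str, docs: list[str]) -> str:
    return f"answer to: {query}"


history = [("What were the main decisions?", "(first answer)")]
result = answer_with_history(
    "Who approved them?", history, fake_rewrite, fake_retrieve, fake_generate
)
print(seen_queries[-1])
```

The point of the rewrite step is that the retriever never sees the bare pronoun: it searches with the standalone question, which is what makes follow-ups retrievable at all.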

Ask a follow-up question. The chain resolves pronouns such as "them" or "it" using the previous turn stored in memory.

first_response = chain.invoke({"question": "What were the main decisions?"})
print(first_response["answer"])

follow_up_response = chain.invoke({"question": "Who approved them?"})
print(follow_up_response["answer"])

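Expected output: the second answer says who approved the decisions listed in the first answer, even though the follow-up never restates them. The exact wording depends on the model and on your documents.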

Verification

python -c "
from langchain.memory import ConversationBufferMemory
m = ConversationBufferMemory()
m.chat_memory.add_user_message('What is RAG?')
m.chat_memory.add_ai_message('RAG is retrieval-augmented generation.')
print(m.chat_memory.messages[0].content)
# Expected: What is RAG?
"

Common failures

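Memory growing unbounded. ConversationBufferMemory keeps every message and has no size limit. To avoid overflowing the context window, switch to ConversationBufferWindowMemory (keeps the last k exchanges) or to ConversationTokenBufferMemory or ConversationSummaryBufferMemory, which accept a max_token_limit.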
Retriever ignoring chat history. Verify that the memory uses memory_key="chat_history", the input name the chain expects; with any other key the history never reaches the question-rewriting step.
Conflicting pronouns. When a follow-up could refer to several prior topics, the chain may pick the wrong one; name the topic explicitly in the follow-up, or add disambiguation instructions to the question-rewriting prompt.
Embedding context not refreshed. If documents change, rebuild the vector store; stale embeddings cause irrelevant retrieval.
Version mismatch. The installed package or runtime differs from the one the commands assume; check the installed versions first, then rerun the smallest verification command.
Local environment drift. A different service, virtual environment, model, or path is in use than the one you expect; print the active binary path and configuration before changing the guide steps.
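
For the unbounded-memory failure above, the idea behind a token-limited buffer can be sketched in plain Python: walk the history from newest to oldest and keep messages until a budget is spent. This is an illustration only. trim_history is a hypothetical helper, and counting whitespace-separated words is a crude assumption standing in for a real tokenizer.

```python
def trim_history(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose combined size fits in max_tokens."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):   # newest first
        cost = len(message.split())      # rough token estimate (assumption)
        if used + cost > max_tokens:
            break                        # older messages are dropped
        kept.append(message)
        used += cost
    kept.reverse()                       # restore chronological order
    return kept


turns = ["one two three", "four five", "six"]
recent = trim_history(turns, max_tokens=3)
print(recent)
```

Dropping the oldest turns first keeps the most recent referents available, at the cost of losing anything the user mentioned early in a long session.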
