RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

© 2026 runlocalai.co · Independently operated
HOW-TO · RAG

How to Build Context-Aware RAG with Follow-Up Questions

intermediate·30 min·By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

RAG pipeline with conversational memory

What this does

Without conversation history, a RAG pipeline treats every question in isolation, so it cannot resolve references to earlier turns such as "the second point" or "that company." Context-aware RAG combines the prior exchanges with the current query, so the system can resolve pronouns and follow-up intent before it retrieves. The result is a more natural, coherent conversation over your documents.
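
To make the problem concrete, here is a minimal sketch that uses a toy keyword-overlap retriever in place of a real vector store. The chunks, the stopword list, and the `top_chunks` helper are all hypothetical and exist only for illustration; a real embedding retriever behaves differently in detail but tends to fail on bare pronouns for the same reason.

```python
import re

# Hypothetical meeting-note chunks standing in for an indexed document.
CHUNKS = [
    "Decisions: move billing to PostgreSQL and freeze hiring until Q3.",
    "Dana and Luis signed off on the PostgreSQL move and the hiring freeze.",
    "The catering invoice was approved by the office manager.",
]

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "and", "by", "the", "them", "to", "was", "who", "on", "off", "until"}


def keywords(text: str) -> set[str]:
    """Lowercase the text and keep the words that are not stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def top_chunks(query: str, k: int = 2) -> list[str]:
    """Return up to k chunks that share at least one keyword with the query, best first."""
    scored = [(len(keywords(query) & keywords(chunk)), chunk) for chunk in CHUNKS]
    ranked = sorted(scored, key=lambda pair: -pair[0])  # stable sort keeps ties in order
    return [chunk for score, chunk in ranked[:k] if score > 0]


raw_follow_up = "Who approved them?"
standalone = "Who approved the decisions to move billing to PostgreSQL and freeze hiring?"

print(top_chunks(raw_follow_up))  # only the unrelated catering chunk matches
print(top_chunks(standalone))     # the decisions chunk and the sign-off chunk match
```

The raw follow-up carries a single usable keyword, so it latches onto the wrong chunk. Once the pronoun is replaced with what it refers to, the relevant chunks surface.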

Steps

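  1. Import conversation components. LangChain's ConversationalRetrievalChain manages chat history for you once a memory object is attached. Recent LangChain releases mark this class as deprecated in favor of create_history_aware_retriever, so newer versions may emit a deprecation warning.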

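    import os
    # Default local Ollama endpoint. langchain_ollama may not read this
    # variable; if your server runs elsewhere, also pass base_url=... to
    # ChatOllama and OllamaEmbeddings.
    os.environ["OLLAMA_BASE_URL"] = "http://localhost:11434"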
    
    from langchain_ollama import ChatOllama, OllamaEmbeddings
    from langchain_community.vectorstores import Chroma
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.chains import ConversationalRetrievalChain
    from langchain.memory import ConversationBufferMemory
    
  2. Index documents as usual. Build the vector store for retrieval.

    from langchain_community.document_loaders import TextLoader
    
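    # Load the notes and split them into chunks of at most 500 characters.
    docs = TextLoader("context/meeting_notes.txt").load()
    chunks = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(docs)
    # llama3 can produce embeddings, but a dedicated embedding model
    # (for example nomic-embed-text) usually retrieves better.
    embeddings = OllamaEmbeddings(model="llama3")
    # In-memory Chroma store; pass persist_directory=... to keep it on disk.
    db = Chroma.from_documents(chunks, embeddings)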
    
  3. Create a conversation memory buffer. The buffer stores message history between turns.

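    memory = ConversationBufferMemory(
        memory_key="chat_history",  # must match the input name the chain expects
        output_key="answer",        # which chain output is stored as the AI turn
        return_messages=True,       # keep message objects instead of one string
    )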
    
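  4. Build the conversational retrieval chain. On each follow-up, the chain first condenses the chat history and the current query into a standalone question, then retrieves with that question and answers from the retrieved chunks.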

    llm = ChatOllama(model="llama3")
    chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=db.as_retriever(),
        memory=memory,
    )
    
  5. Ask a follow-up question. The chain interprets "it" or "that" using prior context.

    first_response = chain.invoke({"question": "What were the main decisions?"})
    print(first_response["answer"])
    
    follow_up_response = chain.invoke({"question": "Who approved them?"})
    print(follow_up_response["answer"])
    

    Expected output: the second answer resolves "them" to the decisions from the first turn and names who approved them, without you restating the context.
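
The rewriting step behind steps 4 and 5 can be sketched without LangChain or a running model. The template wording below is an approximation written for this example, not a copy of LangChain's built-in prompt, and `format_history`, `standalone_question`, and `stub_llm` are hypothetical names. The `llm` argument is any callable that maps a prompt string to a completion string.

```python
from typing import Callable

# Approximate "condense question" template. The wording is an assumption
# made for illustration, not LangChain's exact prompt.
CONDENSE_TEMPLATE = (
    "Given the conversation below and a follow-up question, rewrite the "
    "follow-up as a standalone question.\n\n"
    "Chat history:\n{history}\n"
    "Follow-up: {question}\n"
    "Standalone question:"
)


def format_history(turns: list[tuple[str, str]]) -> str:
    """Render (question, answer) pairs as alternating Human/Assistant lines."""
    lines = []
    for question, answer in turns:
        lines.append(f"Human: {question}")
        lines.append(f"Assistant: {answer}")
    return "\n".join(lines)


def standalone_question(
    turns: list[tuple[str, str]],
    question: str,
    llm: Callable[[str], str],
) -> str:
    """Pass the first question through unchanged; rewrite later ones with the LLM."""
    if not turns:
        return question
    prompt = CONDENSE_TEMPLATE.format(history=format_history(turns), question=question)
    return llm(prompt).strip()


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    return "Who approved the main decisions?"


history = [("What were the main decisions?", "Move billing to PostgreSQL.")]
print(standalone_question([], "What were the main decisions?", stub_llm))
print(standalone_question(history, "Who approved them?", stub_llm))
```

The rewritten question, not the raw follow-up, is what gets sent to the retriever. That is why a vague pronoun can still pull back the right chunks.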

Verification

python -c "
from langchain.memory import ConversationBufferMemory
m = ConversationBufferMemory()
m.chat_memory.add_user_message('What is RAG?')
m.chat_memory.add_ai_message('RAG is retrieval-augmented generation.')
print(m.chat_memory.messages[0].content)
# Expected: What is RAG?
"

Common failures

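  • Memory growing unbounded. ConversationBufferMemory keeps every turn and has no size cap. To prevent context window overflow, switch to ConversationBufferWindowMemory (keeps the last k turns) or ConversationTokenBufferMemory (takes an llm and a max_token_limit).
  • Retriever ignoring chat history. Verify that memory_key="chat_history" matches the input name the chain expects; with any other key the history is never passed to the question-rewriting step.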
  • Conflicting pronouns. When a follow-up refers to multiple prior topics, the chain may pick the wrong one; add explicit disambiguation in the prompt.
  • Embedding context not refreshed. If documents change, rebuild the vector store; stale embeddings cause irrelevant retrieval.
  • Version mismatch. The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
  • Local environment drift. Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
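
For the unbounded-memory failure, the idea of a capped history can be sketched in plain Python. This is an illustration of the trimming logic only, not LangChain's implementation: `trim_history` is a hypothetical helper, and it uses character count as a crude stand-in for token count.

```python
def trim_history(messages: list[tuple[str, str]], max_chars: int) -> list[tuple[str, str]]:
    """Keep the most recent (role, text) messages whose combined text fits in max_chars.

    Character count is a rough proxy for tokens (an assumption for this sketch).
    """
    kept: list[tuple[str, str]] = []
    used = 0
    # Walk from newest to oldest and stop at the first message that would overflow.
    for role, text in reversed(messages):
        if used + len(text) > max_chars:
            break
        kept.append((role, text))
        used += len(text)
    kept.reverse()  # restore chronological order
    return kept


history = [
    ("human", "What were the main decisions?"),
    ("ai", "Move billing to PostgreSQL and freeze hiring until Q3."),
    ("human", "Who approved them?"),
]
print(trim_history(history, max_chars=80))
```

Dropping the oldest turns first keeps the most recent references resolvable, at the cost of forgetting early context. A summarizing memory trades that loss for an extra model call.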

Related guides

  • How to Build Conversational RAG with Memory
  • How to Add Query Expansion to Improve Recall