RetrievalQA Chain — LangChain for Local AI (Chapter 15)

RetrievalQA supports several chain types, each controlling how the retrieved documents are combined with the query before they reach the LLM. Understanding the tradeoffs prevents expensive mistakes in production.

stuff mode concatenates all retrieved documents into one prompt and makes a single LLM call. It is the fastest and cheapest option, but it fails when the combined documents exceed the model's context length.

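from langchain.chains import RetrievalQA

# llm, retriever, and prompt are assumed to be defined earlier in the chapter.
# A custom prompt is passed through chain_type_kwargs, not as a direct argument.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",
    chain_type_kwargs={"prompt": prompt},
)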
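Because stuff mode fails once the combined documents outgrow the context window, a size check before building the prompt can catch the problem early. The following is a minimal sketch, not part of LangChain: the helper names estimate_tokens and fits_in_context, the four-characters-per-token heuristic, and the budget numbers are all assumptions for illustration. A real check would count tokens with the model's own tokenizer.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. Assumes ~4 characters per token (a heuristic, not a tokenizer)."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(docs: list[str], question: str, context_window: int,
                    reserved_for_answer: int = 256) -> bool:
    """Return True if the stuffed prompt is likely to fit the model's context window."""
    prompt_tokens = estimate_tokens(question) + sum(estimate_tokens(d) for d in docs)
    return prompt_tokens + reserved_for_answer <= context_window


# Three retrieved chunks of roughly 1000 estimated tokens each (hypothetical data)
docs = ["a" * 4000, "b" * 4000, "c" * 4000]
question = "What does the manual say about resets?"

small_ok = fits_in_context(docs, question, context_window=2048)
large_ok = fits_in_context(docs, question, context_window=8192)
print(small_ok)  # False: about 3000 prompt tokens do not fit in 2048
print(large_ok)  # True
```

When the check fails, the options are to retrieve fewer documents, use smaller chunks, or switch to map_reduce or refine.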

map_reduce sends each document to the LLM separately, collects the individual answers, then combines them. It handles any number of documents but makes N+1 LLM calls (one per document plus one for the final answer), so cost and latency grow with the number of retrieved documents.

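from langchain.prompts import PromptTemplate

# Both prompts are passed through chain_type_kwargs. The per-document prompt is
# named question_prompt, and the combine prompt receives the partial answers
# in the {summaries} variable.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="map_reduce",
    chain_type_kwargs={
        "question_prompt": PromptTemplate.from_template("""Answer this based on the document:

Question: {question}

Document: {context}"""),
        "combine_prompt": PromptTemplate.from_template("""Combine these partial answers:

{summaries}

Original question: {question}

Final answer:"""),
    },
)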
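To make the N+1 call count concrete, here is a minimal sketch of the map-reduce pattern in plain Python. It is not LangChain's implementation: map_reduce_answer and CountingLLM are hypothetical names, and the fake model only counts calls and returns canned replies.

```python
from typing import Callable


def map_reduce_answer(llm: Callable[[str], str], docs: list[str], question: str) -> str:
    """Answer once per document (map), then merge the partial answers (reduce)."""
    partials = [
        llm(f"Answer this based on the document:\n\nQuestion: {question}\n\nDocument: {doc}")
        for doc in docs
    ]
    summaries = "\n".join(partials)
    return llm(
        f"Combine these partial answers:\n\n{summaries}\n\n"
        f"Original question: {question}\n\nFinal answer:"
    )


class CountingLLM:
    """Stand-in for a real model: counts calls and returns a canned reply."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"reply {self.calls}"


fake_llm = CountingLLM()
docs = ["doc one", "doc two", "doc three"]
final = map_reduce_answer(fake_llm, docs, "What changed?")
print(fake_llm.calls)  # 4: three map calls plus one combine call
```

The map calls are independent of each other, which is why this pattern can be parallelized, while the combine call has to wait for all of them.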

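refine processes documents sequentially, updating its answer as it sees each new document. It makes one LLM call per document and each call depends on the previous one, so the calls cannot run in parallel. It suits order-sensitive content such as procedures or narratives.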

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="refine"
)
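The sequential update can be sketched as a simple loop. This is an illustration of the pattern under stated assumptions, not LangChain's code: refine_answer and recording_llm are hypothetical names, the prompt wording is invented, and the stub model just records the prompts it receives.

```python
from typing import Callable


def refine_answer(llm: Callable[[str], str], docs: list[str], question: str) -> str:
    """Draft an answer from the first document, then revise it once per remaining document."""
    if not docs:
        raise ValueError("refine needs at least one document")
    answer = llm(f"Question: {question}\n\nContext: {docs[0]}\n\nAnswer:")
    for doc in docs[1:]:
        answer = llm(
            f"Question: {question}\n\nExisting answer: {answer}\n\n"
            f"New context: {doc}\n\nRefine the existing answer if the new context helps:"
        )
    return answer


prompts: list[str] = []


def recording_llm(prompt: str) -> str:
    """Stand-in for a real model: stores each prompt and returns a numbered draft."""
    prompts.append(prompt)
    return f"draft {len(prompts)}"


steps = ["Unplug the device.", "Hold reset for 10 seconds.", "Plug it back in."]
final = refine_answer(recording_llm, steps, "How do I reset the device?")
print(len(prompts))  # 3: one call per document, in order
print(final)         # draft 3
```

Each prompt after the first carries the previous draft, which is what makes the result depend on document order.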

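For debugging, inspect the documents the retriever returned alongside the answer.

from langchain.chains import RetrievalQA

# verbose=True logs chain execution; return_source_documents=True
# adds the retrieved documents to the result
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",
    verbose=True,
    return_source_documents=True,
)

result = qa_chain.invoke({"query": "your question"})
print("Answer:", result["result"])
for doc in result["source_documents"]:
    print(doc.metadata, doc.page_content[:200])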

Common failure: chain_type="refine" with contradictory documents produces unstable results. The model flips between conflicting answers as each new document arrives, so the final answer can depend on the order in which documents were retrieved.
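One way to detect this instability is to run the same documents in different orders and compare the final answers. The sketch below assumes a small document set, since the number of orderings grows factorially. answers_by_order is a hypothetical helper, and last_doc_wins is a deliberately naive stand-in for a refine chain, not a claim about how any particular model behaves.

```python
from itertools import permutations
from typing import Callable


def answers_by_order(answer_fn: Callable[[list[str]], str], docs: list[str]) -> set[str]:
    """Run answer_fn over every ordering of docs and collect the distinct final answers."""
    return {answer_fn(list(order)) for order in permutations(docs)}


def last_doc_wins(docs: list[str]) -> str:
    """Toy stand-in for a refine chain that always adopts the newest document's claim."""
    answer = ""
    for doc in docs:
        answer = doc  # each new document overwrites the running answer
    return answer


conflicting = ["The timeout is 30 seconds.", "The timeout is 60 seconds."]
distinct = answers_by_order(last_doc_wins, conflicting)
print(len(distinct))  # 2: the final answer depends on document order
```

With a real chain, answer_fn would wrap a call to the refine pipeline with a fixed document list. More than one distinct answer suggests the documents conflict and need deduplication, filtering, or a different chain type.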

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
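A small script can capture most of that record automatically. This is a sketch using only the Python standard library. environment_record is a hypothetical helper, and the data path and output placeholder are examples to be replaced with the values from your own run.

```python
import json
import platform
import sys
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a package, or a marker if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_record(packages: list[str], data_path: str, observed_output: str) -> dict:
    """Collect the details needed to reproduce an exercise on another machine."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": {name: package_version(name) for name in packages},
        "data_path": data_path,
        "observed_output": observed_output,
    }


# "./data/docs" and the output text are placeholders, not values from the chapter
record = environment_record(["langchain"], "./data/docs", "<paste the chain's answer here>")
print(json.dumps(record, indent=2))
```

Constraints the script cannot detect, such as model size, vector count, or GPU backend, still need to be noted by hand beside the exercise.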