How to Add Query Expansion to Improve Recall
Prerequisites: a running RAG pipeline and an LLM available for query expansion.
What this does
Short or vague queries often retrieve too few relevant documents because their wording does not match the wording used in the corpus. Query expansion uses an LLM to generate related sub-queries, synonyms, or rephrasings, runs retrieval for each one, and merges the results. This widens the retrieval net and brings in documents that the original wording alone would miss.
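To make the effect concrete, here is a minimal, self-contained sketch that uses a toy keyword matcher in place of a real vector store. The corpus, the `keyword_search` and `expanded_search` helpers, and the hand-written variants are all illustrative assumptions; in the steps below the variants come from the LLM and retrieval goes through Chroma.

```python
# Toy corpus; a real pipeline would query a vector store instead.
CORPUS = [
    "B-tree indexes reduce lookup latency for point reads.",
    "Adding an index speeds up SELECT statements on large tables.",
    "Write throughput drops when many indexes must be updated.",
]

def keyword_search(query, corpus):
    """Return documents sharing at least one lowercase word with the query."""
    terms = set(query.lower().split())
    return [doc for doc in corpus if terms & set(doc.lower().rstrip(".").split())]

def expanded_search(queries, corpus):
    """Run every query and merge the hits, keeping first-seen order."""
    merged = {}
    for q in queries:
        for doc in keyword_search(q, corpus):
            merged.setdefault(doc, None)
    return list(merged)

original = "index speed"
variants = ["lookup latency", "write throughput"]  # hand-written stand-ins for LLM output

baseline = keyword_search(original, CORPUS)
expanded = expanded_search([original] + variants, CORPUS)
print(len(baseline), len(expanded))  # prints: 1 3
```

The original wording matches only one document, while the merged result set covers all three, which is the recall gain that query expansion aims for.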
Steps
Import required modules. Set up the LLM and vector store.
import os
os.environ["OLLAMA_BASE_URL"] = "http://localhost:11434"
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
Build the vector store from documents.
docs = TextLoader("context/technical_docs.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(docs)
embeddings = OllamaEmbeddings(model="llama3")
db = Chroma.from_documents(chunks, embeddings)
Define a query expansion prompt. Instruct the LLM to generate alternative phrasings.
from langchain.prompts import PromptTemplate
expansion_prompt = PromptTemplate.from_template(
    """Given the user query, generate 3 alternative phrasings that cover different aspects.
Original query: {query}
Alternative phrasings (one per line):"""
)
Generate expanded queries and retrieve. Run the LLM to produce variants, then retrieve for each.
llm = ChatOllama(model="llama3")
original = "How does indexing affect query speed?"
response = llm.invoke(expansion_prompt.format(query=original))
# One variant per non-empty line of the LLM's reply
variants = [line.strip() for line in response.content.split("\n") if line.strip()]
queries = [original] + variants
all_results = {}
for query in queries:
    # Key on chunk text so a chunk retrieved by several queries is kept once
    for doc in db.similarity_search(query, k=5):
        all_results[doc.page_content] = doc
merged = list(all_results.values())
print(f"Retrieved {len(merged)} unique chunks from {len(queries)} queries")
Expected output: a line reporting how many unique chunks were retrieved across the original query and its variants; merged holds those chunks.
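LLM replies often include a preamble line, list numbering, or a repeat of the original query, all of which would be sent to the retriever if each line of `response.content` is used as-is. The following is a minimal sketch of a cleanup step, assuming the reply is plain text with one candidate per line; `parse_variants` and its `max_variants` parameter are hypothetical names, not part of LangChain.

```python
import re

# Matches leading list markers such as "1.", "2)", "-", "*", or "•".
_LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+")

def parse_variants(text, original, max_variants=3):
    """Turn raw LLM output into a short list of clean, unique query variants."""
    seen = {original.strip().lower()}
    variants = []
    for line in text.splitlines():
        candidate = _LIST_MARKER.sub("", line).strip().strip('"')
        # Skip blanks and preamble-style lines that end with a colon.
        if not candidate or candidate.endswith(":"):
            continue
        key = candidate.lower()
        if key in seen:
            continue
        seen.add(key)
        variants.append(candidate)
        if len(variants) == max_variants:
            break
    return variants

raw = """Here are some alternatives:
1. What impact do indexes have on query latency?
2. Do database indexes make lookups faster?
- How does indexing affect query speed?
3) Why are indexed queries quicker?
4. When do indexes slow things down?"""

cleaned = parse_variants(raw, "How does indexing affect query speed?")
print(cleaned)
```

Capping the number of variants also keeps retrieval latency predictable when the model returns more lines than the prompt asked for.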
Feed the expanded context to the LLM. The combined documents provide broader coverage.
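# merged keeps insertion order, so its first 5 entries are usually the original
# query's own hits; pass the whole merged set so chunks found by the variants are included
context = "\n\n".join(d.page_content for d in merged)
answer = llm.invoke(f"Context:\n{context}\n\nQuestion: {original}")
print(answer.content)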
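If the context has to be capped at a few chunks, slicing merged from the front favors the original query, because the dict preserves insertion order. One common alternative, which is not part of the steps above, is reciprocal rank fusion (RRF): each chunk scores the sum of 1 / (k + rank) over every result list it appears in. The sketch below uses plain strings as stand-ins for chunk text; the function name is hypothetical and k=60 is only a conventional default.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60, top_n=5):
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) across lists."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, item in enumerate(results, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]

per_query_results = [
    ["chunk_a", "chunk_b", "chunk_c"],  # original query
    ["chunk_d", "chunk_b", "chunk_a"],  # variant 1
    ["chunk_b", "chunk_d", "chunk_e"],  # variant 2
]
fused = reciprocal_rank_fusion(per_query_results, top_n=3)
print(fused)  # prints: ['chunk_b', 'chunk_d', 'chunk_a']
```

With LangChain documents, the same idea applies by fusing on doc.page_content and keeping a dict from content back to the document object.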
Verification
python -c "
from langchain_ollama import ChatOllama
import os
os.environ['OLLAMA_BASE_URL'] = 'http://localhost:11434'
llm = ChatOllama(model='llama3')
result = llm.invoke('Generate one synonym for the word retrieval')
print(len(result.content) > 0)
# Expected: True
"
Common failures
- Expansion generating irrelevant variants. Constrain the prompt to produce semantically related rephrasings rather than unrelated questions.
- Too many variants causing latency. Limit to 3-5 expansions; excessive variants slow retrieval and increase context length.
- Duplicate results inflating the merged set. Deduplicate by content hash before passing context to the LLM.
- Duplicate content overwhelming the context window. Use a reranker after merging to select the most diverse set of documents.
- Version mismatch. The installed package or runtime differs from the command shown; check the version first, then rerun the smallest verification command.
- Local environment drift. Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
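The content-hash deduplication mentioned in the list above can be sketched as follows. This is a minimal illustration on plain strings; the whitespace and case normalization is an assumption about what should count as a duplicate, and `content_key` and `dedupe_by_hash` are hypothetical helper names.

```python
import hashlib

def content_key(text):
    """Hash whitespace-normalized, lowercased text so trivial variations collide."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedupe_by_hash(texts):
    """Keep the first occurrence of each distinct content hash, preserving order."""
    seen = set()
    unique = []
    for text in texts:
        key = content_key(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

sample_chunks = [
    "Indexes speed up reads.",
    "Indexes  speed up\nreads.",  # same text, different whitespace
    "Indexes slow down writes.",
]
unique_chunks = dedupe_by_hash(sample_chunks)
print(len(unique_chunks))  # prints: 2
```

Hashing normalized text catches near-identical chunks that a dict keyed on the raw string would treat as distinct.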