RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.


© 2026 runlocalai.co · Independently operated
HOW-TO · RAG

How to Add Query Expansion to Improve Recall

Intermediate · 20 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

A running RAG pipeline and an LLM available for expansion (the steps below use llama3 served by Ollama)

What this does

Short or vague queries often retrieve too few relevant documents because their wording does not match the wording used in the corpus. Query expansion uses an LLM to generate related sub-queries, synonyms, or rephrasings, runs retrieval for each one, and merges the results. This broadens the retrieval surface and surfaces documents that a single query would miss.
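
As a rough, dependency-free illustration of the idea, the sketch below uses a toy keyword-overlap retriever in place of a vector store. The corpus, the retrieve and expanded_retrieve helpers, and the hand-written variants (standing in for LLM output) are all assumptions made up for this example, not part of the pipeline in the steps that follow.

```python
# Toy illustration only: keyword overlap stands in for embedding similarity.
CORPUS = {
    "d1": "an index lets the database skip a full table scan",
    "d2": "b-tree lookups reduce read latency for selective filters",
    "d3": "query speed depends on how rows are located on disk",
}

def retrieve(query, k=2):
    """Return ids of up to k documents that share at least one word with the query."""
    words = set(query.lower().split())
    scored = []
    for doc_id, text in CORPUS.items():
        overlap = len(words & set(text.split()))
        if overlap:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]

def expanded_retrieve(queries, k=2):
    """Retrieve for every query and merge the ids, keeping first-seen order."""
    merged = []
    for query in queries:
        for doc_id in retrieve(query, k):
            if doc_id not in merged:
                merged.append(doc_id)
    return merged

original = "how does indexing affect query speed"
# Hand-written stand-ins for what an LLM might return.
variants = [
    "does an index avoid a full table scan",
    "what reduces read latency for lookups",
]

only_original = retrieve(original)
with_expansion = expanded_retrieve([original] + variants)
print(only_original)   # ['d3']
print(with_expansion)  # ['d3', 'd1', 'd2']
```

The original wording only matches d3, while the two rephrasings pull in d1 and d2. Embedding search is more forgiving than word overlap, but the same effect applies when a query sits far from the relevant chunks in embedding space.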

Steps

  1. Import required modules. Set up the LLM and vector store.

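    import os
    
    # Default local Ollama endpoint. If your server runs elsewhere, also pass
    # base_url=... to ChatOllama and OllamaEmbeddings.
    os.environ["OLLAMA_BASE_URL"] = "http://localhost:11434"
    
    from langchain_ollama import ChatOllama, OllamaEmbeddings  # LLM and embedding clients
    from langchain_community.vectorstores import Chroma  # vector store
    from langchain_text_splitters import RecursiveCharacterTextSplitter  # chunking
    from langchain_community.document_loaders import TextLoader  # plain-text loader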
    
  2. Build the vector store from documents.

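    # Load the source text and split it into chunks of at most 500 characters.
    docs = TextLoader("context/technical_docs.txt").load()
    chunks = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(docs)
    # llama3 works here, but a dedicated embedding model usually retrieves better.
    embeddings = OllamaEmbeddings(model="llama3")
    # Embed every chunk and index it in a Chroma collection.
    db = Chroma.from_documents(chunks, embeddings)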
    
  3. Define a query expansion prompt. Instruct the LLM to generate alternative phrasings.

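    from langchain_core.prompts import PromptTemplate
    
    # Ask for a fixed number of rephrasings in a format that splits cleanly on newlines.
    expansion_prompt = PromptTemplate.from_template(
        """Given the user query, generate 3 alternative phrasings that cover different aspects.
    Return only the phrasings, one per line, with no numbering or commentary.
    Original query: {query}
    Alternative phrasings (one per line):"""
    )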
    
  4. Generate expanded queries and retrieve. Run the LLM to produce variants, then retrieve for each.

    llm = ChatOllama(model="llama3")
    original = "How does indexing affect query speed?"
    
    # Ask the LLM for variants; keep non-empty lines and cap the count at 3.
    response = llm.invoke(expansion_prompt.format(query=original))
    variants = [line.strip() for line in response.content.split("\n") if line.strip()][:3]
    
    # Retrieve for the original and each variant, de-duplicating by chunk text.
    queries = [original] + variants
    all_results = {}
    for query in queries:
        for doc in db.similarity_search(query, k=5):
            all_results[doc.page_content] = doc
    merged = list(all_results.values())
    print(f"Retrieved {len(merged)} unique chunks from {len(queries)} queries")
    

    Expected output: one line reporting how many unique chunks were retrieved across the original query and its variants. The merged list holds the de-duplicated documents.

  5. Feed the expanded context to the LLM. The combined documents provide broader coverage.

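    # merged lists the original query's hits first, so with k=5 the slice merged[:5]
    # is normally just the unexpanded results. Use a larger cap so chunks found only
    # by the variants reach the prompt, while still fitting the context window.
    context = "\n\n".join(d.page_content for d in merged[:10])
    answer = llm.invoke(f"Context:\n{context}\n\nQuestion: {original}")
    print(answer.content)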
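
Models do not always follow the "one per line" instruction: they may add a preamble, number the lines, or repeat the original query. The sketch below is one way to clean the raw output before retrieval. The parse_variants name, its limit parameter, and the sample raw string are assumptions for illustration only.

```python
import re

def parse_variants(raw, original, limit=3):
    """Turn raw LLM output into a short list of clean, distinct query variants."""
    variants = []
    seen = {original.strip().lower()}  # never re-run the original as a "variant"
    for line in raw.splitlines():
        # Drop list markers such as "1.", "2)", "-", "*" or a bullet at the line start.
        text = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip().strip('"')
        # Skip blank lines and preamble-style lines that end with a colon.
        if not text or text.endswith(":"):
            continue
        key = text.lower()
        if key not in seen:
            seen.add(key)
            variants.append(text)
        if len(variants) == limit:
            break
    return variants

# Made-up sample of messy model output.
raw = """Here are 3 alternative phrasings:
1. What impact do indexes have on query performance?
2. How does indexing affect query speed?
- "Why are indexed lookups faster than full scans?"

4) Do indexes slow down writes?
5) Extra variant"""

cleaned = parse_variants(raw, "How does indexing affect query speed?")
print(cleaned)
```

In step 4 this could replace the list comprehension, for example variants = parse_variants(response.content, original). The limit argument also enforces the cap on expansions.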
    

Verification

python -c "
from langchain_ollama import ChatOllama
import os
os.environ['OLLAMA_BASE_URL'] = 'http://localhost:11434'
llm = ChatOllama(model='llama3')
result = llm.invoke('Generate one synonym for the word retrieval')
print(len(result.content) > 0)
# Expected: True
"
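
The command above only confirms that the model responds. To check whether expansion actually improves recall on your corpus, one option is to label the relevant chunks for a few test questions by hand and compare retrieval with and without expansion. The sketch below assumes a hypothetical recall helper and uses placeholder chunk ids; the numbers it prints illustrate the calculation and are not measurements.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant items that appear in the retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)

# Placeholder labels: chunks a person judged relevant to one test question.
relevant = {"chunk-02", "chunk-07", "chunk-11", "chunk-15"}
# Placeholder retrieval results, without and with expansion.
baseline_hits = ["chunk-02", "chunk-03", "chunk-07", "chunk-09", "chunk-20"]
expanded_hits = baseline_hits + ["chunk-11", "chunk-04", "chunk-15"]

baseline_recall = recall(baseline_hits, relevant)
expanded_recall = recall(expanded_hits, relevant)
print(f"baseline recall: {baseline_recall:.2f}")
print(f"expanded recall: {expanded_recall:.2f}")
```

Expansion can only add to the merged set, so recall should never drop. The useful question is whether the gain justifies the extra LLM call and the longer context.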

Common failures

  • Expansion generating irrelevant variants. Constrain the prompt to produce semantically related rephrasings rather than unrelated questions.
  • Too many variants causing latency. Limit to 3-5 expansions; excessive variants slow retrieval and increase context length.
  • Duplicate results inflating the merged set. Deduplicate by content hash before passing context to the LLM.
  • Near-duplicate chunks crowding the context window. Rerank after merging and keep only a small, diverse set of documents.
  • Version mismatch. The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
  • Local environment drift. Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
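
The duplicate-related failures above can be handled at merge time. The sketch below combines two ideas: de-duplicating by a hash of normalized text, and ordering the merged set with reciprocal rank fusion so that chunks found by several variants rank first. Reciprocal rank fusion is swapped in here as a lightweight stand-in for a reranker; it is not what the steps above use. The content_key and fuse names are hypothetical, and k=60 is a commonly used default for this method.

```python
import hashlib
from collections import defaultdict

def content_key(text):
    """Hash of lowercased, whitespace-normalized text, so trivial copies collide."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def fuse(ranked_lists, top_n=5, k=60):
    """Merge ranked lists of chunk texts; one entry per unique chunk, best first."""
    scores = defaultdict(float)
    first_seen = {}
    for hits in ranked_lists:
        counted = set()
        for rank, text in enumerate(hits, start=1):
            key = content_key(text)
            if key in counted:
                continue  # count each chunk at most once per list
            counted.add(key)
            first_seen.setdefault(key, text)
            # Reciprocal rank fusion: higher ranks contribute larger scores.
            scores[key] += 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda key: -scores[key])
    return [first_seen[key] for key in ordered[:top_n]]

# Made-up ranked results for the original query and two variants.
ranked_lists = [
    ["A text", "B text", "C text"],
    ["B  text", "D text", "a text"],  # same chunks with different spacing or case
    ["D text", "B text"],
]
top = fuse(ranked_lists, top_n=3)
print(top)  # ['B text', 'D text', 'A text']
```

With LangChain documents, pass doc.page_content for each hit and keep a mapping from content_key back to the document if metadata is needed. A dedicated reranker remains the better choice when diversity or fine-grained relevance matters.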

Related guides

  • How to Build a Basic RAG Pipeline with LangChain
  • How to Add Reranking to Your RAG Pipeline