How to Use Query Rewriting for Better Recall
Prerequisites: a locally running LLM and a working RAG pipeline.
What this does
Query rewriting uses a local LLM to transform ambiguous, incomplete, or colloquial user queries into formulations that better match your document corpus. A query like "it crashed after update" may return nothing relevant, whereas rewriting it to "application crashes following software update" can retrieve the correct documentation. This technique can substantially improve recall for real-world queries whose phrasing differs from that of your stored documents.
Steps
Define rewriting strategies appropriate for your domain. Common approaches include: synonym expansion, domain term injection, ambiguity resolution, and multi-perspective reformulation (generating 2–3 variations).
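One way to keep these strategies explicit is to store an instruction per strategy and assemble the prompt from it. The following is a minimal sketch, not a prescribed design: the dictionary name, the function name build_rewrite_prompt, and the instruction wording are all hypothetical and should be tuned for your corpus.

```python
# Hypothetical instruction text for each strategy named in this step.
# The wording is an assumption; adjust it for your own domain.
STRATEGY_INSTRUCTIONS = {
    "synonym_expansion": "Rewrite the query using common synonyms for its key terms.",
    "domain_term_injection": "Rewrite the query using the standard terminology of the domain.",
    "ambiguity_resolution": (
        "Rewrite the query so that vague words and pronouns are made explicit, "
        "without adding new facts."
    ),
    "multi_perspective": "Rewrite the query in 2-3 different ways that keep the same meaning.",
}


def build_rewrite_prompt(query: str, domain: str, strategy: str) -> str:
    """Assemble a rewriting prompt for one strategy (illustrative helper)."""
    if strategy not in STRATEGY_INSTRUCTIONS:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    return (
        f"You are a {domain} search expert.\n"
        f"{STRATEGY_INSTRUCTIONS[strategy]}\n"
        "Return each rewrite on its own line.\n"
        f"Query: {query}\n"
        "Rewrites:"
    )


if __name__ == "__main__":
    print(build_rewrite_prompt("it crashed after update", "software support", "ambiguity_resolution"))
```

Keeping the strategies in data rather than in one hard-coded prompt makes it easier to compare them against each other when you review your logs.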
Implement the rewriting call against your local LLM:
import ollama

def rewrite_query(query: str, domain: str) -> list[str]:
    """Return the original query plus up to two LLM-generated rewrites."""
    prompt = f"""You are a {domain} search expert.
Rewrite the following query in 2 different ways that would help retrieve
relevant documents. Return each rewrite on its own line.
Query: {query}
Rewrites:"""
    # Send the prompt to the local model served by Ollama.
    response = ollama.chat(
        model="llama3",
        messages=[{"role": "user", "content": prompt}],
    )
    rewrites = [query]  # include original
    # Keep every non-empty line of the response as a candidate rewrite.
    rewrites += [line.strip() for line in response["message"]["content"].splitlines() if line.strip()]
    return rewrites[:3]  # original plus at most two rewrites
Integrate into your retrieval pipeline. After receiving a user query, call rewrite_query, then run vector search for each rewritten version and merge the result sets.
Apply deduplication on merged results, since different rewrites may retrieve the same documents. Use the document ID as the dedup key.
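The merge and deduplication logic can be sketched as follows. This is an illustration under stated assumptions: search_fn is a hypothetical stand-in for your vector store's search call, and each hit is assumed to be a dict with an "id" key and a "score" key where higher means more similar. Adapt both to whatever your retriever actually returns.

```python
from typing import Callable


def retrieve_with_rewrites(
    queries: list[str],
    search_fn: Callable[[str, int], list[dict]],  # hypothetical: (query, top_k) -> hits
    top_k: int = 10,
) -> list[dict]:
    """Search with every query variant and merge hits, deduplicating by document ID."""
    best_by_id: dict[str, dict] = {}
    for q in queries:
        for hit in search_fn(q, top_k):
            doc_id = hit["id"]  # assumed key; the document ID is the dedup key
            # When several rewrites return the same document, keep the best-scoring copy.
            if doc_id not in best_by_id or hit["score"] > best_by_id[doc_id]["score"]:
                best_by_id[doc_id] = hit
    # Rank the merged set by score (assumed: higher is better) and truncate.
    merged = sorted(best_by_id.values(), key=lambda h: h["score"], reverse=True)
    return merged[:top_k]
```

Keeping the maximum score per document is only one possible merge rule. Rank-based fusion is a common alternative when scores from different queries are not directly comparable.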
Set a generation budget. Cap the number of rewrites at 3–5 to avoid excessive latency; in practice, one or two rewrites searched in parallel are usually sufficient.
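A budget like this can be enforced in code by truncating the query list and running the searches concurrently. The sketch below uses the standard library's thread pool; MAX_QUERIES and search_fn are hypothetical names, and the sketch assumes your search client is safe to call from multiple threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

MAX_QUERIES = 3  # assumed budget: the original query plus at most two rewrites


def search_in_parallel(
    queries: list[str],
    search_fn: Callable[[str, int], list],  # hypothetical: (query, top_k) -> hits
    top_k: int = 10,
) -> list[list]:
    """Run one search per query variant concurrently, within the budget."""
    budgeted = queries[:MAX_QUERIES]
    if not budgeted:
        return []
    with ThreadPoolExecutor(max_workers=len(budgeted)) as pool:
        # pool.map returns results in the same order as the input queries.
        return list(pool.map(lambda q: search_fn(q, top_k), budgeted))
```

Threads fit here because the work is typically I/O-bound (waiting on a vector store or an inference server), so total latency is closer to the slowest single search than to the sum of all of them.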
Log rewrite inputs and outputs to identify patterns where rewriting helps or hurts, and refine your prompt accordingly.
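A simple way to do this is to append one JSON record per query to a log file. The following is a minimal sketch; the function name log_rewrite, the file layout, and the field names are assumptions rather than part of any existing tool.

```python
import json
import time


def log_rewrite(path: str, query: str, rewrites: list[str], latency_s: float) -> None:
    """Append one JSON Lines record describing a rewrite call (illustrative schema)."""
    record = {
        "ts": time.time(),                 # when the rewrite happened
        "query": query,                    # original user query
        "rewrites": rewrites,              # what the model produced
        "latency_s": round(latency_s, 3),  # time spent in the rewrite call
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Because each line is a standalone JSON object, the log can later be joined with retrieval metrics per query to see which kinds of rewrites help and which hurt.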
Verification
python test_rewriting.py --query "why does the system slow down when loading large files" \
--dataset eval_queries.jsonl
Expected output: A comparison table showing Recall@10 for retrieval with the original query versus retrieval with rewritten queries. As a rough target, rewriting should improve recall by 10–30% or more for queries with vocabulary gaps. Example:
Original Recall@10: 0.52
Rewritten Recall@10: 0.78
Delta: +0.26 (50% relative improvement)
Confirm the rewritten queries themselves are sensible by manually inspecting the generated rewrites in the verbose output.
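For reference, the metrics in the example output can be computed as in the sketch below. It assumes each evaluation query has a known set of relevant document IDs; the function names are illustrative and are not taken from test_rewriting.py.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative change of the improved score over the baseline."""
    return (improved - baseline) / baseline


if __name__ == "__main__":
    # Reproduces the arithmetic of the example: 0.52 -> 0.78.
    delta = 0.78 - 0.52
    rel = relative_improvement(0.52, 0.78)
    print(f"Delta: +{delta:.2f} ({rel:.0%} relative improvement)")
```

Note that the delta (+0.26) is an absolute difference in recall, while the 50% figure is relative to the baseline, so the two should not be compared directly.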
Common failures
- LLM adds hallucinated context: The model introduces facts or terminology not present in the original query, causing the rewritten query to retrieve unrelated documents. Constrain rewrites to paraphrasing rather than elaboration.
- Latency bottleneck: Using a large model for rewriting can add 1–3 seconds per query. Use a small, fast model (under 8B parameters) and limit rewriting to one call per query.
- Rewrites drift from user intent: Over-aggressive synonym substitution can shift meaning. Keep the original query in the result set and compare performance to detect drift.
- Inconsistent output format: Free-form LLM responses can be difficult to parse reliably. Use structured prompts or JSON-mode if your inference server supports it.
- No improvement on short queries: Rewriting short or already-precise queries (e.g., proper nouns) can add latency without benefit. Skip rewriting for queries under 5 tokens when retrieval confidence for the original query is already high.
- Version mismatch: The installed package or runtime differs from the command shown. Check the version first and rerun the smallest verification command.
- Local environment drift: Another service, virtual environment, model, or path is being used. Print the active binary path and configuration before changing the guide steps.
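Two of the failures above, inconsistent output format and wasted work on short queries, can be guarded against in code. The sketch below assumes the prompt asks the model for a JSON array of strings and falls back to line splitting when the response is not valid JSON. The function names are hypothetical, and the token check uses whitespace splitting as a crude proxy for token count; it does not model retrieval confidence.

```python
import json


def parse_rewrites(raw: str, max_rewrites: int = 2) -> list[str]:
    """Parse a response expected to be a JSON array of strings, with a plain-text fallback."""
    try:
        data = json.loads(raw)
        # Accept only a JSON array; any other JSON value yields no rewrites.
        candidates = [str(item).strip() for item in data] if isinstance(data, list) else []
    except json.JSONDecodeError:
        # Fallback: treat each non-empty line as one rewrite.
        candidates = [line.strip() for line in raw.splitlines()]
    return [c for c in candidates if c][:max_rewrites]


def should_rewrite(query: str, min_tokens: int = 5) -> bool:
    """Skip rewriting for very short queries (whitespace tokens as a rough proxy)."""
    return len(query.split()) >= min_tokens
```

If parsing yields no rewrites, the caller can simply search with the original query alone, which keeps the pipeline working when the model misbehaves.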
Related guides
- improve-embedding-quality-retrieval
- implement-hybrid-search-keyword-semantic