12. Multi-Hop RAG
Basic RAG retrieves the most relevant chunks for a single query. Multi-hop RAG answers questions that require connecting information across multiple sources. "What was the revenue impact of the product delay mentioned in Q3, and which customers were affected?" requires finding the Q3 delay, then the revenue impact, then customer records.
The Problem with Single-Retrieval RAG
When you embed a complex question, the single resulting vector captures the dominant topics but not how they relate to each other. A question about "revenue impact of product delay" embeds close to "product delays" and "revenue." The retrieved chunks might describe product delays from unrelated quarters, or give revenue figures with no connection to the delay.
Iterative Retrieval
Multi-hop RAG uses multiple retrieval steps, where each step's results inform the next query.
import openai

# Assumed to exist elsewhere: vector_store (any store exposing similarity_search)
# and final_answer_from_context(question, context).

SYSTEM_PROMPT = """Based on the context, can you answer the question?
If yes, provide the answer. If no, generate a new search query that would find the missing information.
Return format: ANSWER: <your answer> or NEW_QUERY: <your refined query>"""

def multi_hop_query(question: str, max_hops: int = 3) -> str:
    """Multi-hop retrieval with query reformulation."""
    context = []
    current_question = question
    for hop in range(max_hops):
        # Retrieve chunks based on current question
        chunks = vector_store.similarity_search(
            current_question,
            k=3,
            filter={"source": "internal_docs"}
        )
        context.extend(chunks)
        # Use LLM to decide if we have enough info
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"Question: {question}\nContext: {context}"}
            ]
        )
        # content can be None; strip so leading whitespace does not defeat startswith
        result = (response.choices[0].message.content or "").strip()
        if result.startswith("ANSWER:"):
            # Slice by the prefix length rather than a hard-coded offset
            return result[len("ANSWER:"):].strip()
        # Extract new query for next hop
        if result.startswith("NEW_QUERY:"):
            current_question = result[len("NEW_QUERY:"):].strip()
        else:
            # Reply matched neither format; stop and fall back to the final answer
            break
    # Final answer from accumulated context
    return final_answer_from_context(question, context)
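The loop above depends on the model following the ANSWER: / NEW_QUERY: format exactly. As a minimal sketch, the parsing step could be pulled into a small helper that tolerates stray whitespace and casing. The name parse_hop_reply and the "unknown" fallback are assumptions made for illustration, not part of the original code.

```python
from typing import Tuple

def parse_hop_reply(reply: str) -> Tuple[str, str]:
    """Classify an LLM reply as 'answer', 'new_query', or 'unknown'."""
    text = (reply or "").strip()
    for prefix, kind in (("ANSWER:", "answer"), ("NEW_QUERY:", "new_query")):
        # Compare case-insensitively so "answer:" is accepted as well.
        if text.upper().startswith(prefix):
            return kind, text[len(prefix):].strip()
    # Neither prefix matched; hand the raw text back so the caller can decide.
    return "unknown", text

print(parse_hop_reply("ANSWER: enough information was found"))
print(parse_hop_reply("  new_query: customers affected by the delay"))
print(parse_hop_reply("I am not sure."))
```

Returning an explicit "unknown" kind lets the caller choose between retrying the hop and falling back to the accumulated context, instead of silently breaking out of the loop.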
Graph-Based Retrieval
A more structured approach uses a document graph where nodes are chunks and edges represent relationships between them, such as belonging to the same document, citing one another, or being adjacent in time.
class GraphRetriever:
    def __init__(self, graph_db):
        self.graph = graph_db

    def retrieve_with_hops(self, question: str, depth: int = 2):
        # Find starting nodes via embedding similarity
        # (vector_store is assumed to be defined elsewhere and to return objects with an id)
        start_nodes = vector_store.similarity_search(question, k=5)
        node_ids = [n.id for n in start_nodes]
        # Traverse graph to specified depth
        visited = set(node_ids)
        frontier = node_ids.copy()
        for _ in range(depth):
            next_frontier = []
            for node_id in frontier:
                neighbors = self.graph.get_neighbors(node_id)
                for neighbor in neighbors:
                    if neighbor not in visited:
                        visited.add(neighbor)
                        next_frontier.append(neighbor)
            frontier = next_frontier
            if not frontier:
                # Nothing new to expand; stop early
                break
        # Retrieve content for all visited nodes (set order is not guaranteed)
        return [self.graph.get_node_content(n) for n in visited]
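To try the traversal without a real graph database, a tiny in-memory stand-in is enough. The sketch below assumes graph_db only needs the two methods used in retrieve_with_hops (get_neighbors and get_node_content). The class InMemoryGraph, the helper expand, and the sample nodes are hypothetical and exist only for illustration.

```python
from collections import defaultdict

class InMemoryGraph:
    """Hypothetical stand-in for graph_db: chunk ids as nodes, undirected edges."""

    def __init__(self):
        self.content = {}
        self.edges = defaultdict(set)

    def add_node(self, node_id, text):
        self.content[node_id] = text

    def add_edge(self, a, b):
        self.edges[a].add(b)
        self.edges[b].add(a)

    def get_neighbors(self, node_id):
        return sorted(self.edges[node_id])

    def get_node_content(self, node_id):
        return self.content[node_id]

def expand(graph, start_ids, depth=2):
    """Breadth-first expansion, mirroring the loop in retrieve_with_hops."""
    visited = set(start_ids)
    frontier = list(start_ids)
    for _ in range(depth):
        next_frontier = []
        for node_id in frontier:
            for neighbor in graph.get_neighbors(node_id):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return visited

graph = InMemoryGraph()
for node_id, text in [("a", "chunk A"), ("b", "chunk B"), ("c", "chunk C"), ("d", "chunk D")]:
    graph.add_node(node_id, text)
graph.add_edge("a", "b")  # e.g. same document
graph.add_edge("b", "c")  # e.g. citation
graph.add_edge("c", "d")  # e.g. temporal neighbor

print(sorted(expand(graph, ["a"], depth=1)))  # ['a', 'b']
print(sorted(expand(graph, ["a"], depth=2)))  # ['a', 'b', 'c']
```

Each extra level of depth pulls in one more ring of neighbors, so chunk "d" stays out of reach until depth is 3.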
Failure Modes
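The iterative approach can diverge if the LLM generates irrelevant new queries, and every hop after a bad query retrieves further off-topic context. The graph approach requires building the graph upfront, which adds infrastructure. Both approaches add latency. In the iterative approach each hop costs one retrieval call plus one LLM call, so latency grows roughly linearly with hop count. In the graph approach the number of visited nodes can grow much faster than linearly with depth when nodes have many neighbors.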
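One cheap way to limit divergence in the iterative approach is to check each generated query before running it. The sketch below stops when a query repeats an earlier one or shares too few words with the original question. Token overlap (Jaccard similarity) is a crude proxy chosen here for simplicity, and a legitimate follow-up query can share few words with the original question, so the threshold needs tuning. A real system might compare embeddings instead. The function names and the 0.1 threshold are assumptions.

```python
import re

def _tokens(text: str) -> set:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def should_continue(original: str, new_query: str, previous: list,
                    min_overlap: float = 0.1) -> bool:
    """Return False when the new query repeats an earlier one or drifts off-topic."""
    normalized = " ".join(sorted(_tokens(new_query)))
    seen = {" ".join(sorted(_tokens(q))) for q in previous}
    if normalized in seen:
        return False  # same words as an earlier query: the loop is cycling
    return jaccard(original, new_query) >= min_overlap

question = "What was the revenue impact of the Q3 product delay?"
print(should_continue(question, "Q3 product delay revenue figures", []))
print(should_continue(question, "best pizza recipes", []))
print(should_continue(question, "revenue Q3 delay product figures",
                      ["Q3 product delay revenue figures"]))
```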
Implement a two-hop retriever that first finds documents about a topic, then uses those results to find specific details within those documents. Test with "What was the acquisition price mentioned in the earnings call transcripts from 2023?"
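One possible starting point for this exercise is sketched below, under heavy simplifying assumptions. Documents are plain dicts, relevance is keyword overlap rather than embedding similarity, and the corpus is a made-up placeholder in which `<PRICE>` and `<OLD_PRICE>` stand in for real values. The function two_hop and its parameters are hypothetical.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(query, text):
    """Number of distinct query words that also appear in the text."""
    return len(tokens(query) & tokens(text))

def two_hop(docs, topic_query, detail_query, top_docs=1, top_chunks=1):
    # Hop 1: rank whole documents by how well their title matches the topic.
    ranked_docs = sorted(
        docs, key=lambda d: overlap(topic_query, d["title"]), reverse=True
    )[:top_docs]
    # Hop 2: rank chunks, but only inside the documents chosen in hop 1.
    candidates = [(d["title"], c) for d in ranked_docs for c in d["chunks"]]
    candidates.sort(key=lambda pair: overlap(detail_query, pair[1]), reverse=True)
    return candidates[:top_chunks]

# Placeholder corpus for illustration only.
docs = [
    {"title": "Earnings call transcript 2023",
     "chunks": ["Opening remarks from the CFO.",
                "The acquisition price was <PRICE>."]},
    {"title": "Earnings call transcript 2022",
     "chunks": ["The acquisition price discussed last year was <OLD_PRICE>."]},
    {"title": "Product roadmap 2023",
     "chunks": ["Roadmap items for the year."]},
]

print(two_hop(docs, "earnings call transcripts 2023", "acquisition price"))
```

A single search for "acquisition price" over all chunks would score the 2022 and 2023 chunks equally. Restricting the second hop to the documents selected in the first hop is what removes that ambiguity.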