Multi-Hop RAG — RAG Systems: Part 2 (Chapter 12)

Basic RAG retrieves the most relevant chunks for a single query. Multi-hop RAG answers questions that require connecting information across multiple sources. For example, "What was the revenue impact of the product delay mentioned in Q3, and which customers were affected?" requires three dependent lookups: first the Q3 delay, then its revenue impact, then the affected customer records.

The Problem with Single-Retrieval RAG

When you embed a complex question, the embedding captures the dominant topics but loses how they relate to each other. A question about "revenue impact of product delay" embeds close to "product delays" and "revenue." The retrieved chunks might cover product delays from unrelated quarters, or revenue figures with no mention of the delay.

Iterative Retrieval

Multi-hop RAG uses multiple retrieval steps, where each step's results inform the next query.

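import openai

# Assumed to be defined elsewhere:
#   vector_store: a vector store client exposing similarity_search()
#   final_answer_from_context(question, context): fallback that asks the LLM
#     for a best-effort answer from the accumulated chunks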

def multi_hop_query(question: str, max_hops: int = 3) -> str:
    """Multi-hop retrieval with query reformulation."""
    context = []
    current_question = question
    
    for hop in range(max_hops):
        # Retrieve chunks based on current question
        chunks = vector_store.similarity_search(
            current_question,
            k=3,
            filter={"source": "internal_docs"}
        )
        
        # Skip chunks already collected on an earlier hop
        context.extend(c for c in chunks if c not in context)
        
        # Use LLM to decide if we have enough info
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": """Based on the context, can you answer the question? 
If yes, provide the answer. If no, generate a new search query that would find the missing information.
Return format: ANSWER: <your answer> or NEW_QUERY: <your refined query>"""},
                {"role": "user", "content": f"Question: {question}\nContext: {context}"}
            ]
        )
        
        # content can be None, and leading whitespace would break the prefix checks
        result = (response.choices[0].message.content or "").strip()
        
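        if result.startswith("ANSWER:"):
            # Slice by prefix length; a hard-coded 8 drops a character
            # when the model omits the space after the colon
            return result[len("ANSWER:"):].strip()
        
        # Extract new query for next hop
        if result.startswith("NEW_QUERY:"):
            current_question = result[len("NEW_QUERY:"):].strip()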
        else:
            break
    
    # Final answer from accumulated context
    return final_answer_from_context(question, context)
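
The loop above depends on two pieces the listing does not define: parsing the model's reply and the `final_answer_from_context` fallback. A minimal sketch of both follows, under stated assumptions: `HopDecision` and `parse_hop_reply` are hypothetical names, chunks are treated as plain strings, and the LLM is passed in as a plain callable (`llm`, a parameter the original two-argument call does not have) so the code can run without network access.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class HopDecision:
    """Hypothetical container for one parsed model reply."""
    answer: Optional[str] = None
    new_query: Optional[str] = None


def parse_hop_reply(reply: str) -> HopDecision:
    """Parse 'ANSWER: ...' or 'NEW_QUERY: ...'; anything else yields an empty decision."""
    text = (reply or "").strip()
    for prefix, field in (("ANSWER:", "answer"), ("NEW_QUERY:", "new_query")):
        if text.upper().startswith(prefix):
            value = text[len(prefix):].strip()
            # An empty payload is treated the same as an unparseable reply.
            return HopDecision(**{field: value}) if value else HopDecision()
    return HopDecision()


def final_answer_from_context(
    question: str,
    context: Sequence[str],
    llm: Callable[[str], str],
) -> str:
    """Assumed fallback: ask for a best-effort answer from whatever was retrieved."""
    # Drop duplicate chunks while keeping first-seen order.
    unique_chunks = list(dict.fromkeys(context))
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(unique_chunks, 1))
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Question: {question}\n\nContext:\n{numbered}"
    )
    return llm(prompt).strip()
```

To match the two-argument call in the loop, the model callable could be bound once with `functools.partial(final_answer_from_context, llm=my_llm)`, where `my_llm` wraps whichever client is in use.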

Graph-Based Retrieval

A more structured approach uses a document graph in which nodes are chunks and edges represent relationships between them, such as a shared source document, a citation, or temporal order.

class GraphRetriever:
    def __init__(self, graph_db):
        self.graph = graph_db
        
    def retrieve_with_hops(self, question: str, depth: int = 2):
        # Find starting nodes via embedding similarity
        start_nodes = vector_store.similarity_search(question, k=5)
        node_ids = [n.id for n in start_nodes]
        
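        # Traverse the graph breadth-first to the specified depth.
        # `ordered` records discovery order (seeds first, then hop 1, hop 2, ...),
        # because iterating over a set returns nodes in arbitrary order.
        ordered = list(dict.fromkeys(node_ids))
        visited = set(ordered)
        frontier = list(ordered)
        
        for _ in range(depth):
            next_frontier = []
            for node_id in frontier:
                neighbors = self.graph.get_neighbors(node_id)
                for neighbor in neighbors:
                    if neighbor not in visited:
                        visited.add(neighbor)
                        ordered.append(neighbor)
                        next_frontier.append(neighbor)
            frontier = next_frontier
        
        # Retrieve content for all visited nodes, closest hops first
        return [self.graph.get_node_content(n) for n in ordered]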
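
The edge types mentioned above can also steer the traversal. The sketch below is a toy, in-memory illustration rather than a real graph database client: `TypedGraph`, `expand`, the relation labels, and the node names are all hypothetical, and seed IDs are passed in directly instead of coming from a vector search.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple


class TypedGraph:
    """Toy in-memory graph whose edges carry a relationship label (illustrative only)."""

    def __init__(self) -> None:
        self._edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add_edge(self, source: str, target: str, relation: str) -> None:
        # Store both directions so traversal can follow a link either way.
        self._edges[source].append((target, relation))
        self._edges[target].append((source, relation))

    def get_neighbors(self, node_id: str, relations: Optional[Set[str]] = None) -> List[str]:
        # relations=None means "follow every edge type".
        return [
            target
            for target, relation in self._edges.get(node_id, [])
            if relations is None or relation in relations
        ]


def expand(
    graph: TypedGraph,
    seeds: List[str],
    depth: int = 2,
    relations: Optional[Set[str]] = None,
) -> List[str]:
    """Breadth-first expansion from the seeds; returns node IDs in discovery order."""
    ordered = list(dict.fromkeys(seeds))  # de-duplicate seeds, keep order
    visited = set(ordered)
    frontier = list(ordered)
    for _ in range(depth):
        next_frontier = []
        for node_id in frontier:
            for neighbor in graph.get_neighbors(node_id, relations):
                if neighbor not in visited:
                    visited.add(neighbor)
                    ordered.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return ordered


# Made-up example: a delay notice is cited by a revenue note, which is cited by
# a customer list; the delay notice also shares a document with a release schedule.
graph = TypedGraph()
graph.add_edge("delay_notice", "revenue_note", "citation")
graph.add_edge("revenue_note", "customer_list", "citation")
graph.add_edge("delay_notice", "release_schedule", "same_document")

all_edges = expand(graph, ["delay_notice"], depth=2)
citations_only = expand(graph, ["delay_notice"], depth=2, relations={"citation"})
print(all_edges)
print(citations_only)
```

Restricting the traversal to citation edges still reaches the customer list in two hops while leaving out the release schedule, which keeps the context passed to the LLM smaller.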

Failure Modes

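The iterative approach can diverge if the LLM generates irrelevant new queries, because each off-topic query pulls in chunks that push the next query further off course. The graph approach requires building the graph upfront, which adds infrastructure. Both approaches add latency with every hop. In the iterative approach each hop costs one retrieval plus one LLM call, and the prompt grows as context accumulates, so latency grows at least linearly with hop count. In the graph approach the number of visited nodes can grow much faster than linearly with depth when nodes have many neighbors, so depth should stay small.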
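
One hedge against divergence in the iterative loop is to vet each proposed query before running it. The sketch below is one possible heuristic, not a method taken from this chapter: `QueryGuard` is a hypothetical helper that rejects repeated queries and queries sharing too few words with the original question, and the `min_overlap` default of 0.2 is an arbitrary starting point. Word overlap is a crude proxy for relatedness; embedding similarity would be a natural substitute.

```python
import re
from typing import Set


def _words(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens; deliberately simple."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class QueryGuard:
    """Hypothetical guard that vets reformulated queries between hops."""

    def __init__(self, original_question: str, min_overlap: float = 0.2):
        self.original_words = _words(original_question)
        self.min_overlap = min_overlap
        self.seen = {self._normalize(original_question)}

    @staticmethod
    def _normalize(query: str) -> str:
        # Word order and punctuation are ignored when detecting repeats.
        return " ".join(sorted(_words(query)))

    def accept(self, new_query: str) -> bool:
        """Return True if the query is new and still related to the original question."""
        key = self._normalize(new_query)
        if not key or key in self.seen:
            return False  # empty or already tried: another hop would add nothing
        self.seen.add(key)
        query_words = _words(new_query)
        # Fraction of the new query's words that also appear in the original question.
        overlap = len(query_words & self.original_words) / len(query_words)
        return overlap >= self.min_overlap


guard = QueryGuard("What was the revenue impact of the product delay in Q3?")
print(guard.accept("Q3 product delay customers affected"))  # related and new
print(guard.accept("customers affected Q3 delay product"))  # same words, already tried
print(guard.accept("best pizza recipes"))                   # no overlap with the question
```

In the multi-hop loop, a rejected query would end the hops early and fall through to the final answer step, which also caps the latency cost of a run that has started to drift.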