Agentic Retrieval — RAG Systems: Part 2 (Chapter 11)

Agentic retrieval uses an LLM as an agent that reasons about queries, decomposes complex questions, and executes multi-step retrieval chains. Unlike static pipelines, agentic retrieval enables dynamic, self-correcting search strategies.

When Static Pipelines Fail

Static pipelines (query → retrieval → answer) have fixed logic. They can't:

Recognize when initial retrieval failed to find relevant information
Decompose multi-hop questions that require chaining multiple searches
Adjust strategy mid-retrieval based on partial results
Ask clarifying questions when queries are ambiguous

Agentic retrieval addresses these limitations by putting the LLM in control of the retrieval process.

The ReAct Pattern

ReAct (Reasoning + Acting) interleaves reasoning traces with action executions:

Thought: I need to find information about X
Action: retrieve(query=X)
Observation: Retrieved 5 documents
Thought: Document 3 mentions Y, I need more details about Y
Action: retrieve(query=Y specifically)
Observation: Retrieved additional documents
Thought: Now I have enough information to answer the original question
Final Answer: ...
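
Before turning to a framework, it can help to see how little machinery the loop itself needs. The following is a minimal, framework-free sketch, not a production implementation. It assumes the model writes each action as separate `Action:` and `Action Input:` lines, that `llm` is any callable mapping the transcript so far to the next chunk of text, and that `tools` maps tool names to single-argument functions. The names `react_loop`, `ACTION_RE`, and `FINAL_RE` are hypothetical.

```python
import re
from typing import Callable, Dict, List, Tuple

# Matches "Action: <tool>" followed on the next line by "Action Input: <text>".
ACTION_RE = re.compile(r"Action:\s*(\w+)\s*\nAction Input:\s*(.*)")
FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.DOTALL)


def react_loop(
    question: str,
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Tuple[str, List[Tuple[str, str]]]:
    """Run Thought/Action/Observation steps until a final answer or the step limit."""
    transcript = f"Question: {question}\n"
    calls: List[Tuple[str, str]] = []  # (tool name, tool input) for each action taken

    for _ in range(max_steps):
        output = llm(transcript)
        transcript += output + "\n"

        # A final answer ends the loop immediately.
        final = FINAL_RE.search(output)
        if final:
            return final.group(1).strip(), calls

        action = ACTION_RE.search(output)
        if action is None:
            # Feed the formatting problem back so the model can correct itself.
            transcript += "Observation: could not parse an action; use the required format.\n"
            continue

        name, tool_input = action.group(1), action.group(2).strip()
        calls.append((name, tool_input))
        if name in tools:
            observation = tools[name](tool_input)
        else:
            observation = f"unknown tool '{name}'"
        transcript += f"Observation: {observation}\n"

    return "No answer within the step limit.", calls
```

Each observation is appended to the transcript, so the next reasoning step sees everything retrieved so far. That accumulated transcript is what lets the agent change course mid-retrieval.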

from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.tools import tool
# In LangChain 0.x, AgentExecutor and create_react_agent are exported from
# `langchain.agents`, not `langchain_core`.
from langchain.agents import AgentExecutor, create_react_agent

# `vectorstore` is assumed to be an already-initialized LangChain vector store.

@tool
def retrieve_documents(query: str, k: int = 5) -> str:
    """Retrieve relevant documents from the knowledge base.

    Args:
        query: Search query
        k: Number of documents to retrieve (default 5)

    Returns:
        Retrieved document contents as a string
    """
    results = vectorstore.similarity_search(query, k=k)
    return "\n---\n".join(
        f"[Document {i+1}]:\n{doc.page_content}" for i, doc in enumerate(results)
    )

@tool
def rewrite_query(query: str) -> str:
    """Rewrite query to better match document vocabulary."""
    # Implementation from Chapter 6
    ...

class ReActRetriever:
    def __init__(self, tools, llm):
        self.tools = tools
        self.llm = llm

        # create_react_agent requires the {tools}, {tool_names}, and
        # {agent_scratchpad} variables; the scratchpad carries prior
        # Thought/Action/Observation steps back into the prompt.
        prompt = PromptTemplate.from_template("""
You are a research assistant. Your goal is to answer user questions by 
retrieving information from the knowledge base.

You have access to these tools:
{tools}

Follow this format:
Thought: [what you're thinking about next]
Action: [tool name; one of {tool_names}]
Action Input: [input to the tool]
Observation: [result from the tool]
... (repeat Thought/Action/Action Input/Observation as needed)
Final Answer: [your final answer]

Question: {input}
Thought:{agent_scratchpad}""")

        # A text-based ReAct agent matches the Thought/Action format above
        self.agent = create_react_agent(llm, self.tools, prompt)

    def retrieve(self, query, max_steps=10):
        """
        Agentic retrieval with self-correction.

        Args:
            query: User question
            max_steps: Maximum retrieval steps before forcing answer
        """
        # max_iterations is an AgentExecutor setting rather than an invoke()
        # option, so the executor is built per call with the requested limit.
        executor = AgentExecutor(
            agent=self.agent,
            tools=self.tools,
            max_iterations=max_steps,
            return_intermediate_steps=True,
            handle_parsing_errors=True,
            verbose=True,
        )
        try:
            result = executor.invoke({"input": query})
            # intermediate_steps is a list of (AgentAction, observation) pairs
            steps = result.get('intermediate_steps', [])
            return {
                'answer': result['output'],
                'steps': steps,
                'retrieval_count': sum(
                    1 for action, _ in steps
                    if action.tool == 'retrieve_documents'
                ),
            }
        except Exception as e:
            return {
                'answer': f"Error during retrieval: {str(e)}",
                'steps': [],
                'retrieval_count': 0
            }

Multi-Hop Retrieval

Multi-hop questions require chaining multiple retrievals where each step depends on previous results:

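def multi_hop_agent(query, vectorstore, llm):
    """
    Multi-hop retrieval that chains queries based on intermediate results.

    Assumes three helpers defined elsewhere: parse_hops (plan text -> list of
    dicts with 'topic' and 'query' keys), substitute_context (fills a hop's
    query with facts found in earlier hops), and format_documents
    (documents -> prompt-ready string).
    """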
    # Parse the question to identify hops
    hop_plan = llm.invoke(f"""Analyze this question and identify the retrieval hops needed.
Each hop should be answerable by retrieving a single document or set of documents.

Question: {query}

Breakdown:""")
    
    # Parse planned hops
    hops = parse_hops(hop_plan.content)
    
    context = ""
    hop_results = []
    
    for i, hop in enumerate(hops):
        # Substitute context from previous hops into current query
        current_query = substitute_context(hop['query'], context)
        
        # Execute retrieval
        docs = vectorstore.similarity_search(current_query, k=5)
        hopped_context = format_documents(docs)
        
        context += f"\n\n[Hop {i+1}: {hop['topic']}]\n{hopped_context}"
        hop_results.append({
            'hop': i+1,
            'query': current_query,
            'results': docs
        })
    
    return {
        'context': context,
        'hop_results': hop_results
    }

# Example multi-hop question
# "Who approved the contract with the vendor and what was the total value?"
# Hop 1: Find the vendor contract
# Hop 2: Identify who approved it
# Hop 3: Extract contract value
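
The `parse_hops` helper used above is not defined in this section. One possible sketch follows, assuming the planner is prompted to emit one `Hop N: description` line per hop, as in the example comment. Using the description as both the `topic` and the initial `query` is a simplifying assumption, and `HOP_LINE` is a hypothetical name.

```python
import re
from typing import Dict, List

# One hop per line, e.g. "Hop 2: Identify who approved it".
HOP_LINE = re.compile(r"^\s*Hop\s+(\d+)\s*:\s*(.+?)\s*$", re.IGNORECASE)


def parse_hops(plan_text: str) -> List[Dict[str, str]]:
    """Turn a 'Hop N: ...' plan into a list of {'topic', 'query'} dicts."""
    hops = []
    for line in plan_text.splitlines():
        match = HOP_LINE.match(line)
        if match:
            description = match.group(2)
            # Assumption: the hop description doubles as its search query.
            hops.append({"topic": description, "query": description})
    return hops
```

Lines that do not match the pattern are skipped, so an LLM preamble such as "Breakdown:" does not produce a spurious hop. An empty result signals that the plan could not be parsed, and the caller may want to fall back to a single retrieval with the original question.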

Self-Correction Loop

Agentic retrieval can detect and correct failures:

def self_correcting_retrieval(query, vectorstore, llm, max_attempts=3):
    """
    Retrieval with automatic self-correction.

    Assumes format_documents and extract_new_query are defined elsewhere;
    extract_new_query should return None (or an empty string) when the model
    answers "New query: N/A".
    """
    attempt = 0
    all_retrieved = []
    current_query = query  # first attempt uses the original question

    while attempt < max_attempts:
        attempt += 1

        # Retrieve, keeping only documents not seen in earlier attempts
        results = vectorstore.similarity_search(current_query, k=10)
        new_docs = [doc for doc in results if doc not in all_retrieved]
        all_retrieved.extend(new_docs)

        # Judge sufficiency against everything gathered so far, since the
        # accumulated set (not just the latest batch) is what gets returned
        sufficiency_check = llm.invoke(f"""
Given these retrieved documents:
{format_documents(all_retrieved)}

And the original question: {query}

Is this sufficient to answer the question? If not, what information is missing?
What new query would retrieve the missing information?

Answer format:
Sufficient: YES/NO
Missing information: [description or N/A]
New query: [query or N/A]
""")

        # Normalize case and spacing so "sufficient:  yes" is still recognized
        verdict = " ".join(sufficiency_check.content.upper().split())
        if "SUFFICIENT: YES" in verdict:
            return {
                'documents': all_retrieved,
                'attempts': attempt,
                'sufficient': True
            }

        # Parse new query for retry; stop if the model offered none
        current_query = extract_new_query(sufficiency_check.content)

        if not current_query:
            break

    return {
        'documents': all_retrieved,
        'attempts': attempt,
        'sufficient': False
    }
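
The retry step depends on `extract_new_query`, which is not defined in this section. A minimal sketch is given here, assuming the model follows the `New query: ...` line from the answer format in the prompt. The regular expression and the treatment of `N/A` as "no retry" are assumptions, and `NEW_QUERY_RE` is a hypothetical name.

```python
import re
from typing import Optional

# Capture the rest of the "New query:" line; [ \t]* avoids spilling onto the next line.
NEW_QUERY_RE = re.compile(r"^[ \t]*New query:[ \t]*(.*)$", re.IGNORECASE | re.MULTILINE)


def extract_new_query(response_text: str) -> Optional[str]:
    """Return the follow-up query proposed by the model, or None if there is none."""
    match = NEW_QUERY_RE.search(response_text)
    if match is None:
        return None
    # Drop surrounding whitespace, quotes, and the brackets used in the format spec.
    candidate = match.group(1).strip().strip('"[]').strip()
    if not candidate or candidate.upper() == "N/A":
        return None
    return candidate
```

Returning `None` for a missing or `N/A` query lets the `if not ...: break` check in the loop end the retries cleanly instead of searching for the literal string "N/A".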

Tool Use Efficiency

Agentic retrieval can be expensive because the agent may make many retrieval calls. Optimize by:

Caching retrieval results: Store embeddings and BM25 scores for common sub-queries.
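
As a simple illustration, the sketch below memoizes whole result lists rather than embeddings or BM25 scores, which is a coarser form of caching than the one described above but shows the same idea. It assumes `search_fn` is any callable taking `(query, k)` and returning a list. `CachedRetriever` and its `hits`/`misses` counters are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple


class CachedRetriever:
    """Memoize retrieval results by normalized query text and k."""

    def __init__(self, search_fn: Callable[[str, int], List[str]]):
        self._search_fn = search_fn
        self._cache: Dict[Tuple[str, int], List[str]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _normalize(query: str) -> str:
        # Lowercase and collapse whitespace so trivially different
        # phrasings of the same sub-query share one cache entry.
        return " ".join(query.lower().split())

    def search(self, query: str, k: int = 5) -> List[str]:
        key = (self._normalize(query), k)
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = self._search_fn(query, k)
        # Return a copy so callers cannot mutate the cached list.
        return list(self._cache[key])
```

With a LangChain vector store, `search_fn` could be `lambda q, k: vectorstore.similarity_search(q, k=k)`. The cache has no eviction or invalidation, so it would need to be cleared whenever the index is updated.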

Parallel retrieval: When multiple independent hops are identified, execute them concurrently.

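import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

logger = logging.getLogger(__name__)

def parallel_retrieval(queries, vectorstore):
    """Execute multiple retrieval queries in parallel."""
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Map each future back to the query that produced it
        futures = {
            executor.submit(vectorstore.similarity_search, q, k=5): q
            for q in queries
        }
        results = {}
        # Collect results as each search finishes rather than in submission order
        for future in as_completed(futures):
            query = futures[future]
            try:
                results[query] = future.result()
            except Exception as e:
                # A failed query yields an empty list so the other results survive
                results[query] = []
                logger.error("Retrieval failed for %s: %s", query, e)
    return results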

Early stopping: If the agent already has enough evidence after a few retrieval steps (for example, a sufficiency check passes or the top results score highly), stop rather than continuing to max_steps.
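
One way to make the stopping rule concrete is a score threshold. The sketch below is illustrative only: it assumes `search_fn` returns `(doc_id, score)` pairs where a higher score means a better match (some vector stores return distances instead, where lower is better), and the default threshold of 0.8 is an arbitrary placeholder that would need tuning. `retrieve_until_confident` is a hypothetical name.

```python
from typing import Callable, List, Tuple


def retrieve_until_confident(
    queries: List[str],
    search_fn: Callable[[str], List[Tuple[str, float]]],
    score_threshold: float = 0.8,
    max_steps: int = 10,
) -> Tuple[List[str], int]:
    """Run queries in order, stopping once the best score reaches the threshold."""
    collected: List[str] = []
    steps = 0
    for query in queries[:max_steps]:
        steps += 1
        hits = search_fn(query)
        # Keep each document once, in the order first seen.
        for doc_id, _score in hits:
            if doc_id not in collected:
                collected.append(doc_id)
        # Stop as soon as any hit in this step is a confident match.
        best = max((score for _, score in hits), default=0.0)
        if best >= score_threshold:
            break
    return collected, steps
```

The returned step count makes it easy to log how often the threshold saves retrieval calls compared with always running to `max_steps`.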