RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
RAG Systems: Part 2

11. Agentic Retrieval

Chapter 11 of 22 · 30 min
KEY INSIGHT

Agentic retrieval enables dynamic, self-correcting search by putting the LLM in control, allowing it to recognize failures, decompose complex questions, and chain multiple retrievals.

Agentic retrieval uses an LLM as an agent that reasons about the query, decomposes complex questions, and executes multi-step retrieval chains, adapting its search strategy as results come in.

When Static Pipelines Fail

Static pipelines (query → retrieval → answer) have fixed logic. They can't:

  • Recognize when initial retrieval failed to find relevant information
  • Decompose multi-hop questions that require chaining multiple searches
  • Adjust strategy mid-retrieval based on partial results
  • Ask clarifying questions when queries are ambiguous

Agentic retrieval addresses these by putting the LLM in control of the retrieval process.
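
For contrast, a static pipeline can be sketched as a single fixed pass. The toy keyword retriever and the `answer_fn` callable below are illustrative stand-ins (assumptions, not part of any library); the point is only that nothing in the control flow reacts to a poor retrieval.

```python
from typing import Callable, List


def keyword_retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever (assumption): rank documents by shared lowercase words."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.lower().split())), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep only the top-k documents that share at least one word with the query
    return [doc for score, doc in scored[:k] if score > 0]


def static_pipeline(query: str, corpus: List[str], answer_fn: Callable) -> object:
    """query -> retrieval -> answer, with no branch for failure or retry."""
    docs = keyword_retrieve(query, corpus)
    # Even when docs is empty, the pipeline proceeds straight to answering.
    return answer_fn(query, docs)
```

If the retriever returns nothing, `answer_fn` still runs on an empty context. An agentic loop would instead notice the empty result and try a different query.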

The ReAct Pattern

ReAct (Reasoning + Acting) interleaves reasoning traces with action executions:

Thought: I need to find information about X
Action: retrieve(query=X)
Observation: Retrieved 5 documents
Thought: Document 3 mentions Y, I need more details about Y
Action: retrieve(query=Y specifically)
Observation: Retrieved additional documents
Thought: Now I have enough information to answer the original question
Final Answer: ...

A LangChain implementation of this loop, using the text-based ReAct agent:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain.agents import AgentExecutor, create_react_agent
from langchain_core.tools import tool

# `vectorstore` is assumed to be an already-built LangChain vector store.

@tool
def retrieve_documents(query: str, k: int = 5) -> str:
    """Retrieve relevant documents from the knowledge base.
    
    Args:
        query: Search query
        k: Number of documents to retrieve (default 5)
    
    Returns:
        Retrieved document contents as a string
    """
    results = vectorstore.similarity_search(query, k=k)
    return "\n---\n".join([f"[Document {i+1}]:\n{doc.page_content}" 
                           for i, doc in enumerate(results)])

@tool
def rewrite_query(query: str) -> str:
    """Rewrite query to better match document vocabulary."""
    # Implementation from Chapter 6
    ...

class ReActRetriever:
    def __init__(self, tools, llm):
        self.tools = tools
        self.llm = llm
        
        # create_react_agent requires {tools}, {tool_names}, and
        # {agent_scratchpad} in the prompt.
        prompt = PromptTemplate.from_template("""
You are a research assistant. Your goal is to answer user questions by 
retrieving information from the knowledge base.

You have access to these tools:
{tools}

Follow this format:
Thought: [what you're thinking about next]
Action: [tool name, one of: {tool_names}]
Action Input: [input to the tool]
Observation: [result from the tool]
... (repeat Thought/Action/Observation as needed)
Final Answer: [your final answer]

Question: {input}
Thought:{agent_scratchpad}""")
        
        # The prompt uses the text ReAct format, so build a ReAct agent
        # rather than an OpenAI-functions agent.
        self.agent = create_react_agent(llm, self.tools, prompt)
    
    def retrieve(self, query, max_steps=10):
        """
        Agentic retrieval with self-correction.
        
        Args:
            query: User question
            max_steps: Maximum retrieval steps before forcing answer
        """
        # max_iterations is an AgentExecutor setting, not an invoke() option
        executor = AgentExecutor(
            agent=self.agent,
            tools=self.tools,
            max_iterations=max_steps,
            return_intermediate_steps=True,
            handle_parsing_errors=True,
            verbose=True,
        )
        try:
            result = executor.invoke({"input": query})
            # Each step is an (AgentAction, observation) pair
            steps = result.get('intermediate_steps', [])
            return {
                'answer': result['output'],
                'steps': steps,
                'retrieval_count': sum(
                    1 for action, _ in steps
                    if action.tool == "retrieve_documents"
                )
            }
        except Exception as e:
            return {
                'answer': f"Error during retrieval: {str(e)}",
                'steps': [],
                'retrieval_count': 0
            }
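
Stripped of any framework, the Thought/Action/Observation loop can be sketched in a few lines. This is a minimal illustration, not a library API: `llm` is assumed to be any callable that takes the transcript so far and returns the next chunk of text, and `tools` is a plain dict of name to function.

```python
import re

# Matches "Action: <tool>" followed by "Action Input: <input>"
ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", re.DOTALL)


def react_loop(question, llm, tools, max_steps=5):
    """Run Thought/Action/Observation until 'Final Answer:' or max_steps.

    Returns (answer, steps), where steps is a list of
    (tool_name, tool_input, observation) tuples. answer is None if the
    step budget runs out first.
    """
    transcript = f"Question: {question}\n"
    steps = []
    for _ in range(max_steps):
        output = llm(transcript)
        transcript += output + "\n"
        if "Final Answer:" in output:
            return output.split("Final Answer:", 1)[1].strip(), steps
        match = ACTION_RE.search(output)
        if match is None:
            # Feed the formatting problem back so the model can correct itself
            transcript += "Observation: could not parse an action.\n"
            continue
        name, tool_input = match.group(1).strip(), match.group(2).strip()
        tool = tools.get(name)
        observation = tool(tool_input) if tool else f"unknown tool {name!r}"
        steps.append((name, tool_input, observation))
        transcript += f"Observation: {observation}\n"
    return None, steps
```

Because each observation is appended to the transcript, the next model call sees every earlier result, which is what lets the agent chain and correct its retrievals.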

Multi-Hop Retrieval

Multi-hop questions require chaining multiple retrievals where each step depends on previous results:

def multi_hop_agent(query, vectorstore, llm):
    """
    Multi-hop retrieval that chains queries based on intermediate results.
    """
    # Parse the question to identify hops
    hop_plan = llm.invoke(f"""Analyze this question and identify the retrieval hops needed.
Each hop should be answerable by retrieving a single document or set of documents.

Question: {query}

Breakdown:""")
    
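    # Parse planned hops. parse_hops, substitute_context, and format_documents
    # are helper functions not shown here; parse_hops is expected to return a
    # list of dicts with 'topic' and 'query' keys.
    hops = parse_hops(hop_plan.content)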
    
    context = ""
    hop_results = []
    
    for i, hop in enumerate(hops):
        # Substitute context from previous hops into current query
        current_query = substitute_context(hop['query'], context)
        
        # Execute retrieval
        docs = vectorstore.similarity_search(current_query, k=5)
        hopped_context = format_documents(docs)
        
        context += f"\n\n[Hop {i+1}: {hop['topic']}]\n{hopped_context}"
        hop_results.append({
            'hop': i+1,
            'query': current_query,
            'results': docs
        })
    
    return {
        'context': context,
        'hop_results': hop_results
    }

# Example multi-hop question:
# "Who approved the contract with the vendor and what was the total value?"
# Hop 1: Find the vendor contract
# Hop 2: Identify who approved it
# Hop 3: Extract the contract value
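
The `parse_hops` and `substitute_context` helpers are not defined in this chapter. One possible shape for them is sketched below. It assumes the planning prompt is extended to ask for one hop per line in the form `N. topic | query`; that line format, the `max_chars` parameter, and the strategy of appending a trimmed context tail are all illustrative assumptions.

```python
import re

# Assumed plan line format: "1. vendor contract | find the vendor contract"
HOP_LINE = re.compile(r"^\s*\d+[.)]\s*(?P<topic>[^|]+?)\s*\|\s*(?P<query>.+?)\s*$")


def parse_hops(plan_text):
    """Return a list of {'topic', 'query'} dicts, skipping non-matching lines."""
    hops = []
    for line in plan_text.splitlines():
        match = HOP_LINE.match(line)
        if match:
            hops.append({"topic": match.group("topic"),
                         "query": match.group("query")})
    return hops


def substitute_context(hop_query, context, max_chars=200):
    """Append a trimmed tail of earlier hop context to the hop query."""
    if not context.strip():
        return hop_query
    # Collapse whitespace, then keep only the most recent characters
    tail = " ".join(context.split())[-max_chars:]
    return f"{hop_query} (context: {tail})"
```

Appending raw context is the crudest option. A sturdier variant would ask the LLM to rewrite each hop query using entities found in the previous hop's documents.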

Self-Correction Loop

Agentic retrieval can detect and correct failures:

def self_correcting_retrieval(query, vectorstore, llm, max_attempts=3):
    """
    Retrieval with automatic self-correction.
    """
    current_query = query
    all_retrieved = []
    attempts = 0
    
    for attempts in range(1, max_attempts + 1):
        # Retrieve with the current (possibly rewritten) query
        results = vectorstore.similarity_search(current_query, k=10)
        new_docs = [doc for doc in results if doc not in all_retrieved]
        all_retrieved.extend(new_docs)
        
        # Check sufficiency against everything gathered so far,
        # not only the documents from this attempt
        sufficiency_check = llm.invoke(f"""
Given these retrieved documents:
{format_documents(all_retrieved)}

And the original question: {query}

Is this sufficient to answer the question? If not, what information is missing?
What new query would retrieve the missing information?

Answer format:
Sufficient: YES/NO
Missing information: [description or N/A]
New query: [query or N/A]
""")
        
        # Case-insensitive match, since models vary in capitalization
        if "sufficient: yes" in sufficiency_check.content.lower():
            return {
                'documents': all_retrieved,
                'attempts': attempts,
                'sufficient': True
            }
        
        # Parse new query for retry (extract_new_query is a helper not shown here)
        modified_query = extract_new_query(sufficiency_check.content)
        
        if not modified_query:
            break
        current_query = modified_query
    
    return {
        'documents': all_retrieved,
        'attempts': attempts,
        'sufficient': False
    }
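
The loop relies on parsing the model's structured reply, and the `extract_new_query` helper is not shown in this chapter. A minimal sketch of such a parser follows, assuming the model sticks to the `Sufficient / Missing information / New query` answer format from the prompt. `parse_sufficiency` is a hypothetical name introduced here for illustration.

```python
import re


def parse_sufficiency(text):
    """Parse the sufficiency reply into (sufficient: bool, new_query or None)."""
    def field(label):
        # Find "<label>: value" on its own line, ignoring case
        match = re.search(rf"^{label}:\s*(.*)$", text,
                          re.IGNORECASE | re.MULTILINE)
        return match.group(1).strip() if match else ""

    sufficient = field("Sufficient").upper().startswith("YES")
    new_query = field("New query")
    if new_query.upper() in {"", "N/A", "NONE"}:
        new_query = None
    return sufficient, new_query


def extract_new_query(text):
    """Return the proposed retry query, or None if the model gave none."""
    return parse_sufficiency(text)[1]
```

Returning `None` for a missing or `N/A` query gives the loop a clean signal to stop retrying instead of searching on a placeholder string.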

Tool Use Efficiency

Agentic retrieval can be expensive because the agent may make many retrieval calls. Optimize by:

Caching retrieval results: Store embeddings and BM25 scores for common sub-queries.
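
As a rough sketch of the caching idea, the wrapper below memoizes whole result lists per normalized query instead of individual embeddings or BM25 scores. `CachedRetriever` and its `search_fn` argument are hypothetical names; `search_fn` is assumed to have the same call shape as `vectorstore.similarity_search(query, k=...)`.

```python
class CachedRetriever:
    """Memoize retrieval results keyed on a normalized query string."""

    def __init__(self, search_fn, k=5):
        self.search_fn = search_fn
        self.k = k
        self.cache = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query):
        # Lowercase and collapse whitespace so trivial variants share an entry
        return " ".join(query.lower().split())

    def search(self, query):
        key = self._key(query)
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        results = self.search_fn(query, k=self.k)
        self.cache[key] = results
        return results
```

Agents often reissue the same sub-query across steps, so even this exact-match cache can remove a noticeable share of calls. The cache is unbounded here; a long-running service would want an eviction policy.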

Parallel retrieval: When multiple independent hops are identified, execute them concurrently.

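import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

logger = logging.getLogger(__name__)

def parallel_retrieval(queries, vectorstore):
    """Execute multiple retrieval queries in parallel."""
    results = {}
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Map each future back to the query that produced it
        futures = {
            executor.submit(vectorstore.similarity_search, q, k=5): q 
            for q in queries
        }
        # Collect results as each search finishes
        for future in as_completed(futures):
            query = futures[future]
            try:
                results[query] = future.result()
            except Exception as e:
                # A failed query yields an empty list instead of aborting the batch
                results[query] = []
                logger.error("Retrieval failed for %s: %s", query, e)
    return results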

Early stopping: If confidence is high after a few retrieval steps, stop early rather than continuing to max_steps.
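
One way to sketch early stopping is to treat retrieval scores as the confidence signal. Everything here is an assumption for illustration: `search_with_scores` is a hypothetical callable returning `(doc, score)` pairs where higher scores mean better matches, and the `threshold` and `min_hits` defaults are arbitrary.

```python
def retrieve_until_confident(queries, search_with_scores,
                             threshold=0.8, min_hits=2, max_steps=10):
    """Run queries in order; stop once enough results clear the threshold.

    Returns (collected_docs, steps_used).
    """
    collected = []
    steps = 0
    for query in queries[:max_steps]:
        steps += 1
        for doc, score in search_with_scores(query):
            # Keep only confident, not-yet-seen documents
            if score >= threshold and doc not in collected:
                collected.append(doc)
        if len(collected) >= min_hits:
            break  # confident enough; skip the remaining queries
    return collected, steps
```

Score scales differ between vector stores, and some return distances where lower is better, so the comparison would need adjusting for a real backend.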

EXERCISE

Implement a simple ReAct agent that can handle a multi-hop question like "What was the total budget for the project approved by Dr. Smith, and what percentage of the IT budget does it represent?" Track the number of retrieval steps and check the final answers against ground-truth answers.

This concludes Part 1 (Chapters 1-11). Part 2 (Chapters 12-22) continues with advanced agentic patterns, evaluation metrics, optimization strategies, and production deployment considerations.
