What this does

Reflection lets an agent critique its own output, identify errors or gaps, and retry with improved reasoning. Self-correction can reduce hallucinations and improve answer quality without human intervention.

Steps

Add a reflection step after the initial answer. Ask the LLM to critique its own response.

def reflect(answer: str, context: str, llm) -> str:
    prompt = f"""Context: {context}

Initial answer: {answer}

Critique this answer. Identify any:
1. Factual errors or hallucinations
2. Missing information
3. Unclear reasoning
4. Unsupported claims

Provide a revised, improved answer:"""
    return llm.invoke(prompt).content
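
To try the reflection step without calling a real model, a small stub can stand in for the LLM. The sketch below is only an illustration: `FakeLLM` and `FakeMessage` are hypothetical names, and the only assumption about the model interface is the one the guide already uses, an `invoke()` method whose result has a `.content` attribute. The `reflect` function here is a condensed copy of the step above.

```python
from dataclasses import dataclass, field


@dataclass
class FakeMessage:
    """Mimics a chat-model reply that exposes its text as `.content`."""
    content: str


@dataclass
class FakeLLM:
    """Hypothetical stand-in for a chat model; returns canned replies in order."""
    replies: list
    prompts: list = field(default_factory=list)

    def invoke(self, prompt: str) -> FakeMessage:
        self.prompts.append(prompt)  # record what was asked, for inspection
        return FakeMessage(self.replies.pop(0))


def reflect(answer: str, context: str, llm) -> str:
    # Condensed version of the reflection step, for illustration only.
    prompt = (
        f"Context: {context}\n\n"
        f"Initial answer: {answer}\n\n"
        "Critique this answer, then provide a revised, improved answer:"
    )
    return llm.invoke(prompt).content


llm = FakeLLM(replies=["The capital of Australia is Canberra."])
revised = reflect("The capital of Australia is Sydney.", "Geography quiz", llm)
print(revised)
```

Recording the prompts makes it easy to check that the initial answer and context actually reach the model before wiring in a real one.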

Implement the reflect-retry loop. Repeat until the critique score reaches the threshold (8 here) or the maximum number of reflections is reached.

def reflect_and_correct(question: str, context: str, llm, max_reflections=3) -> str:
    answer = llm.invoke(f"Context: {context}\nQuestion: {question}\nAnswer:").content

    for _ in range(max_reflections):
        # Give the critic the same context and question the answer was based on
        critique = llm.invoke(f"""Context: {context}
Question: {question}
Answer: {answer}

Critique this answer. Rate it 1-10. If < 8, explain what's wrong and provide an improved version
after the marker 'Improved answer:'.
Start your response with 'SCORE: N'.""").content

        score = extract_score(critique)
        if score is not None and score >= 8:
            return answer

        # Use the improved answer; keep the current one if the marker is missing
        improved = extract_improved(critique)
        if improved:
            answer = improved

    return answer
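
When tuning the loop it helps to see how the score moves from round to round. The following is a minimal sketch of the same loop that also returns the score trace. `ScriptedLLM` and `reflect_with_trace` are hypothetical names introduced for this example, and the scripted replies are made-up test data, not real model output.

```python
import re
from types import SimpleNamespace


class ScriptedLLM:
    """Hypothetical stub: returns canned replies in order via .invoke()."""

    def __init__(self, replies):
        self._replies = iter(replies)

    def invoke(self, prompt):
        return SimpleNamespace(content=next(self._replies))


def reflect_with_trace(question, llm, max_reflections=3, threshold=8):
    """Run the reflect-retry loop and also return the score seen each round."""
    answer = llm.invoke(f"Question: {question}\nAnswer:").content
    scores = []
    for _ in range(max_reflections):
        critique = llm.invoke(
            f"Question: {question}\nAnswer: {answer}\n\n"
            "Rate the answer 1-10. Start with 'SCORE: N'. If N < 8, "
            "end with 'Improved answer:' followed by a rewrite."
        ).content
        match = re.search(r"SCORE:\s*(\d+)", critique)
        score = int(match.group(1)) if match else None
        scores.append(score)
        if score is not None and score >= threshold:
            break  # good enough, stop reflecting
        if "Improved answer:" in critique:
            answer = critique.split("Improved answer:", 1)[1].strip()
    return answer, scores


llm = ScriptedLLM([
    "Water boils at 90 C.",  # first draft
    "SCORE: 4\nWrong value.\nImproved answer: Water boils at 100 C at sea level.",
    "SCORE: 9\nAccurate and qualified.",
])
final, trace = reflect_with_trace("At what temperature does water boil?", llm)
print(final)  # Water boils at 100 C at sea level.
print(trace)  # [4, 9]
```

A trace that never rises, or that starts at 9 every time, is a hint that the critique prompt needs work rather than more retries.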

Extract scores and improved answers from reflection output.

import re

def extract_score(reflection: str) -> int | None:
    match = re.search(r'SCORE:\s*(\d+)', reflection)
    return int(match.group(1)) if match else None

def extract_improved(reflection: str) -> str | None:
    # Return everything after the first "Improved answer:" marker, if present
    marker = "Improved answer:"
    if marker in reflection:
        return reflection.split(marker, 1)[1].strip()
    return None
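
Model output rarely follows the requested format exactly: the score may arrive as "Score: 7/10", and the marker may be lowercased. A more tolerant parser could look like the sketch below. `parse_score` and `parse_improved` are hypothetical helper names, and the 1-10 range check is an assumption based on the rating scale used in the loop prompt.

```python
import re
from typing import Optional

SCORE_RE = re.compile(r"score\s*[:=]\s*(\d+)", re.IGNORECASE)
IMPROVED_RE = re.compile(r"improved answer\s*:\s*(.*)", re.IGNORECASE | re.DOTALL)


def parse_score(reflection: str, low: int = 1, high: int = 10) -> Optional[int]:
    """Return the first score found, or None if missing or out of range."""
    match = SCORE_RE.search(reflection)
    if not match:
        return None
    value = int(match.group(1))
    return value if low <= value <= high else None


def parse_improved(reflection: str) -> Optional[str]:
    """Return the text after the first 'Improved answer:' marker, or None."""
    match = IMPROVED_RE.search(reflection)
    if not match:
        return None
    text = match.group(1).strip()
    return text or None  # treat an empty rewrite as missing


print(parse_score("Score: 7/10"))  # 7
print(parse_score("SCORE: 42"))    # None (outside 1-10)
print(parse_improved("SCORE: 5\nimproved answer: Paris is the capital."))
```

Rejecting out-of-range scores keeps a stray number in the critique from being mistaken for a passing rating.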

Add self-verification of tool results. After calling a tool, verify the result is reasonable.

def verify_tool_result(tool_name: str, args: dict, result: str, llm) -> bool:
    prompt = f"""Tool: {tool_name}
Arguments: {args}
Result: {result[:500]}

Is this result reasonable? Answer only YES or NO."""
    response = llm.invoke(prompt).content.strip().upper()
    # Accept variants such as "Yes." or "YES, the value is plausible"
    return response.startswith("YES")
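
A YES/NO check has a third outcome in practice: the model may answer with something that is neither. The sketch below maps the reply to `True`, `False`, or `None` so the caller can decide how to treat an unclear verdict, for example by retrying the tool call. `parse_verdict` is a hypothetical helper, not part of any library.

```python
import re
from typing import Optional


def parse_verdict(response: str) -> Optional[bool]:
    """Map a free-text reply to True (YES), False (NO), or None (unclear)."""
    # Skip leading punctuation or markdown, then require a whole word yes/no.
    match = re.match(r"\W*(yes|no)\b", response.strip(), re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"


print(parse_verdict("Yes."))                       # True
print(parse_verdict("NO, the total is negative"))  # False
print(parse_verdict("Not sure"))                   # None
```

Treating an unclear reply as "unknown" rather than "NO" avoids discarding a valid tool result just because the verifier was verbose.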

Create a reflection tool the agent can call autonomously.

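# Assumes LangChain's tool decorator and a chat model `llm` defined at module level
from langchain_core.tools import tool

@tool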
def reflect_on_work(work_product: str, criteria: str = "") -> str:
    """Reflect on and critique your own work product."""
    prompt = f"""Work product: {work_product}
Criteria: {criteria}

Identify issues and suggest improvements. Provide a revised version."""
    return llm.invoke(prompt).content

Verification

python -c "
import re
def extract_score(text):
    m = re.search(r'SCORE:\s*(\d+)', text)
    return int(m.group(1)) if m else None
print(extract_score('SCORE: 8'))
# Expected: 8
"

Common failures

Reflection confirms incorrect answers. The LLM may agree with its own mistake instead of critiquing it. Use a separate "critic" model, or run the critique pass at a different temperature than the one used to generate the answer.
Infinite correction loop. The agent keeps finding new issues and never finalizes. Cap reflections at a small number (2-3).
Score inflation. The LLM consistently rates itself 9/10 regardless of quality. Use absolute criteria (e.g., "Does the answer cite sources?") instead of subjective scoring.
Version mismatch. The installed package or runtime differs from the one the commands assume. Check the version first, then rerun the smallest verification command.
Local environment drift. A different service, virtual environment, model, or path is being used than intended. Print the active binary path and configuration before changing the guide steps.
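
The score-inflation and self-agreement failures above both point toward a critic that answers concrete YES/NO questions instead of producing a 1-10 rating. A minimal sketch of that idea follows. `CRITERIA`, `checklist_review`, `passes`, and `stub_critic` are hypothetical names, the two criteria are placeholders, and the critic is passed in as a plain callable so that a separate model can be used for it.

```python
from typing import Callable, Dict, Sequence

# Hypothetical checklist; replace with criteria that fit your task.
CRITERIA = (
    "Does the answer cite at least one source?",
    "Does the answer address the question directly?",
)


def checklist_review(answer: str, critic: Callable[[str], str],
                     criteria: Sequence[str] = CRITERIA) -> Dict[str, bool]:
    """Ask the critic one YES/NO question per criterion."""
    results = {}
    for criterion in criteria:
        reply = critic(f"Answer: {answer}\n\n{criterion} Reply YES or NO.")
        results[criterion] = reply.strip().upper().startswith("YES")
    return results


def passes(results: Dict[str, bool]) -> bool:
    """The answer is accepted only if every criterion was met."""
    return all(results.values())


def stub_critic(prompt: str) -> str:
    # Stand-in for a real critic model: looks for a bracketed citation like [1].
    if "cite" in prompt:
        return "YES" if "[1]" in prompt else "NO"
    return "YES"


print(passes(checklist_review("Paris is the capital of France [1].", stub_critic)))  # True
print(passes(checklist_review("Paris is the capital of France.", stub_critic)))      # False
```

With a model that follows the `invoke()` interface used in this guide, the critic could be wrapped as `lambda p: critic_llm.invoke(p).content`, where `critic_llm` is whichever second model you choose.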

Related guides

How to Create Agent Decision-Making Logic
How to Build Error Handling in Agent Pipelines