What this does

Debugging agent reasoning involves inspecting the LLM's chain-of-thought, why it selected specific tools, and how it interpreted tool results. This helps fix incorrect tool choices, loops, and hallucinated arguments.

Steps

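Enable chain-of-thought logging from the LLM. Set temperature to 0 and log the raw response for every call, including token usage, finish reason, and tool calls. The helper below assumes an OpenAI-style chat completion object (response.choices[0].message), where each tool call's arguments arrive as a JSON string.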

import json

def debug_llm_response(messages: list, response) -> dict:
    return {
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "finish_reason": response.choices[0].finish_reason,
        "tool_calls": [
            {
                "name": tc.function.name,
                "args": tc.function.arguments,
                "id": tc.id
            }
            for tc in (response.choices[0].message.tool_calls or [])
        ],
        "content": response.choices[0].message.content
    }
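
Because the logged args value is a raw JSON string, it is worth checking it before trusting it. The following is a minimal sketch for spotting malformed or hallucinated arguments. The function name check_tool_args and the allowed_keys parameter are illustrative assumptions, not part of any SDK.

```python
import json

def check_tool_args(raw_args: str, allowed_keys: set[str]) -> dict:
    """Parse a tool call's JSON argument string and flag unexpected keys."""
    try:
        parsed = json.loads(raw_args)
    except json.JSONDecodeError as exc:
        # The model produced something that is not valid JSON at all
        return {"ok": False, "args": None, "problems": [f"invalid JSON: {exc.msg}"]}
    if not isinstance(parsed, dict):
        return {"ok": False, "args": parsed, "problems": ["arguments are not a JSON object"]}
    # Keys the tool does not declare are a common sign of hallucinated arguments
    unexpected = sorted(set(parsed) - allowed_keys)
    problems = [f"unexpected argument: {k}" for k in unexpected]
    return {"ok": not problems, "args": parsed, "problems": problems}

# Hypothetical tool that only accepts a "query" argument
report = check_tool_args('{"query": "weather", "lang": "en"}', allowed_keys={"query"})
print(report["problems"])  # ['unexpected argument: lang']
```

Running this check on each logged tool call separates "the model chose the wrong tool" from "the model chose the right tool with bad arguments".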

Log the full message history at each turn. Capture what the agent sees.

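import logging

log = logging.getLogger(__name__)

def log_state(messages: list, turn: int):
    """Log a preview of every message the agent sees on this turn."""
    log.info(f"=== Turn {turn} ===")
    for i, msg in enumerate(messages):
        role = msg["role"]
        # content can be None on assistant messages that only carry tool calls
        content_preview = str(msg.get("content") or "")[:200]
        log.info(f"  [{i}] {role}: {content_preview}")
        # Assumes OpenAI-style dict messages, so each tool call is a dict too
        for tc in msg.get("tool_calls") or []:
            fn = tc["function"]
            log.info(f"       -> Tool: {fn['name']}({fn['arguments']})")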

Create a decision audit record. Track every tool choice with context.

class DecisionAudit:
    def __init__(self):
        self.entries = []

    def record(self, turn: int, tool_name: str, args: dict, reason: str, result: dict):
        self.entries.append({
            "turn": turn,
            "tool": tool_name,
            "args": args,
            "reason": reason,
            "result_success": "error" not in result,
            "result_preview": str(result)[:100]
        })

    def replay(self):
        for e in self.entries:
            print(f"Turn {e['turn']}: {e['tool']} → {'OK' if e['result_success'] else 'FAIL'}")
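
Audit entries also make loops easy to spot. As a rough sketch, the helper below counts identical tool and argument pairs and reports any that repeat. It works on plain dicts shaped like the entries recorded above. The name find_repeated_calls and the threshold of 3 are assumptions chosen for illustration.

```python
import json
from collections import Counter

def find_repeated_calls(entries: list[dict], threshold: int = 3) -> list[tuple[str, str, int]]:
    """Return (tool, serialized_args, count) for calls repeated at least `threshold` times."""
    counts = Counter(
        # sort_keys makes argument dicts with different key order compare equal
        (e["tool"], json.dumps(e["args"], sort_keys=True))
        for e in entries
    )
    return [(tool, args, n) for (tool, args), n in counts.items() if n >= threshold]

# Sample entries with the same "turn", "tool", and "args" keys as the audit record
entries = [
    {"turn": 0, "tool": "search_web", "args": {"q": "python"}},
    {"turn": 1, "tool": "search_web", "args": {"q": "python"}},
    {"turn": 2, "tool": "calculate", "args": {"expr": "1+1"}},
    {"turn": 3, "tool": "search_web", "args": {"q": "python"}},
]
print(find_repeated_calls(entries))
# [('search_web', '{"q": "python"}', 3)]
```

A repeated call with identical arguments usually means the model did not register the earlier result, so the next thing to inspect is what the tool message for that call actually contained.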

Simulate with fixed inputs. Reproduce issues by replaying the same prompt.

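import json
from typing import Optional

def replay_agent(history: list, tool_calls_to_override: Optional[dict] = None):
    """Replay an agent session with optional tool result overrides.

    Assumes `llm` is a LangChain chat model with tools bound and that
    `execute_tool(name, args)` is defined elsewhere in your agent.
    """
    overrides = tool_calls_to_override or {}
    messages = list(history)  # copy so the caller's history is not extended
    for turn in range(5):
        response = llm.invoke(messages)
        if not response.tool_calls:
            return response.content
        # The assistant message must precede the tool results that answer it
        messages.append(response)
        for tc in response.tool_calls:
            # LangChain tool calls are dicts: "name", "args" (already parsed), "id"
            if tc["name"] in overrides:
                result = overrides[tc["name"]]
            else:
                result = execute_tool(tc["name"], tc["args"])
            messages.append({"role": "tool", "tool_call_id": tc["id"], "content": json.dumps(result)})
    return "Max turns"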
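
The replay function relies on an execute_tool helper that this guide does not define. One possible shape for it, as a sketch under stated assumptions, is a registry lookup that returns failures as data so the audit record can flag them. The TOOLS registry and the calculate tool here are hypothetical placeholders.

```python
from typing import Any, Callable

def calculate(a: float, b: float) -> dict:
    """Placeholder tool used only for this illustration."""
    return {"sum": a + b}

# Hypothetical registry mapping tool names to Python callables
TOOLS: dict[str, Callable[..., Any]] = {"calculate": calculate}

def execute_tool(name: str, args: dict) -> dict:
    """Run a registered tool and return errors as data instead of raising."""
    tool = TOOLS.get(name)
    if tool is None:
        # The model asked for a tool that does not exist
        return {"error": f"unknown tool: {name}"}
    try:
        return tool(**args)
    except Exception as exc:
        # Wrong or missing arguments surface here rather than crashing the replay
        return {"error": f"{type(exc).__name__}: {exc}"}

print(execute_tool("calculate", {"a": 2, "b": 3}))  # {'sum': 5}
print("error" in execute_tool("lookup", {}))        # True
```

Returning an "error" key keeps the replay loop running and gives the model a chance to recover, which is often the behavior you want to observe while debugging.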

Compare tool selections between models. Run the same prompt on different models and diff the outputs.

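from typing import Optional

from langchain_ollama import ChatOllama  # assumes the langchain-ollama package

def compare_models(prompt: str, models: list[str], tools: Optional[list] = None):
    """Run one prompt against several Ollama models and collect their tool choices."""
    results = {}
    for model in models:
        llm = ChatOllama(model=model, temperature=0)
        if tools:
            # Without bound tools the model cannot emit tool calls
            llm = llm.bind_tools(tools)
        response = llm.invoke(prompt)
        results[model] = {
            # AIMessage.tool_calls is a list of dicts keyed by "name", "args", "id"
            "tool_calls": [tc["name"] for tc in (response.tool_calls or [])],
            # The stop reason lives in response_metadata; the key varies by provider
            "finish_reason": response.response_metadata.get("done_reason"),
        }
    return results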
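
The comparison returns raw results but does not diff them. A simple way to do that, sketched here with a hypothetical helper named diff_tool_selections, is to group models by the tool sequence they chose. The model names in the sample data are placeholders.

```python
def diff_tool_selections(results: dict[str, dict]) -> dict:
    """Group models by the sequence of tools they selected."""
    groups: dict[tuple, list[str]] = {}
    for model, info in results.items():
        # Tuples are hashable, so the tool sequence can serve as a dict key
        groups.setdefault(tuple(info["tool_calls"]), []).append(model)
    return {"agree": len(groups) <= 1, "groups": groups}

# Sample input in the shape produced by the comparison step
results = {
    "model-a": {"tool_calls": ["search_web"], "finish_reason": "stop"},
    "model-b": {"tool_calls": ["search_web"], "finish_reason": "stop"},
    "model-c": {"tool_calls": ["calculate"], "finish_reason": "stop"},
}
summary = diff_tool_selections(results)
print(summary["agree"])  # False
for calls, models in summary["groups"].items():
    print(list(calls), models)
```

When the models disagree, the smallest group is usually the best place to start reading logs.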

Verification

python -c "
audit = {'entries': []}
for i in range(3):
    audit['entries'].append({'turn': i, 'tool': 'search_web' if i % 2 == 0 else 'calculate'})
print(len(audit['entries']))
# Expected: 3
"

Common failures

Reasoning hidden by the model. Some models don't expose chain-of-thought. Use a model with thinking or reasoning output when available.
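Tool result truncation hides issues. If the agent truncates tool results before passing them to the LLM (for example, to the first 500 characters), the model can miss critical data. Log the full result separately.
Non-deterministic behavior. A temperature above 0 can produce different tool choices on each run, which makes failures hard to reproduce. Set temperature to 0 during debugging sessions. Some backends still vary slightly between runs, so also fix a seed where the API supports one.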
Version mismatch. The installed package or runtime differs from the one the command shown assumes. Check the version first, then rerun the smallest verification command.
Local environment drift. Another service, virtual environment, model, or path is being used. Print the active binary path and configuration before changing the guide steps.
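
For the truncation failure above, one option is to log the full tool result before shortening it and to tell the model that text was cut. This is a minimal sketch. The logger name, the truncate_for_llm function, and the 500-character default are assumptions for illustration.

```python
import logging

log = logging.getLogger("agent.debug")

def truncate_for_llm(result: str, limit: int = 500) -> str:
    """Log the full tool result, then return the possibly shortened text for the LLM."""
    # The complete result always goes to the debug log
    log.debug("full tool result (%d chars): %s", len(result), result)
    if len(result) <= limit:
        return result
    dropped = len(result) - limit
    log.warning("tool result truncated: %d chars dropped", dropped)
    # Make the truncation visible to the model instead of cutting silently
    return result[:limit] + f"\n[truncated {dropped} characters]"

shortened = truncate_for_llm("x" * 600)
print(len(shortened))
print(shortened.splitlines()[-1])  # [truncated 100 characters]
```

With the marker in place, a model that ignores missing data is distinguishable in the logs from a model that never saw it.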

Related guides

How to Implement Logging for Agent Debugging
How to Test Agent Behavior with Unit Tests