RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

© 2026 runlocalai.co · Independently operated
HOW-TO · RAG

How to Debug Agent Reasoning and Tool Selection

advanced·20 min·By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Agent with tool use, verbose logging enabled, Python 3.10+

What this does

Debugging agent reasoning means inspecting three things: the LLM's chain-of-thought, the reason it selected each tool, and how it interpreted the tool results. This helps you fix incorrect tool choices, loops, and hallucinated arguments.

Steps

  • Enable chain-of-thought logging from the LLM. Set temperature to 0 and log raw responses.
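import json

def debug_llm_response(messages: list, response) -> dict:
    """Summarize one raw OpenAI-style chat completion for the debug log."""
    message = response.choices[0].message
    return {
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        # "tool_calls" means the model asked for a tool; "length" means it was cut off
        "finish_reason": response.choices[0].finish_reason,
        "tool_calls": [
            {
                "name": tc.function.name,
                # Raw JSON string exactly as the model produced it, not yet parsed
                "args": tc.function.arguments,
                "id": tc.id
            }
            for tc in (message.tool_calls or [])
        ],
        # Often None when the model only emits tool calls
        "content": message.content
    }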
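The `args` value logged above is the raw string the model produced, so malformed JSON and invented parameters only surface once you parse it. The sketch below is one way to check that string before a tool runs. The helper name `parse_tool_args` and the `allowed_keys` parameter are hypothetical, not part of any library.

```python
import json

def parse_tool_args(raw_args, allowed_keys):
    """Return (args, problems) for one tool call's raw argument payload.

    `allowed_keys` is a hypothetical set of parameter names the tool accepts.
    """
    try:
        # Tool arguments usually arrive as a JSON string; accept a dict as well
        args = json.loads(raw_args) if isinstance(raw_args, str) else dict(raw_args)
    except (TypeError, ValueError) as exc:
        return {}, [f"arguments are not valid JSON: {exc}"]
    if not isinstance(args, dict):
        return {}, [f"expected a JSON object, got {type(args).__name__}"]

    problems = []
    # Keys the tool does not declare are a common sign of hallucinated arguments
    unknown = sorted(set(args) - set(allowed_keys))
    if unknown:
        problems.append(f"unknown argument(s): {', '.join(unknown)}")
    return args, problems
```

Logging the returned `problems` list next to the raw string keeps both the model's exact output and the diagnosis in one place.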
  • Log the full message history at each turn. Capture what the agent sees.
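import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent.debug")

def log_state(messages: list, turn: int):
    """Log a preview of every message the agent sees at this turn."""
    log.info(f"=== Turn {turn} ===")
    for i, msg in enumerate(messages):
        role = msg["role"]
        # content can be None on assistant messages that only carry tool calls
        content_preview = str(msg.get("content") or "")[:200]
        log.info(f"  [{i}] {role}: {content_preview}")
        # History entries are plain dicts, so tool calls are read by key
        for tc in msg.get("tool_calls") or []:
            fn = tc["function"]
            log.info(f"       -> Tool: {fn['name']}({fn['arguments']})")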
  • Create a decision audit record. Track every tool choice with context.
class DecisionAudit:
    def __init__(self):
        self.entries = []

    def record(self, turn: int, tool_name: str, args: dict, reason: str, result: dict):
        self.entries.append({
            "turn": turn,
            "tool": tool_name,
            "args": args,
            "reason": reason,
            "result_success": "error" not in result,
            "result_preview": str(result)[:100]
        })

    def replay(self):
        for e in self.entries:
            print(f"Turn {e['turn']}: {e['tool']} → {'OK' if e['result_success'] else 'FAIL'}")
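Once audit entries exist, loops become easy to spot: the same tool called with the same arguments several times. The sketch below assumes entries shaped like the ones `DecisionAudit.record` appends (a `tool` and an `args` key). The function name `find_repeated_calls` and the default `threshold` of 3 are illustrative choices, not a standard.

```python
import json
from collections import Counter

def find_repeated_calls(entries, threshold=3):
    """Return {(tool, canonical_args): count} for calls seen at least `threshold` times."""
    counts = Counter(
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} compare equal
        (e["tool"], json.dumps(e.get("args", {}), sort_keys=True, default=str))
        for e in entries
    )
    return {key: n for key, n in counts.items() if n >= threshold}

# Example entries, made up for illustration
entries = [
    {"turn": 0, "tool": "search_web", "args": {"q": "rtx 4090 vram"}},
    {"turn": 1, "tool": "search_web", "args": {"q": "rtx 4090 vram"}},
    {"turn": 2, "tool": "calculate", "args": {"expr": "24 * 2"}},
    {"turn": 3, "tool": "search_web", "args": {"q": "rtx 4090 vram"}},
]
for (tool, args), n in find_repeated_calls(entries).items():
    print(f"Possible loop: {tool}({args}) called {n} times")
```

Comparing canonical JSON rather than raw strings avoids missing a loop just because the model reordered the keys.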
  • Simulate with fixed inputs. Reproduce issues by replaying the same prompt.
def replay_agent(history: list, tool_calls_to_override: dict | None = None):
    """Replay an agent session with optional tool result overrides.

    Assumes `llm` is a LangChain chat model with tools already bound and
    `execute_tool` is your own dispatcher; neither is defined in this guide.
    """
    overrides = tool_calls_to_override or {}
    messages = list(history)  # shallow copy so the caller's history is untouched
    for turn in range(5):
        response = llm.invoke(messages)
        if not response.tool_calls:
            return response.content
        # Keep the assistant turn in history so every tool result has a matching call
        messages.append(response)
        for tc in response.tool_calls:  # LangChain dicts with "name", "args", "id"
            if tc["name"] in overrides:
                result = overrides[tc["name"]]
            else:
                result = execute_tool(tc["name"], tc["args"])
            messages.append({"role": "tool", "tool_call_id": tc["id"], "content": json.dumps(result)})
    return "Max turns"
  • Compare tool selections between models. Run the same prompt on different models and diff the outputs.
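from langchain_ollama import ChatOllama

def compare_models(prompt: str, models: list[str], tools: list | None = None):
    """Run one prompt against several Ollama models and collect their tool choices."""
    results = {}
    for model in models:
        llm = ChatOllama(model=model, temperature=0)
        if tools:
            # Without bound tools the model has nothing to call
            llm = llm.bind_tools(tools)
        response = llm.invoke(prompt)
        results[model] = {
            # LangChain exposes tool calls as dicts with "name", "args", "id"
            "tool_calls": [tc["name"] for tc in (response.tool_calls or [])],
            # Ollama typically reports the stop reason as "done_reason"
            "finish_reason": response.response_metadata.get("done_reason")
        }
    return results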
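`compare_models` collects the results but leaves the diff to you. A minimal sketch of that last step, assuming the result shape returned above (one `tool_calls` list of tool names per model): group the models by the tool sequence they chose. The function names and the model names in the sample are placeholders.

```python
def group_by_tool_choice(results):
    """Map each ordered tool sequence to the list of models that chose it."""
    groups = {}
    for model, info in results.items():
        key = tuple(info.get("tool_calls", []))
        groups.setdefault(key, []).append(model)
    return groups

def format_diff(results):
    """Render one line per distinct tool sequence, or a short note if all agree."""
    groups = group_by_tool_choice(results)
    if len(groups) <= 1:
        return "All models chose the same tools."
    lines = []
    for tools, models in groups.items():
        label = " -> ".join(tools) if tools else "(no tool call)"
        lines.append(f"{label}: {', '.join(models)}")
    return "\n".join(lines)

# Sample data with placeholder model names
sample = {
    "model-a": {"tool_calls": ["search_web"]},
    "model-b": {"tool_calls": ["calculate"]},
    "model-c": {"tool_calls": ["search_web"]},
}
print(format_diff(sample))
```

A model that lands alone in its own group is usually the first place to look, whether it is the one behaving correctly or not.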

Verification

python -c "
audit = {'entries': []}
for i in range(3):
    audit['entries'].append({'turn': i, 'tool': 'search_web' if i % 2 == 0 else 'calculate'})
print(len(audit['entries']))
# Expected: 3
"

Common failures

  • Reasoning hidden by the model. Some models don't expose chain-of-thought. Use a model with thinking or reasoning output when available.
  • Tool result truncation hides issues. If your agent truncates tool results before passing them to the LLM (for example, to the first 500 characters), the model can miss critical data further down. Log the full result separately.
  • Non-deterministic behavior. With temperature > 0 the model can pick different tools on each run, which makes failures hard to reproduce. Set temperature to 0 during debugging sessions, and fix the seed if your runtime supports one, since temperature 0 alone does not always guarantee identical output.
  • Version mismatch. The installed package or runtime differs from the version this guide targets; check the version first, then rerun the smallest verification command.
  • Local environment drift. A different service, virtual environment, model, or path is in use than the one you expect; print the active binary path and configuration before changing the guide steps.
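For the truncation failure above, one option is to route every tool result through a single helper that logs the full text before cutting it. This is a sketch under assumptions: the helper name `prepare_tool_result`, the logger name, and the appended truncation marker are illustrative, and the 500-character default only mirrors the example figure in this guide.

```python
import hashlib
import json
import logging

log = logging.getLogger("agent.tools")

def prepare_tool_result(tool_name, result, limit=500):
    """Log the full result, then return the (possibly truncated) text for the LLM."""
    full_text = result if isinstance(result, str) else json.dumps(result, default=str)
    # A short digest lets you match the truncated message to the full log line
    digest = hashlib.sha256(full_text.encode("utf-8")).hexdigest()[:12]
    log.debug("tool=%s sha256=%s len=%d full_result=%s",
              tool_name, digest, len(full_text), full_text)
    if len(full_text) <= limit:
        return full_text
    dropped = len(full_text) - limit
    log.warning("tool=%s result truncated: %d chars dropped (sha256=%s)",
                tool_name, dropped, digest)
    # Tell the model that data is missing instead of cutting silently
    return full_text[:limit] + f"\n[truncated {dropped} chars]"
```

The warning line gives you something to grep for when a wrong answer might trace back to data the model never saw.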

Related guides

  • How to Implement Logging for Agent Debugging
  • How to Test Agent Behavior with Unit Tests