RUNLOCALAI · v38
OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Introduction to AI Agents

15. Agent Evaluation

Chapter 15 of 16 · 20 min
KEY INSIGHT

Agent evaluation combines task completion checks (did it work?) with behavioral logging (how did it work?). Both are necessary: an agent that returns the correct final answer after 50 unnecessary tool calls still has a reasoning problem.

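Evaluating agents requires measuring task completion, tool usage efficiency, and output quality. Traditional LLM benchmarks score the text of an answer and do not capture how the agent used its tools to get there. Agent-specific benchmarks exist, but a local agent with its own tool set usually needs a custom evaluation harness built around those tools.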

Core metrics

  1. Task completion rate: Did the agent finish the task? Scored as binary pass/fail per task.
  2. Tool call accuracy: Did the agent call the right tools in the right order?
  3. Turns to completion: How many LLM calls before the agent finishes?
  4. Error rate: What fraction of tool calls produced errors?
  5. Token efficiency: Total tokens used relative to task complexity, for example tokens per completed task.
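
The five metrics above can be aggregated from per-run records. The following is a minimal sketch, not a fixed schema: `RunRecord`, its field names, and `summarize` are hypothetical names chosen for illustration, and "tokens per completed task" is only one possible proxy for token efficiency.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One agent run. All field names here are illustrative assumptions."""
    completed: bool        # did the final result pass its check?
    tools_called: list     # tool names in the order they were called
    expected_tools: list   # the known-good sequence for this task
    llm_calls: int         # turns to completion
    tool_errors: int       # tool calls that raised or returned an error
    total_tokens: int      # prompt + completion tokens across the run


def summarize(runs: list) -> dict:
    """Aggregate the five core metrics over a list of RunRecord objects."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    total_tool_calls = sum(len(r.tools_called) for r in runs)
    completed = [r for r in runs if r.completed]
    total_tokens = sum(r.total_tokens for r in runs)
    return {
        "task_completion_rate": len(completed) / n,
        # Strict definition: the whole sequence must match exactly
        "tool_call_accuracy": sum(r.tools_called == r.expected_tools for r in runs) / n,
        "mean_turns": sum(r.llm_calls for r in runs) / n,
        # Error rate is measured per tool call, not per task
        "tool_error_rate": (
            sum(r.tool_errors for r in runs) / total_tool_calls
            if total_tool_calls else 0.0
        ),
        # Tokens spent on failed runs are included in the numerator on purpose
        "tokens_per_completed_task": (
            total_tokens / len(completed) if completed else float("inf")
        ),
    }
```

Counting the tokens of failed runs against completed tasks means an agent that burns tokens on tasks it never finishes scores worse, which is usually the behavior you want to surface.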

Building a test suite

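class _LoggedTool:
    """Wraps a tool so that every invoke() call is recorded by name."""

    def __init__(self, name, tool, log):
        self._name = name
        self._tool = tool
        self._log = log

    def invoke(self, **kwargs):
        self._log.append(self._name)
        return self._tool.invoke(**kwargs)


class AgentTestSuite:
    def __init__(self, agent):
        # Assumes agent.tool_map is a dict of name -> tool, and each tool has invoke(**kwargs)
        self.agent = agent
        self.results = []

    def run_test(self, task: str, expected_tools: list, max_turns: int = 10) -> dict:
        tool_log = []

        # Swap each tool for a logging wrapper. Assigning to tool_map.__getitem__
        # does not work: a dict's special methods are read-only and are looked up
        # on the type, not the instance.
        original_tools = dict(self.agent.tool_map)
        for name, tool in original_tools.items():
            self.agent.tool_map[name] = _LoggedTool(name, tool, tool_log)

        try:
            result = self.agent.run(task, max_turns=max_turns)
        finally:
            # Restore the real tools even if the agent raises
            self.agent.tool_map.update(original_tools)

        return {
            "task": task,
            "result": result,
            "tools_called": tool_log,
            "expected_tools": expected_tools,
            "correct_sequence": tool_log == expected_tools,
            # Approximation: one LLM call per tool call, plus the final answer
            "turns": len(tool_log) + 1
        }

    def run_suite(self, tests: list) -> dict:
        # Start fresh so "passed" and "total" describe the same set of tests
        self.results = [self.run_test(**test) for test in tests]

        passed = sum(1 for r in self.results if r["correct_sequence"])
        return {
            "total": len(tests),
            "passed": passed,
            "pass_rate": passed / len(tests) if tests else 0.0,
            "details": self.results
        }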
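
The `correct_sequence` check above is an exact list comparison, so an agent that retries a tool once fails the test even if it used the right tools in the right order. One possible refinement, sketched here with the hypothetical helpers `is_subsequence` and `score_sequence`, is to report exact match, in-order match, and the number of extra calls separately.

```python
def is_subsequence(expected: list, actual: list) -> bool:
    """True if every expected tool appears in actual, in the same relative order."""
    remaining = iter(actual)
    # `tool in remaining` consumes the iterator up to the match, which enforces order
    return all(tool in remaining for tool in expected)


def score_sequence(expected: list, actual: list) -> dict:
    """Grade a tool-call sequence at three levels of strictness."""
    return {
        "exact": actual == expected,
        "in_order": is_subsequence(expected, actual),
        # Calls beyond the expected count hint at wasted turns
        "extra_calls": max(0, len(actual) - len(expected)),
    }
```

For example, `score_sequence(["web_search", "calculator"], ["web_search", "web_search", "calculator"])` reports an in-order match with one extra call rather than a plain failure, which separates "wrong tools" from "right tools, inefficient path".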

Benchmark systems

For comparing agents against published baselines, use these resources:

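  • GAIA (General AI Assistants): A benchmark from researchers at Meta AI, Hugging Face, and collaborators, built around multi-step assistant tasks that require web search, code execution, and file manipulation
  • MMLU (Massive Multitask Language Understanding): General knowledge benchmark. It measures the underlying model rather than agent behavior, so treat it as a baseline for the model, not for the agent
  • AgentBoard: Evaluates multi-turn agents on tasks such as grid-world and web navigation, with granular metrics that track partial progress as well as final success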

Run local benchmarks by adapting open-source evaluation scripts to run against your local agent class.
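
Adapting an evaluation script usually comes down to three steps: load the task file, call your agent on each task, and score the output against the reference. The sketch below assumes a JSONL file whose records have `question` and `answer` fields and an agent exposed as a plain function from string to string. Both are illustrative assumptions, not the format of any specific benchmark, so check the real dataset's schema and scoring rules before comparing against published numbers.

```python
import json
from typing import Callable, Iterable


def normalize(text) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not fail a task."""
    return " ".join(str(text).lower().split())


def load_tasks(lines: Iterable[str]) -> list:
    """Parse JSONL: one task object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def run_benchmark(agent_fn: Callable[[str], str], tasks: list) -> dict:
    """Score agent_fn with normalized exact match against each reference answer."""
    details = []
    for task in tasks:
        try:
            prediction = agent_fn(task["question"])
        except Exception as exc:
            # A crashed run counts as a failure instead of silently shrinking the task set
            prediction = f"<error: {exc}>"
        details.append({
            "question": task["question"],
            "prediction": prediction,
            "correct": normalize(prediction) == normalize(task["answer"]),
        })
    correct = sum(d["correct"] for d in details)
    return {
        "total": len(details),
        "correct": correct,
        "accuracy": correct / len(details) if details else 0.0,
        "details": details,
    }
```

With a file on disk, `load_tasks(open("tasks.jsonl"))` and `run_benchmark(lambda q: agent.run(q), tasks)` would wire this to a local agent, where `tasks.jsonl` and `agent` are placeholders for your own file and agent instance.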

Regression testing

Set up automated tests that run on every code change:

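# Run with: pytest
# agent_with_calculator() and agent_with_web_search() are factory functions
# you define yourself; each returns an agent configured with the named tool.

def test_calculator_tool():
    suite = AgentTestSuite(agent_with_calculator())
    result = suite.run_test(
        task="Calculate the square root of 144",
        expected_tools=["calculator"]
    )
    # Behavioral check: the right tool was called, exactly once
    assert result["correct_sequence"]

def test_web_search_tool():
    suite = AgentTestSuite(agent_with_web_search())
    result = suite.run_test(
        task="Who wrote Hamlet?",
        expected_tools=["web_search"]
    )
    # Result check: the final answer contains the expected fact
    assert "Shakespeare" in str(result["result"])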
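
Individual pass/fail tests catch outright breakage, but suite-level metrics can also drift between code changes. One way to catch that, sketched below under stated assumptions, is to compare the current metrics against a stored baseline with a tolerance, since local models may not produce identical runs every time. The function name `find_regressions`, the metric names, and the 5% default tolerance are all illustrative choices.

```python
# Metrics where a lower value is the regression; everything else regresses when it rises
HIGHER_IS_BETTER = {"pass_rate"}


def find_regressions(baseline: dict, current: dict, tolerance: float = 0.05) -> list:
    """Return a message for each metric that is worse than baseline by more than tolerance.

    tolerance is relative: 0.05 allows a 5% change from the baseline value.
    """
    regressions = []
    for name, old in baseline.items():
        if name not in current:
            regressions.append(f"{name}: missing from current run")
            continue
        new = current[name]
        slack = abs(old) * tolerance
        if name in HIGHER_IS_BETTER:
            worse = new < old - slack
        else:
            worse = new > old + slack
        if worse:
            regressions.append(f"{name}: {old} -> {new}")
    return regressions
```

In a test, the baseline could be loaded from a JSON file committed alongside the code, with the assertion `assert find_regressions(baseline, current) == []`. Update the baseline deliberately when a change is expected to shift the numbers.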
EXERCISE

Design a test suite of 10 tasks with known correct tool sequences. Run the suite against your agent and compute all five core metrics. Identify the worst-performing metric and trace it back to the specific failure mode causing it.
