Agent Evaluation — Introduction to AI Agents (Chapter 15)

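Evaluating agents requires measuring task completion, tool usage efficiency, and output quality. Traditional LLM benchmarks, which typically score a single model response, do not capture these multi-step behaviors. Agent-specific benchmarks exist, but local agents usually need custom evaluation frameworks built around their own tools and tasks.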

Core metrics

Task completion rate: Did the agent finish the task? Scored as binary pass/fail.
Tool call accuracy: Did the agent call the right tools in the right order?
Turns to completion: How many LLM calls before the agent finishes?
Error rate: What fraction of tool calls produced errors?
Token efficiency: How many total tokens were used relative to task complexity?
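As a rough illustration of how these metrics might be aggregated, the sketch below summarizes a list of run records. The `RunRecord` class and all of its field names are assumptions made for this example, not part of any agent framework, and tokens per completed task is only one possible proxy for token efficiency. Tool call accuracy is left out because it needs the expected tool sequence for each task.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """One agent run. Every field name here is an illustrative assumption."""
    completed: bool       # did the agent finish the task?
    llm_calls: int        # turns to completion
    total_tokens: int     # prompt + completion tokens across the run
    tool_errors: list = field(default_factory=list)  # one bool per tool call, True = error


def summarize(runs: list) -> dict:
    """Aggregate the core metrics over a list of RunRecord objects."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    tool_calls = sum(len(r.tool_errors) for r in runs)
    errors = sum(sum(r.tool_errors) for r in runs)
    completed = [r for r in runs if r.completed]
    return {
        "task_completion_rate": len(completed) / n,
        "avg_turns": sum(r.llm_calls for r in runs) / n,
        "tool_error_rate": errors / tool_calls if tool_calls else 0.0,
        # Tokens spent per completed task: a rough proxy for token efficiency
        "tokens_per_completed_task": (
            sum(r.total_tokens for r in runs) / len(completed)
            if completed else float("inf")
        ),
    }


runs = [
    RunRecord(completed=True, llm_calls=3, total_tokens=1200,
              tool_errors=[False, False]),
    RunRecord(completed=False, llm_calls=10, total_tokens=4800,
              tool_errors=[True, False, True, False]),
]
summary = summarize(runs)
print(summary)
```

Failed runs still count toward the token total, so an agent that burns tokens on tasks it never finishes is penalized.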

Building a test suite

class AgentTestSuite:
    def __init__(self, agent):
        self.agent = agent
        self.results = []
    
    def run_test(self, task: str, expected_tools: list, max_turns: int = 10) -> dict:
        tool_log = []
        
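        # Wrap each tool so every invocation is logged. This assumes
        # agent.tool_map is a dict of name -> tool exposing .invoke(**kwargs).
        # __getitem__ cannot be patched on a dict instance, so the map itself
        # is swapped for one holding logging proxies.
        original_map = self.agent.tool_map

        class LoggedTool:
            def __init__(self, name, tool):
                self._name, self._tool = name, tool

            def invoke(self, **kwargs):
                tool_log.append(self._name)
                return self._tool.invoke(**kwargs)

        self.agent.tool_map = {
            name: LoggedTool(name, tool) for name, tool in original_map.items()
        }
        try:
            result = self.agent.run(task, max_turns=max_turns)
        finally:
            # Restore the real tools even if the run raises
            self.agent.tool_map = original_map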
        
        return {
            "task": task,
            "result": result,
            "tools_called": tool_log,
            "expected_tools": expected_tools,
            "correct_sequence": tool_log == expected_tools,
            # Estimate only: assumes one tool call per LLM turn, plus one
            # final turn for the answer
            "turns": len(tool_log) + 1
        }
    
    def run_suite(self, tests: list) -> dict:
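        # Score only this batch; self.results keeps the cumulative history,
        # so counting it directly would skew the pass rate on repeated calls
        run_results = [self.run_test(**test) for test in tests]
        self.results.extend(run_results)

        passed = sum(1 for r in run_results if r["correct_sequence"])
        return {
            "total": len(tests),
            "passed": passed,
            # Guard against an empty test list
            "pass_rate": passed / len(tests) if tests else 0.0,
            "details": run_results
        }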
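The suite passes a test only when the logged tool sequence equals the expected one exactly, which treats one redundant call the same as a completely wrong plan. A minimal sketch of a partial-credit score is shown below, using the longest common subsequence between the two lists. The function names `lcs_length` and `tool_sequence_score` are hypothetical, and this is one possible scoring scheme rather than an established standard.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of a and b (order preserved)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def tool_sequence_score(called: list, expected: list) -> dict:
    """Compare the tools an agent called against the expected sequence."""
    matched = lcs_length(expected, called)
    if called:
        precision = matched / len(called)
    else:
        # No calls made: perfect only if none were expected
        precision = 1.0 if not expected else 0.0
    return {
        "exact": called == expected,
        # Share of expected calls that were made in the right relative order
        "recall": matched / len(expected) if expected else 1.0,
        # Share of actual calls that were expected (penalizes extra calls)
        "precision": precision,
    }


score = tool_sequence_score(
    called=["web_search", "web_search", "calculator"],
    expected=["web_search", "calculator", "file_write"],
)
print(score)
```

Here the agent repeated a search and never wrote the file, so it earns partial credit on both recall and precision instead of a flat failure.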

Benchmark systems

For comparing agents against published baselines, use these resources:

GAIA (General AI Assistants): A benchmark from researchers at Meta AI, Hugging Face, and AutoGPT for AI assistants, with multi-step tasks requiring web search, code execution, and file manipulation
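MMLU (Massive Multitask Language Understanding): General knowledge benchmark for the underlying model; it is multiple-choice and does not exercise tool use or multi-step behavior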
AgentBoard: Evaluates agents on multi-turn tasks such as MiniGrid-style grid worlds and web navigation, reporting granular progress metrics alongside final success

Run local benchmarks by adapting open-source evaluation scripts to run against your local agent class.
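Adapting an evaluation script usually comes down to three steps: load the task file, call the agent on each task, and score the answers. The sketch below assumes a JSONL task file with `question` and `answer` fields and scores by normalized exact match. Both the field names and the scoring rule are assumptions for illustration, so check the actual format and official scorer of whichever benchmark you adapt. `stub_agent` stands in for a real agent.

```python
import json
from typing import Callable, Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(text.lower().split())


def run_benchmark(lines: Iterable[str], run_agent: Callable[[str], str]) -> dict:
    """Score an agent on JSONL tasks using normalized exact match."""
    total = 0
    correct = 0
    failures = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        item = json.loads(line)
        answer = run_agent(item["question"])
        total += 1
        if normalize(answer) == normalize(item["answer"]):
            correct += 1
        else:
            failures.append(item["question"])
    return {
        "total": total,
        "correct": correct,
        "accuracy": correct / total if total else 0.0,
        "failures": failures,
    }


def stub_agent(question: str) -> str:
    """Stand-in for a real agent; swap in something like lambda q: agent.run(q)."""
    canned = {"What is 12 * 12?": " 144 "}
    return canned.get(question, "unknown")


sample = [
    '{"question": "What is 12 * 12?", "answer": "144"}',
    '{"question": "Who wrote Hamlet?", "answer": "William Shakespeare"}',
]
report = run_benchmark(sample, stub_agent)
print(report)
```

Because `run_benchmark` takes any iterable of lines, an open file handle can be passed in place of the `sample` list.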

Regression testing

Set up automated tests that run on every code change:

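import pytest

# agent_with_calculator() and agent_with_web_search() are assumed to be
# factory helpers, defined elsewhere, that each return a configured agent.
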
def test_calculator_tool():
    suite = AgentTestSuite(agent_with_calculator())
    result = suite.run_test(
        task="Calculate the square root of 144",
        expected_tools=["calculator"]
    )
    assert result["correct_sequence"]

def test_web_search_tool():
    suite = AgentTestSuite(agent_with_web_search())
    result = suite.run_test(
        task="Who wrote Hamlet?",
        expected_tools=["web_search"]
    )
    assert "Shakespeare" in result["result"]
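Individual assertions catch outright breakage, but a gradual drop in suite-level metrics can slip through. One possible complement, sketched below, is to store the metrics from a known-good run and compare each new run against them. The `check_regression` helper, the JSON baseline file, and the 0.02 tolerance are all assumptions for this example, and the comparison assumes every tracked metric is one where higher is better.

```python
import json
import tempfile
from pathlib import Path


def check_regression(current: dict, baseline_path: Path, tolerance: float = 0.02) -> list:
    """Return a list of regressions; an empty list means the check passed."""
    if not baseline_path.exists():
        # First run: record the baseline instead of failing
        baseline_path.write_text(json.dumps(current, indent=2))
        return []
    baseline = json.loads(baseline_path.read_text())
    problems = []
    for metric, old in baseline.items():
        new = current.get(metric)
        if new is None:
            problems.append(f"{metric}: missing from current run")
        elif new < old - tolerance:
            # Assumes higher is better for every tracked metric
            problems.append(f"{metric}: {old:.2f} -> {new:.2f}")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "baseline.json"
    first = check_regression({"pass_rate": 0.90}, path)    # writes the baseline
    steady = check_regression({"pass_rate": 0.89}, path)   # within tolerance
    dropped = check_regression({"pass_rate": 0.70}, path)  # regression

print(first, steady, dropped)
```

In a test, asserting that the returned list is empty turns any regression into a failing check with a readable message.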