15. Agent Evaluation
Chapter 15 of 16 · 20 min
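Evaluating agents requires measuring task completion, tool usage efficiency, and output quality. Traditional LLM benchmarks score single responses, so they do not capture multi-step behavior such as tool selection and recovery from errors. Agent-specific benchmarks exist, but a local agent with its own tool set usually needs a custom evaluation framework as well.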
Core metrics
- Task completion rate: Did the agent finish the task? Binary pass/fail
- Tool call accuracy: Did the agent call the right tools in the right order?
- Turns to completion: How many LLM calls before the agent finishes?
- Error rate: What fraction of tool calls produced errors?
- Token efficiency: Total tokens used relative to task complexity, for example tokens spent per completed task
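The five metrics above can be computed from per-task run records. The sketch below is one possible way to do it, not a fixed definition: `RunRecord`, its field names, and `summarize` are hypothetical names introduced for illustration, and token efficiency is reported as total tokens per completed task, which is only one reasonable reading of the metric.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One agent run. All field names are assumptions made for this sketch."""
    completed: bool        # did the agent finish the task?
    tools_called: list     # tool names in the order the agent called them
    expected_tools: list   # the reference tool sequence
    llm_calls: int         # number of LLM calls (turns)
    tool_errors: int       # tool calls that raised or returned an error
    total_tokens: int      # prompt + completion tokens for the whole run


def summarize(records: list) -> dict:
    """Aggregate the five core metrics over a list of RunRecord objects."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")
    completed = [r for r in records if r.completed]
    total_tool_calls = sum(len(r.tools_called) for r in records)
    total_tokens = sum(r.total_tokens for r in records)
    return {
        "task_completion_rate": len(completed) / n,
        # Strict accuracy: the called sequence must equal the expected one
        "tool_call_accuracy": sum(r.tools_called == r.expected_tools for r in records) / n,
        "mean_turns": sum(r.llm_calls for r in records) / n,
        "error_rate": (sum(r.tool_errors for r in records) / total_tool_calls
                       if total_tool_calls else 0.0),
        # All tokens spent, divided by the number of tasks that actually finished
        "tokens_per_completed_task": (total_tokens / len(completed)
                                      if completed else float("inf")),
    }


records = [
    RunRecord(True, ["calculator"], ["calculator"], 2, 0, 400),
    RunRecord(False, ["web_search", "web_search"], ["web_search"], 3, 1, 900),
]
summary = summarize(records)
```

Charging failed runs' tokens to the completed tasks is a deliberate choice here: an agent that burns tokens on tasks it never finishes should look less efficient, not more.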
Building a test suite
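# Assumes agent.tool_map is a dict of name -> tool, and each tool exposes invoke()
class LoggingTool:
    """Wraps a tool and records its name each time invoke() is called."""

    def __init__(self, name, tool, log):
        self.name, self.tool, self.log = name, tool, log

    def invoke(self, *args, **kwargs):
        self.log.append(self.name)
        return self.tool.invoke(*args, **kwargs)

    def __getattr__(self, attr):
        # Forward any other attribute (description, schema, ...) to the real tool
        return getattr(self.tool, attr)


class AgentTestSuite:
    def __init__(self, agent):
        self.agent = agent
        self.results = []

    def run_test(self, task: str, expected_tools: list, max_turns: int = 10) -> dict:
        tool_log = []
        # Swap each tool for a logging wrapper. Assigning to
        # tool_map.__getitem__ does not work: dict instances reject it,
        # and special methods are looked up on the type anyway.
        original_tools = dict(self.agent.tool_map)
        for name, tool in original_tools.items():
            self.agent.tool_map[name] = LoggingTool(name, tool, tool_log)
        try:
            result = self.agent.run(task, max_turns=max_turns)
        finally:
            # Restore the real tools even if the run raises
            self.agent.tool_map.update(original_tools)
        return {
            "task": task,
            "result": result,
            "tools_called": tool_log,
            "expected_tools": expected_tools,
            "correct_sequence": tool_log == expected_tools,
            # Approximation: one LLM call per tool call, plus the final answer
            "turns": len(tool_log) + 1,
        }

    def run_suite(self, tests: list) -> dict:
        # Keep this run's results separate so repeated calls do not skew the pass rate
        run_results = [self.run_test(**test) for test in tests]
        self.results.extend(run_results)
        passed = sum(1 for r in run_results if r["correct_sequence"])
        return {
            "total": len(tests),
            "passed": passed,
            "pass_rate": passed / len(tests) if tests else 0.0,
            "details": run_results,
        }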
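The `correct_sequence` check is strict: a single extra retry call fails the whole test. As a possible complement, the sketch below scores partial matches with `difflib.SequenceMatcher` from the Python standard library. The function name `sequence_similarity` is hypothetical, and the similarity ratio is one convenient choice rather than a standard agent metric.

```python
from difflib import SequenceMatcher


def sequence_similarity(called: list, expected: list) -> float:
    """Return a score in [0, 1]; 1.0 means the two tool sequences are identical."""
    if not called and not expected:
        return 1.0  # nothing expected, nothing called
    # autojunk=False disables the heuristic that ignores very frequent items
    return SequenceMatcher(None, called, expected, autojunk=False).ratio()


exact = sequence_similarity(["web_search", "calculator"], ["web_search", "calculator"])
retry = sequence_similarity(["web_search", "web_search", "calculator"],
                            ["web_search", "calculator"])
wrong = sequence_similarity(["calculator"], ["web_search"])
```

Here `exact` is 1.0, `retry` is 0.8 (one redundant call), and `wrong` is 0.0. Reporting this score next to the binary result separates near misses from completely wrong tool choices.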
Benchmark systems
For comparing agents against published baselines, use these resources:
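- GAIA (General AI Assistants): A benchmark from Meta AI, Hugging Face, and collaborators for AI assistants, with multi-step tasks requiring web search, code execution, and file manipulation
- MMLU (Massive Multitask Language Understanding): General knowledge benchmark. It measures the underlying model rather than agent behavior, so treat it as a model baseline only
- AgentBoard: Evaluates multi-turn agents on tasks including MiniGrid-style grid worlds and web navigation, with fine-grained progress metrics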
To run a benchmark locally, adapt its open-source evaluation scripts so that they call your local agent class.
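One way to adapt a benchmark is to reduce it to question and answer pairs and score the agent's final answer. The sketch below assumes a hypothetical JSONL layout with `question` and `answer` fields and an agent object exposing `run(task)`. Real benchmarks such as GAIA ship their own file formats and scoring rules, which should take precedence over this simplified exact-match scorer. `EchoAgent` is a stand-in used only to make the example runnable.

```python
import json


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not fail a match."""
    return " ".join(str(text).lower().split())


def load_tasks(jsonl_text: str) -> list:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def score_agent(agent, tasks: list) -> dict:
    """Run every task through agent.run and count exact matches after normalization."""
    correct = 0
    for task in tasks:
        prediction = agent.run(task["question"])
        if normalize(prediction) == normalize(task["answer"]):
            correct += 1
    return {
        "total": len(tasks),
        "correct": correct,
        "accuracy": correct / len(tasks) if tasks else 0.0,
    }


class EchoAgent:
    """Stand-in agent for the demo; replace with your local agent class."""

    def run(self, task: str) -> str:
        return "William  Shakespeare" if "Hamlet" in task else "unknown"


sample = (
    '{"question": "Who wrote Hamlet?", "answer": "william shakespeare"}\n'
    '{"question": "What is 12 squared?", "answer": "144"}\n'
)
report = score_agent(EchoAgent(), load_tasks(sample))
```

With the stand-in agent, `report` shows one correct answer out of two. Exact match after normalization is a coarse scorer; free-form answers usually need the benchmark's own matching rules.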
Regression testing
Set up automated tests that run on every code change:
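import pytest

# agent_with_calculator() and agent_with_web_search() are factory helpers
# you define yourself; each returns an agent configured with that one tool.


def test_calculator_tool():
    suite = AgentTestSuite(agent_with_calculator())
    result = suite.run_test(
        task="Calculate the square root of 144",
        expected_tools=["calculator"],
    )
    # The agent should call the calculator exactly once and nothing else
    assert result["correct_sequence"]


def test_web_search_tool():
    suite = AgentTestSuite(agent_with_web_search())
    result = suite.run_test(
        task="Who wrote Hamlet?",
        expected_tools=["web_search"],
    )
    # Check the final answer rather than the tool sequence
    assert "Shakespeare" in result["result"]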
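Per-tool assertions catch hard failures, but gradual drift (more turns, a lower pass rate) only shows up when a run is compared against a stored baseline. A minimal sketch, assuming metrics are kept as a flat dictionary and that you choose the tolerance values yourself; `check_regression` and the metric names are hypothetical.

```python
def check_regression(current: dict, baseline: dict, tolerances: dict) -> list:
    """Return a list of readable regression messages; an empty list means the run is acceptable.

    tolerances maps metric name -> (direction, allowed_change), where direction
    is "higher_is_better" or "lower_is_better".
    """
    failures = []
    for name, (direction, allowed) in tolerances.items():
        old, new = baseline[name], current[name]
        # A positive drop means the metric moved in the bad direction
        drop = old - new if direction == "higher_is_better" else new - old
        if drop > allowed:
            failures.append(f"{name}: {old} -> {new} (allowed change {allowed})")
    return failures


baseline = {"pass_rate": 0.90, "mean_turns": 3.0}
tolerances = {
    "pass_rate": ("higher_is_better", 0.05),
    "mean_turns": ("lower_is_better", 0.5),
}

ok = check_regression({"pass_rate": 0.88, "mean_turns": 3.2}, baseline, tolerances)
bad = check_regression({"pass_rate": 0.70, "mean_turns": 4.0}, baseline, tolerances)
```

In a test, `assert not check_regression(current, baseline, tolerances)` would fail the build and print the offending metrics. The baseline itself can live in a JSON file that is updated deliberately when behavior is meant to change.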
EXERCISE
Design a test suite of 10 tasks with known correct tool sequences. Run the suite against your agent and compute all five core metrics. Identify the worst-performing metric and trace it back to the specific failure mode causing it.