19. Multi-Agent Testing
Testing multi-agent systems requires strategies that validate both individual agent behavior and emergent system properties. Traditional unit testing falls short for interaction-heavy architectures.
Testing Pyramid for Multi-Agent Systems
Unit Level: Individual agent behavior in isolation. Mock external dependencies, verify tool selection logic, validate prompt handling.
Integration Level: Agent-to-agent communication patterns. Test message routing, verify context propagation, validate error handling across boundaries.
System Level: End-to-end workflows with realistic inputs. Verify emergent behaviors, measure performance characteristics, validate failure modes.
# testing/agent_test_framework.py
import asyncio
from dataclasses import dataclass, field
from typing import Any, Callable
from collections.abc import Awaitable
@dataclass
class TestScenario:
name: str
agents: dict[str, Any]
initial_messages: dict[str, str] = field(default_factory=dict)
expected_final_states: dict[str, str] = field(default_factory=dict)
max_duration_seconds: float = 30
@dataclass
class TestResult:
scenario: str
passed: bool
duration_ms: float
error_message: str | None = None
agent_states: dict[str, str] = field(default_factory=dict)
class MultiAgentTestRunner:
def __init__(self, timeout_seconds: float = 30):
self.timeout_seconds = timeout_seconds
self.results: list[TestResult] = []
async def run_scenario(self, scenario: TestScenario) -> TestResult:
start_time = asyncio.get_event_loop().time()
try:
agent_states = await asyncio.wait_for(
self._execute_scenario(scenario),
timeout=scenario.max_duration_seconds
)
duration = (asyncio.get_event_loop().time() - start_time) * 1000
passed = all(
agent_states.get(agent_id) == expected
for agent_id, expected in scenario.expected_final_states.items()
)
return TestResult(
scenario=scenario.name,
passed=passed,
duration_ms=duration,
agent_states=agent_states
)
except asyncio.TimeoutError:
return TestResult(
scenario=scenario.name,
passed=False,
duration_ms=self.timeout_seconds * 1000,
error_message="Scenario timeout exceeded"
)
except Exception as e:
return TestResult(
scenario=scenario.name,
passed=False,
duration_ms=(asyncio.get_event_loop().time() - start_time) * 1000,
error_message=str(e)
)
async def _execute_scenario(self, scenario: TestScenario) -> dict[str, str]:
tasks = [
agent.run()
for agent in scenario.agents.values()
]
await asyncio.gather(*tasks, return_exceptions=True)
return {
agent_id: agent.get_state()
for agent_id, agent in scenario.agents.items()
}
def run_suite(self, scenarios: list[TestScenario]) -> list[TestResult]:
return asyncio.run(
asyncio.gather(*[self.run_scenario(s) for s in scenarios])
)
Scenario-Based Testing
Predefined test scenarios capture critical workflows: successful end-to-end flows, error recovery paths, load patterns, and adversarial inputs. Scenario libraries grow with production experience.
Chaos Testing for Multi-Agent Systems
Randomly introduce failures—agent crashes, network partitions, tool timeouts—to validate system resilience. Chaos testing reveals assumptions about failure handling that unit tests miss.
Behavioral Testing with Traces
Instead of testing implementation details, verify behavioral properties: "For any input, no agent loops infinitely," or "If agent A fails, agent B completes its task within timeout." Model checking tools validate these properties.
Design a test scenario that simulates agent communication failure—where one agent stops responding mid-conversation—and verify that the receiving agent implements appropriate timeout and fallback behavior.