19. Multi-Agent Testing

Chapter 19 of 24 · 15 min

Testing multi-agent systems requires strategies that validate both individual agent behavior and emergent system properties. Traditional unit testing falls short for interaction-heavy architectures.

Testing Pyramid for Multi-Agent Systems

Unit Level: Individual agent behavior in isolation. Mock external dependencies, verify tool selection logic, validate prompt handling.

Integration Level: Agent-to-agent communication patterns. Test message routing, verify context propagation, validate error handling across boundaries.

System Level: End-to-end workflows with realistic inputs. Verify emergent behaviors, measure performance characteristics, validate failure modes.

# testing/agent_test_framework.py
import asyncio
from dataclasses import dataclass, field
from typing import Any, Callable
from collections.abc import Awaitable

@dataclass
class TestScenario:
    name: str
    agents: dict[str, Any]
    initial_messages: dict[str, str] = field(default_factory=dict)
    expected_final_states: dict[str, str] = field(default_factory=dict)
    max_duration_seconds: float = 30

@dataclass
class TestResult:
    scenario: str
    passed: bool
    duration_ms: float
    error_message: str | None = None
    agent_states: dict[str, str] = field(default_factory=dict)

class MultiAgentTestRunner:
    def __init__(self, timeout_seconds: float = 30):
        self.timeout_seconds = timeout_seconds
        self.results: list[TestResult] = []
    
    async def run_scenario(self, scenario: TestScenario) -> TestResult:
        start_time = asyncio.get_event_loop().time()
        
        try:
            agent_states = await asyncio.wait_for(
                self._execute_scenario(scenario),
                timeout=scenario.max_duration_seconds
            )
            
            duration = (asyncio.get_event_loop().time() - start_time) * 1000
            
            passed = all(
                agent_states.get(agent_id) == expected
                for agent_id, expected in scenario.expected_final_states.items()
            )
            
            return TestResult(
                scenario=scenario.name,
                passed=passed,
                duration_ms=duration,
                agent_states=agent_states
            )
            
        except asyncio.TimeoutError:
            return TestResult(
                scenario=scenario.name,
                passed=False,
                duration_ms=self.timeout_seconds * 1000,
                error_message="Scenario timeout exceeded"
            )
        except Exception as e:
            return TestResult(
                scenario=scenario.name,
                passed=False,
                duration_ms=(asyncio.get_event_loop().time() - start_time) * 1000,
                error_message=str(e)
            )
    
    async def _execute_scenario(self, scenario: TestScenario) -> dict[str, str]:
        tasks = [
            agent.run()
            for agent in scenario.agents.values()
        ]
        await asyncio.gather(*tasks, return_exceptions=True)
        
        return {
            agent_id: agent.get_state()
            for agent_id, agent in scenario.agents.items()
        }
    
    def run_suite(self, scenarios: list[TestScenario]) -> list[TestResult]:
        return asyncio.run(
            asyncio.gather(*[self.run_scenario(s) for s in scenarios])
        )

Scenario-Based Testing

Predefined test scenarios capture critical workflows: successful end-to-end flows, error recovery paths, load patterns, and adversarial inputs. Scenario libraries grow with production experience.

Chaos Testing for Multi-Agent Systems

Randomly introduce failures—agent crashes, network partitions, tool timeouts—to validate system resilience. Chaos testing reveals assumptions about failure handling that unit tests miss.

Behavioral Testing with Traces

Instead of testing implementation details, verify behavioral properties: "For any input, no agent loops infinitely," or "If agent A fails, agent B completes its task within timeout." Model checking tools validate these properties.

EXERCISE

Design a test scenario that simulates agent communication failure—where one agent stops responding mid-conversation—and verify that the receiving agent implements appropriate timeout and fallback behavior.