13. Multi-Agent Protocols
Multi-agent systems fail in predictable ways when protocols aren't explicitly defined. It is tempting to assume agents will "figure it out," and the usual price is time lost debugging deadlocks and infinite loops that a written protocol would have ruled out.
The Core Problem
When two or more agents interact, you need a contract that specifies message ordering, termination conditions, and error propagation. Without this, you're building on quicksand.
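One way to make such a contract concrete is a typed message envelope. The sketch below is an illustration under stated assumptions, not a prescribed design: the names `Envelope`, `MessageKind`, and `OrderedInbox` are hypothetical, as are the choices to treat both `DONE` and `ERROR` as closing the conversation and to reject (rather than buffer) out-of-sequence messages.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class MessageKind(Enum):
    REQUEST = "request"
    RESULT = "result"
    ERROR = "error"  # error propagation: failures travel as ordinary messages
    DONE = "done"    # termination: the sender will send nothing further


@dataclass(frozen=True)
class Envelope:
    sender: str
    recipient: str
    seq: int  # per-sender sequence number, used to check ordering
    kind: MessageKind
    payload: dict[str, Any] = field(default_factory=dict)


class OrderedInbox:
    """Accepts envelopes from a single sender only in strict sequence order."""

    def __init__(self) -> None:
        self.next_seq = 0
        self.closed = False

    def accept(self, msg: Envelope) -> bool:
        # Reject anything after termination or anything out of sequence.
        if self.closed or msg.seq != self.next_seq:
            return False
        self.next_seq += 1
        # Assumption: DONE and ERROR both end the conversation.
        if msg.kind in (MessageKind.DONE, MessageKind.ERROR):
            self.closed = True
        return True
```

With this in place, each of the three contract elements has a field or rule a test can exercise: `seq` for ordering, `DONE` for termination, and `ERROR` for failure reporting.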
Protocol Design Patterns
Two widely used and dependable patterns are request-response and publish-subscribe. Request-response fits operations where the caller must wait for a result before it can proceed:
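import asyncio
import uuid


class RequestResponseProtocol:
    def __init__(self, timeout_seconds: float = 30.0):
        self.timeout = timeout_seconds
        # Maps each correlation ID to the future awaiting its response.
        self.pending: dict[str, asyncio.Future] = {}

    async def send_request(self, agent_id: str, payload: dict) -> dict:
        # uuid4 stands in for the undefined generate_id() helper.
        correlation_id = uuid.uuid4().hex
        future = asyncio.get_running_loop().create_future()
        self.pending[correlation_id] = future
        try:
            await self._deliver(agent_id, {
                "type": "request",
                "correlation_id": correlation_id,
                "payload": payload,
            })
            # Raises asyncio.TimeoutError if no response arrives in time.
            return await asyncio.wait_for(future, timeout=self.timeout)
        finally:
            # Always clean up, so a late response is ignored rather than leaked.
            self.pending.pop(correlation_id, None)

    async def handle_response(self, correlation_id: str, result: dict):
        future = self.pending.get(correlation_id)
        # Ignore unknown, late, or duplicate responses; calling set_result()
        # on a completed future would raise InvalidStateError.
        if future is not None and not future.done():
            future.set_result(result)

    async def _deliver(self, agent_id: str, message: dict):
        # Placeholder: the transport is not specified here; override it.
        raise NotImplementedError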
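Publish-subscribe, the second pattern named above, decouples the sender from its receivers: a publisher emits an event on a topic without knowing who is listening. A minimal in-process sketch follows. `PubSubBus`, the topic name `task.done`, and the handler names are hypothetical, and a real deployment would more likely sit on a message broker than on a dictionary of callbacks.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable

Handler = Callable[[dict], Awaitable[None]]


class PubSubBus:
    def __init__(self) -> None:
        # Maps topic name -> handlers registered for that topic.
        self.subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self.subscribers[topic].append(handler)

    async def publish(self, topic: str, event: dict) -> int:
        """Deliver event to every subscriber; return how many succeeded."""
        handlers = list(self.subscribers.get(topic, []))
        # return_exceptions=True collects handler failures instead of
        # raising the first one back to the publisher.
        results = await asyncio.gather(
            *(handler(event) for handler in handlers),
            return_exceptions=True,
        )
        return sum(1 for r in results if not isinstance(r, Exception))


async def demo() -> list[str]:
    bus = PubSubBus()
    seen: list[str] = []

    async def logger(event: dict) -> None:
        seen.append(f"log:{event['id']}")

    async def auditor(event: dict) -> None:
        seen.append(f"audit:{event['id']}")

    bus.subscribe("task.done", logger)
    bus.subscribe("task.done", auditor)
    await bus.publish("task.done", {"id": "t1"})
    return seen
```

Unlike request-response, the publisher gets no result back from any particular subscriber, so termination and error handling have to be specified per subscriber rather than per call.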
Failure Mode: Race Conditions
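The most common failure happens when responses arrive out of order or get duplicated. Always match responses to requests by correlation ID, and never assume that message arrival order matches dispatch order. Correlation IDs alone do not stop duplicates, so response handling should also be idempotent: a second delivery for an ID that has already been resolved must be ignored.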
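The following sketch shows both hazards in one run: two responses arrive in the reverse of dispatch order, and one of them is delivered twice. The IDs `req-1` and `req-2` and the helper `on_response` are illustrative assumptions, not part of any particular framework.

```python
import asyncio


async def demo() -> dict[str, str]:
    loop = asyncio.get_running_loop()
    # Two requests dispatched in order: "req-1", then "req-2".
    pending = {cid: loop.create_future() for cid in ("req-1", "req-2")}

    def on_response(correlation_id: str, result: str) -> bool:
        """Resolve the matching future; return False if the message is ignored."""
        future = pending.get(correlation_id)
        if future is None or future.done():
            return False  # unknown ID or duplicate delivery
        future.set_result(result)
        return True

    # Responses arrive in reverse order, and "req-2" is delivered twice.
    accepted = [
        on_response("req-2", "second"),
        on_response("req-1", "first"),
        on_response("req-2", "second (duplicate)"),
    ]
    assert accepted == [True, True, False]

    # Each caller still receives the result for its own request.
    return {cid: future.result() for cid, future in pending.items()}
```

Because results are keyed by ID rather than by arrival position, the reordering is harmless, and the `done()` check turns the duplicate into a no-op.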
State Machine Approach
For complex multi-agent workflows, model each agent as a state machine with explicit transitions:
class AgentState:
    IDLE = "idle"
    WAITING = "waiting"
    PROCESSING = "processing"
    ERROR = "error"
    TERMINATED = "terminated"
Define allowed transitions explicitly. This makes the protocol auditable and testable.
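One way to define the transitions is a lookup table that every state change must pass through. The sketch below repeats the `AgentState` constants so it runs on its own. The specific edges in `ALLOWED`, along with the `Agent` and `InvalidTransition` names, are assumptions chosen for illustration; a real protocol would derive them from its own requirements.

```python
class AgentState:
    IDLE = "idle"
    WAITING = "waiting"
    PROCESSING = "processing"
    ERROR = "error"
    TERMINATED = "terminated"


# Hypothetical transition table: current state -> states it may move to.
ALLOWED = {
    AgentState.IDLE: {AgentState.WAITING, AgentState.PROCESSING,
                      AgentState.TERMINATED},
    AgentState.WAITING: {AgentState.PROCESSING, AgentState.ERROR,
                         AgentState.TERMINATED},
    AgentState.PROCESSING: {AgentState.IDLE, AgentState.WAITING,
                            AgentState.ERROR, AgentState.TERMINATED},
    AgentState.ERROR: {AgentState.IDLE, AgentState.TERMINATED},
    AgentState.TERMINATED: set(),  # terminal: no way out
}


class InvalidTransition(Exception):
    """Raised when an agent attempts a transition the table does not allow."""


class Agent:
    def __init__(self, name: str) -> None:
        self.name = name
        self.state = AgentState.IDLE
        self.history = [AgentState.IDLE]  # audit trail of visited states

    def transition(self, new_state: str) -> None:
        if new_state not in ALLOWED[self.state]:
            raise InvalidTransition(
                f"{self.name}: {self.state} -> {new_state} is not allowed"
            )
        self.state = new_state
        self.history.append(new_state)
```

Keeping the table as plain data means a test can enumerate every pair of states and assert which ones raise, and the `history` list gives an audit trail when a run goes wrong.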
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Design a protocol for three agents where Agent A must collect results from B and C before producing output. Write the state transitions in code and identify failure points where messages could be lost.