Multi-Agent Protocols — Custom Agent Frameworks (Chapter 13)

Multi-agent systems fail in predictable ways when protocols aren't explicitly defined. Most developers assume agents will "figure it out," then spend weeks debugging mysterious deadlocks or infinite loops.

The Core Problem

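When two or more agents interact, you need a contract that specifies message ordering, termination conditions, and error propagation. Without one, each agent makes its own assumptions about when to reply, when to stop, and who handles a failure, and those assumptions rarely agree.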
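One way to make such a contract concrete is to put its three concerns into the message itself. The sketch below is illustrative only: the `Envelope` class, its field names, and the `forward` helper are assumptions made for this example, not part of any particular framework.

```python
from dataclasses import dataclass, field
from typing import Any, Optional
import uuid


@dataclass(frozen=True)
class Envelope:
    """Hypothetical message envelope; all field names are illustrative."""
    sender: str
    recipient: str
    seq: int                     # ordering: per-sender sequence number
    payload: dict[str, Any]
    hops_remaining: int = 8      # termination: how many forwards are left
    error: Optional[str] = None  # error propagation: carried with the message
    message_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def forward(msg: Envelope, new_recipient: str, next_seq: int) -> Envelope:
    """Forward a message, spending one hop; refuse once the budget is gone."""
    if msg.hops_remaining <= 0:
        raise RuntimeError("hop budget exhausted; refusing to forward")
    return Envelope(
        sender=msg.recipient,
        recipient=new_recipient,
        seq=next_seq,
        payload=msg.payload,
        hops_remaining=msg.hops_remaining - 1,
        error=msg.error,  # an upstream error travels with the message
    )
```

The hop budget is one simple way to guarantee termination: a message that bounces between agents eventually gets refused instead of looping forever.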

Protocol Design Patterns

The two most reliable patterns are request-response and publish-subscribe. Request-response fits operations where the caller cannot continue until it has a result, so it waits for the reply, up to a timeout:

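import asyncio
import uuid


class RequestResponseProtocol:
    def __init__(self, timeout_seconds: float = 30.0):
        self.timeout = timeout_seconds
        # Maps correlation_id -> future awaiting the matching response.
        self.pending: dict[str, asyncio.Future] = {}

    async def _deliver(self, agent_id: str, message: dict) -> None:
        # Transport hook: subclasses supply the actual delivery mechanism.
        raise NotImplementedError

    async def send_request(self, agent_id: str, payload: dict) -> dict:
        correlation_id = uuid.uuid4().hex
        # Create the future on the running loop rather than via asyncio.Future().
        future = asyncio.get_running_loop().create_future()
        self.pending[correlation_id] = future

        try:
            await self._deliver(agent_id, {
                "type": "request",
                "correlation_id": correlation_id,
                "payload": payload,
            })
            # Raises asyncio.TimeoutError if no response arrives in time.
            return await asyncio.wait_for(future, timeout=self.timeout)
        finally:
            # Always clean up, so late responses find no pending entry.
            self.pending.pop(correlation_id, None)

    async def handle_response(self, correlation_id: str, result: dict) -> None:
        future = self.pending.get(correlation_id)
        # Ignore unknown, late, or duplicate responses instead of raising.
        if future is not None and not future.done():
            future.set_result(result)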
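
Publish-subscribe, the second pattern named above, inverts the relationship: the sender does not wait for a reply and does not know who is listening. A minimal in-process sketch follows for contrast. `PubSubBus`, its method names, and the topic strings are hypothetical, chosen for this example only.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable

Handler = Callable[[dict], Awaitable[None]]


class PubSubBus:
    """Hypothetical in-process event bus; not taken from a real framework."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    async def publish(self, topic: str, event: dict) -> int:
        """Deliver event to every subscriber; return how many succeeded."""
        handlers = list(self._subscribers.get(topic, []))
        # return_exceptions=True keeps one failing subscriber from
        # aborting delivery to the others.
        results = await asyncio.gather(
            *(handler(event) for handler in handlers),
            return_exceptions=True,
        )
        return sum(1 for r in results if not isinstance(r, BaseException))
```

Returning the success count is a design choice for this sketch: it lets the publisher notice failed deliveries without coupling it to any individual subscriber.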

Failure Mode: Race Conditions

The most common failure happens when responses arrive out of order or are delivered more than once. Always match responses to requests by correlation ID, and never assume that arrival order matches dispatch order.
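
Correlation IDs pair a response with its request, but a stream of messages from one sender may also need to be processed in order. The sketch below shows one possible approach, assuming each sender attaches a monotonically increasing sequence number. The `seq` field and the `InOrderReceiver` class are assumptions for illustration and do not appear in the protocol above.

```python
class InOrderReceiver:
    """Hypothetical receiver: drops duplicates, releases messages in seq order."""

    def __init__(self, first_seq: int = 0) -> None:
        self._next_seq = first_seq          # next sequence number to release
        self._buffer: dict[int, dict] = {}  # early arrivals, keyed by seq

    def receive(self, seq: int, message: dict) -> list[dict]:
        """Accept one message; return the messages that are now deliverable."""
        # Already released or already buffered: treat as a duplicate.
        if seq < self._next_seq or seq in self._buffer:
            return []
        self._buffer[seq] = message
        ready = []
        # Release the longest contiguous run starting at the expected seq.
        while self._next_seq in self._buffer:
            ready.append(self._buffer.pop(self._next_seq))
            self._next_seq += 1
        return ready
```

The buffer here is unbounded, so a real system would also need a limit or a timeout for gaps that never fill.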

State Machine Approach

For complex multi-agent workflows, model each agent as a state machine with explicit transitions:

class AgentState:
    IDLE = "idle"
    WAITING = "waiting"
    PROCESSING = "processing"
    ERROR = "error"
    TERMINATED = "terminated"

Define allowed transitions explicitly. This makes the protocol auditable and testable.
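
One way to define the transitions explicitly is a lookup table consulted on every state change. The sketch below uses the same string values as `AgentState`. The particular set of allowed transitions, `AgentStateMachine`, and `InvalidTransition` are assumptions for illustration, so adapt the table to your own workflow.

```python
# Hypothetical transition table: state -> states reachable from it.
ALLOWED_TRANSITIONS: dict[str, frozenset[str]] = {
    "idle": frozenset({"waiting", "processing", "terminated"}),
    "waiting": frozenset({"processing", "error", "terminated"}),
    "processing": frozenset({"idle", "waiting", "error", "terminated"}),
    "error": frozenset({"idle", "terminated"}),
    "terminated": frozenset(),  # terminal state: nothing is reachable
}


class InvalidTransition(Exception):
    """Raised when a transition is not listed in the table."""


class AgentStateMachine:
    def __init__(self, initial: str = "idle") -> None:
        if initial not in ALLOWED_TRANSITIONS:
            raise ValueError(f"unknown state: {initial!r}")
        self.state = initial
        self.history = [initial]  # audit trail of every state visited

    def transition(self, target: str) -> None:
        if target not in ALLOWED_TRANSITIONS[self.state]:
            raise InvalidTransition(f"{self.state} -> {target} is not allowed")
        self.state = target
        self.history.append(target)
```

Because the table is plain data, a test can enumerate every pair of states and assert which ones are rejected, and the `history` list provides the audit trail.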

Local Verification Checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.