16. Error Recovery

Chapter 16 of 24 · 15 min

Multi-agent systems encounter errors at multiple layers: LLM generation failures, tool invocation timeouts, agent communication breakdowns, and orchestration logic exceptions. Effective recovery mechanisms resolve most of these automatically and reserve manual intervention for failures the system cannot fix on its own.

Error Classification Taxonomy

Not all errors warrant the same recovery strategy. Classification guides response selection.

Retryable Errors: Transient failures caused by rate limits, network timeouts, or service unavailability. Retrying with exponential backoff usually resolves these without further action.

Correctable Errors: Semantic failures where the agent can self-correct given feedback. The system provides error context and allows retry with modified input.

Fatal Errors: Logical impossibilities or permanent failures requiring human intervention. The system escalates with full context.
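The correctable case can be sketched as a small feedback loop. This is a minimal illustration rather than a prescribed implementation: `run_with_feedback`, the `validate` callback, and the `error_feedback` key are hypothetical names chosen for this example, and the sketch assumes the validator signals a semantic failure by raising `ValueError`.

```python
from typing import Callable


def run_with_feedback(
    agent_call: Callable[[dict], dict],
    validate: Callable[[dict], None],
    input_data: dict,
    max_corrections: int = 2,
) -> dict:
    """Call the agent, feeding validation errors back until the output passes."""
    current_input = dict(input_data)
    last_error = None
    # One initial call plus up to max_corrections corrected calls.
    for _ in range(max_corrections + 1):
        output = agent_call(current_input)
        try:
            validate(output)  # hypothetical validator: raises ValueError on bad output
            return output
        except ValueError as err:
            last_error = err
            # Keep the original input and attach the error so the agent can adjust.
            current_input = {**input_data, "error_feedback": str(err)}
    raise RuntimeError(
        f"output still invalid after {max_corrections} corrections: {last_error}"
    )
```

Once the correction budget is spent, the sketch raises instead of returning a bad result, which lets the caller treat the failure as fatal and escalate.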

# recovery/strategies.py
from enum import Enum
from dataclasses import dataclass
from typing import Any, Callable
import time

class ErrorSeverity(Enum):
    RETRYABLE = "retryable"
    CORRECTABLE = "correctable"
    FATAL = "fatal"

@dataclass
class ErrorContext:
    error_type: str
    message: str
    agent_id: str
    trace_id: str
    previous_attempts: int  # failed attempts so far, including the initial call
    original_input: dict

class RecoveryStrategy:
    def __init__(self, max_retries: int = 3, backoff_base: float = 1.0):
        self.max_retries = max_retries
        self.backoff_base = backoff_base  # delay in seconds before the first retry

    def classify_error(self, error: Exception) -> ErrorSeverity:
        # Illustrative default mapping (an assumption, not a fixed rule);
        # override it in a subclass to match the errors your agents raise.
        if isinstance(error, (TimeoutError, ConnectionError)):
            return ErrorSeverity.RETRYABLE
        if isinstance(error, ValueError):
            return ErrorSeverity.CORRECTABLE
        return ErrorSeverity.FATAL

    def should_retry(self, context: ErrorContext) -> bool:
        # The initial call is not a retry, so max_retries retries
        # allow max_retries + 1 failed attempts in total.
        return context.previous_attempts <= self.max_retries

    def compute_delay(self, attempt: int) -> float:
        # Exponential backoff: base, 2 * base, 4 * base, ...
        return self.backoff_base * (2 ** attempt)

    def execute_with_recovery(
        self,
        operation: Callable[[], Any],
        error_handler: Callable[[ErrorContext, Exception], Any],
        context: ErrorContext,
    ) -> Any:
        attempt = 0
        while True:
            try:
                return operation()
            except Exception as e:
                # Record what failed so the handler sees the latest error.
                error_context = ErrorContext(
                    error_type=type(e).__name__,
                    message=str(e),
                    agent_id=context.agent_id,
                    trace_id=context.trace_id,
                    previous_attempts=attempt + 1,
                    original_input=context.original_input,
                )

                # Hand off on fatal errors or once the retry budget is spent.
                if (
                    self.classify_error(e) == ErrorSeverity.FATAL
                    or not self.should_retry(error_context)
                ):
                    return error_handler(error_context, e)

                time.sleep(self.compute_delay(attempt))
                attempt += 1

class AgentRecoveryManager:
    def __init__(self, strategies: dict[str, RecoveryStrategy]):
        self.strategies = strategies
        self.fallback_agents: dict[str, Any] = {}

    def register_fallback(self, agent_id: str, fallback_agent: Any) -> None:
        self.fallback_agents[agent_id] = fallback_agent

    def execute_with_fallback(
        self,
        primary_agent: Any,
        input_data: dict,
        context: ErrorContext,
    ) -> dict:
        try:
            return primary_agent.invoke(input_data)
        except Exception:
            fallback = self.fallback_agents.get(context.agent_id)
            if fallback is None:
                raise  # no fallback registered: propagate the original error
            return fallback.invoke(input_data)
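The exponential backoff used for retryable errors is often capped and randomized, so that many agents hitting the same rate limit do not all retry at the same instant. The following self-contained sketch shows one way to compute such a schedule. The function name `backoff_delays` and its `cap`, `jitter`, and `rng` parameters are assumptions made for this illustration and are not part of the listing above.

```python
import random
from typing import Optional


def backoff_delays(
    max_retries: int,
    base: float = 1.0,
    cap: float = 30.0,
    jitter: bool = True,
    rng: Optional[random.Random] = None,
) -> list[float]:
    """Return the wait in seconds before each retry: base * 2**attempt, capped."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            # "Full jitter": pick uniformly in [0, delay] so clients spread out.
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays


# Without jitter the schedule is deterministic.
print(backoff_delays(5, base=1.0, cap=10.0, jitter=False))
# → [1.0, 2.0, 4.0, 8.0, 10.0]
```

The cap keeps a long outage from producing multi-minute waits, and passing a seeded `random.Random` makes the jittered schedule reproducible in tests.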

Graceful Degradation

When recovery fails, the system must degrade gracefully: the failure of one agent should not cascade into total system failure. Injecting fallback agents as dependencies lets them take over critical responsibilities when the primary agent is unavailable.
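One way to picture graceful degradation is a priority-ordered chain that ends in a static default. The sketch below is an illustration under stated assumptions: `invoke_with_degradation`, the `(name, callable)` pairs, and the shape of the returned dictionary are hypothetical, and it catches `Exception` broadly only to keep the example short.

```python
from typing import Callable, Sequence


def invoke_with_degradation(
    agents: Sequence[tuple[str, Callable[[dict], dict]]],
    input_data: dict,
    default_response: dict,
) -> dict:
    """Try each (name, callable) in priority order, then fall back to a default."""
    errors: list[str] = []
    for name, invoke in agents:
        try:
            result = invoke(input_data)
        except Exception as exc:
            # Remember why this agent failed and move on to the next one.
            errors.append(f"{name}: {type(exc).__name__}: {exc}")
            continue
        return {
            "result": result,
            "served_by": name,
            "degraded": bool(errors),  # True if any higher-priority agent failed
            "errors": errors,
        }
    # Every agent failed: return the static default instead of raising.
    return {
        "result": default_response,
        "served_by": "default",
        "degraded": True,
        "errors": errors,
    }
```

Marking the response as degraded lets downstream agents decide whether a reduced-quality answer is acceptable for their step.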

Human-in-the-Loop Escalation

Errors that exceed recovery thresholds trigger escalation workflows. Escalation payloads include the full trace context, the error history, and the input data, so that a human can triage the failure efficiently.
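An escalation payload of this kind might be assembled as follows. This is a sketch under assumptions: the `EscalationPayload` class, its field names, and the `should_escalate` threshold check are hypothetical and only mirror the three ingredients named above (trace context, error history, input data).

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class EscalationPayload:
    trace_id: str
    agent_id: str
    input_data: dict
    error_history: list[dict] = field(default_factory=list)
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def record(self, error: Exception, attempt: int) -> None:
        """Append one failed attempt to the history."""
        self.error_history.append(
            {
                "attempt": attempt,
                "error_type": type(error).__name__,
                "message": str(error),
            }
        )

    def to_json(self) -> str:
        # default=str keeps serialization from failing on non-JSON input values.
        return json.dumps(asdict(self), indent=2, default=str)


def should_escalate(payload: EscalationPayload, max_attempts: int = 3) -> bool:
    """Escalate once the number of recorded failures reaches the threshold."""
    return len(payload.error_history) >= max_attempts
```

Serializing to JSON keeps the payload independent of the delivery channel, whether that is a ticketing system, a message queue, or a chat notification.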

EXERCISE

Design a circuit breaker pattern for agent invocations that temporarily disables agents with failure rates exceeding 50% within a 60-second window and automatically re-enables them after a recovery period.