16. Error Recovery
Multi-agent systems encounter errors at multiple layers: LLM generation failures, tool invocation timeouts, agent communication breakdowns, and orchestration logic exceptions. Effective recovery mechanisms maintain system integrity without manual intervention wherever possible, and escalate to a human only when automatic recovery is exhausted.
Error Classification Taxonomy
Not all errors warrant the same recovery strategy. Classification guides response selection.
Retryable Errors: Transient failures from rate limits, network timeouts, or service unavailability. Retrying with exponential backoff usually resolves these without further action.
Correctable Errors: Semantic failures where the agent can self-correct given feedback. The system provides error context and allows retry with modified input.
Fatal Errors: Logical impossibilities or permanent failures requiring human intervention. The system escalates with full context.
```python
# recovery/strategies.py
import time
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable


class ErrorSeverity(Enum):
    RETRYABLE = "retryable"
    CORRECTABLE = "correctable"
    FATAL = "fatal"


@dataclass
class ErrorContext:
    error_type: str
    message: str
    agent_id: str
    trace_id: str
    previous_attempts: int
    original_input: dict


class RecoveryStrategy:
    def __init__(self, max_retries: int = 3, backoff_base: float = 1.0):
        self.max_retries = max_retries
        self.backoff_base = backoff_base

    def should_retry(self, context: ErrorContext) -> bool:
        return context.previous_attempts < self.max_retries

    def compute_delay(self, attempt: int) -> float:
        # Exponential backoff: backoff_base, then 2x, 4x, ... seconds.
        # (backoff_base ** attempt would stay at 1.0 for the default base.)
        return self.backoff_base * (2 ** attempt)

    def classify_error(self, error: Exception) -> ErrorSeverity:
        # Assumed mapping for illustration; adapt it to the exceptions
        # your LLM client and tools actually raise.
        if isinstance(error, (TimeoutError, ConnectionError)):
            return ErrorSeverity.RETRYABLE
        if isinstance(error, ValueError):
            return ErrorSeverity.CORRECTABLE
        return ErrorSeverity.FATAL

    def execute_with_recovery(
        self,
        operation: Callable[[], Any],
        error_handler: Callable,
        context: ErrorContext,
    ) -> Any:
        for attempt in range(self.max_retries + 1):
            try:
                return operation()
            except Exception as e:
                # Build an up-to-date context so the handler sees the latest error.
                error_context = ErrorContext(
                    error_type=type(e).__name__,
                    message=str(e),
                    agent_id=context.agent_id,
                    trace_id=context.trace_id,
                    previous_attempts=attempt + 1,
                    original_input=context.original_input,
                )
                # Hand off on fatal errors or once retries are exhausted.
                if (
                    attempt >= self.max_retries
                    or self.classify_error(e) == ErrorSeverity.FATAL
                ):
                    return error_handler(error_context, e)
                time.sleep(self.compute_delay(attempt))
        # Only reached if max_retries is negative and the loop never runs.
        return error_handler(context, None)


class AgentRecoveryManager:
    def __init__(self, strategies: dict[str, RecoveryStrategy]):
        self.strategies = strategies
        self.fallback_agents: dict[str, Any] = {}

    def register_fallback(self, agent_id: str, fallback_agent: Any) -> None:
        self.fallback_agents[agent_id] = fallback_agent

    def execute_with_fallback(
        self,
        primary_agent: Any,
        input_data: dict,
        context: ErrorContext,
    ) -> dict:
        try:
            return primary_agent.invoke(input_data)
        except Exception:
            fallback = self.fallback_agents.get(context.agent_id)
            if fallback is None:
                raise  # No fallback registered: propagate the original error.
            return fallback.invoke(input_data)
```
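The strategy above re-runs the same operation on every attempt, which suits retryable errors but not correctable ones, where the agent needs to see what went wrong. A minimal sketch of that feedback loop follows. `CorrectableError`, `run_with_correction`, the `feedback` and `attempt` input keys, and `toy_agent` are all hypothetical names introduced for illustration, not part of the listing above.

```python
from typing import Callable


class CorrectableError(Exception):
    """Hypothetical exception raised when an agent's output fails validation."""


def run_with_correction(
    agent: Callable[[dict], dict],
    input_data: dict,
    max_corrections: int = 2,
) -> dict:
    """Re-invoke the agent with the previous error attached as feedback."""
    current_input = dict(input_data)
    for attempt in range(max_corrections + 1):
        try:
            return agent(current_input)
        except CorrectableError as exc:
            if attempt == max_corrections:
                raise  # Corrections exhausted: let the caller escalate.
            # Attach the error so the next attempt can differ from this one.
            current_input = {
                **input_data,
                "feedback": str(exc),
                "attempt": attempt + 1,
            }
    # Only reached if max_corrections is negative.
    raise ValueError("max_corrections must be non-negative")


def toy_agent(data: dict) -> dict:
    """Stand-in for an LLM agent: fails until it has received feedback."""
    if "feedback" not in data:
        raise CorrectableError("output was not valid JSON")
    return {"answer": 42, "corrected_after": data["attempt"]}


result = run_with_correction(toy_agent, {"question": "6 * 7"})
print(result)
```

In a real system the feedback would typically be folded into the prompt rather than passed as a separate key; the dictionary form here only keeps the sketch self-contained.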
Graceful Degradation
When recovery fails, systems must degrade gracefully. The failure of one agent should not cascade into total system failure. Injecting agents as dependencies, rather than hard-coding them, lets a registered fallback agent take over a critical responsibility when the primary agent fails.
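One way to picture this is an ordered chain of injected agents that ends in a minimal placeholder response. The sketch below assumes agents are plain callables taking and returning a dict; `invoke_with_degradation`, the `degraded` flag, and the two example agents are hypothetical names chosen for illustration.

```python
from typing import Callable

Agent = Callable[[dict], dict]


def invoke_with_degradation(agents: list[Agent], input_data: dict) -> dict:
    """Try each injected agent in order; return a placeholder if all fail."""
    errors: list[str] = []
    for position, agent in enumerate(agents):
        try:
            result = agent(input_data)
            # Results from anything other than the primary are flagged as degraded.
            return {**result, "degraded": position > 0}
        except Exception as exc:
            errors.append(f"{type(exc).__name__}: {exc}")
    # Every agent failed: return a minimal response instead of propagating.
    return {"output": None, "degraded": True, "errors": errors}


def primary(data: dict) -> dict:
    raise TimeoutError("model endpoint timed out")


def cached_summary(data: dict) -> dict:
    return {"output": f"cached answer for {data['query']}"}


response = invoke_with_degradation([primary, cached_summary], {"query": "status"})
print(response)
```

Marking the response as degraded lets downstream agents decide whether a lower-quality answer is acceptable for their step.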
Human-in-the-Loop Escalation
Errors that exceed recovery thresholds trigger escalation workflows. Escalation payloads include the full trace context, the error history, and the original input data, so that a human can triage efficiently.
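As a rough sketch of what such a payload might look like, the example below collects the error history per trace and serializes it for a reviewer. The `EscalationPayload` class, its field names, and the threshold of three attempts are assumptions made for illustration; how the payload is delivered (ticket, queue, chat message) is left open.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EscalationPayload:
    trace_id: str
    agent_id: str
    original_input: dict
    error_history: list[dict] = field(default_factory=list)

    def record(self, exc: Exception, attempt: int) -> None:
        """Append one failed attempt to the history."""
        self.error_history.append(
            {
                "attempt": attempt,
                "error_type": type(exc).__name__,
                "message": str(exc),
            }
        )

    def to_json(self) -> str:
        """Serialize the whole payload for a human reviewer."""
        return json.dumps(asdict(self), indent=2)


def should_escalate(payload: EscalationPayload, max_attempts: int = 3) -> bool:
    """Escalate once the number of recorded failures reaches the threshold."""
    return len(payload.error_history) >= max_attempts


payload = EscalationPayload("trace-123", "planner", {"task": "book flight"})
for attempt in range(1, 4):
    payload.record(TimeoutError("tool call timed out"), attempt)

if should_escalate(payload):
    print(payload.to_json())
```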
Design a circuit breaker pattern for agent invocations that temporarily disables agents with failure rates exceeding 50% within a 60-second window and automatically re-enables them after a recovery period.
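One possible starting point, offered as a sketch rather than a reference solution: a sliding window of recent call outcomes, with the breaker opening when the failure rate in the window exceeds the threshold. The class name, the `min_calls` guard against tripping on a single failure, the 30-second recovery period, and the injectable `clock` are all assumptions, since the exercise does not fix them.

```python
import time
from collections import deque
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: float = 0.5,
        window_seconds: float = 60.0,
        recovery_seconds: float = 30.0,  # assumed value
        min_calls: int = 4,  # assumed guard against tiny samples
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.recovery_seconds = recovery_seconds
        self.min_calls = min_calls
        self.clock = clock
        self._calls = deque()  # entries are (timestamp, succeeded)
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if the agent may be invoked right now."""
        if self._opened_at is None:
            return True
        if self.clock() - self._opened_at >= self.recovery_seconds:
            # Recovery period elapsed: re-enable with a clean window.
            self._opened_at = None
            self._calls.clear()
            return True
        return False

    def record(self, succeeded: bool) -> None:
        """Record one outcome and open the breaker if the rate is too high."""
        now = self.clock()
        self._calls.append((now, succeeded))
        # Drop outcomes that have fallen out of the sliding window.
        while self._calls and now - self._calls[0][0] > self.window_seconds:
            self._calls.popleft()
        failures = sum(1 for _, ok in self._calls if not ok)
        if (
            len(self._calls) >= self.min_calls
            and failures / len(self._calls) > self.failure_threshold
        ):
            self._opened_at = now

    def call(self, agent: Callable[[dict], dict], input_data: dict) -> dict:
        """Invoke the agent through the breaker, recording the outcome."""
        if not self.allow():
            raise RuntimeError("circuit open: agent temporarily disabled")
        try:
            result = agent(input_data)
        except Exception:
            self.record(False)
            raise
        self.record(True)
        return result
```

A fuller design would usually add a half-open state that lets a single probe call through before fully re-enabling the agent; this sketch re-enables it outright once the recovery period has passed.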