10. Error Recovery

Chapter 10 of 18 · 20 min

When a function call fails in production, the difference between a resilient system and a broken one comes down to error recovery strategy. Function calls can fail for many reasons: tool execution errors, malformed responses, model-generated invalid parameters, network timeouts, or resource exhaustion on the local model server.

Understanding Error Types

Function calling errors fall into three broad categories:

Model-side failures occur when the model generates malformed JSON, omits required parameters, or supplies parameters that fail validation. For example, the model might pass "next thursday" to a date parsing function that expects ISO 8601 format.
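The "next thursday" case can be caught before the tool runs by validating the model's arguments first. The following is a minimal sketch, not a prescribed implementation: the function name `validate_date_argument` and the parameter key `date` are hypothetical, and it assumes the tool wants a calendar date in ISO 8601 form.

```python
from datetime import date
from typing import Any


def validate_date_argument(arguments: dict[str, Any], key: str = "date") -> date:
    """Parse a model-supplied date, or raise ValueError with an actionable message.

    `key` is a hypothetical parameter name used only for illustration.
    """
    if key not in arguments:
        raise ValueError(f"Missing required parameter '{key}'")

    raw = arguments[key]
    if not isinstance(raw, str):
        raise ValueError(
            f"Parameter '{key}' must be a string, got {type(raw).__name__}"
        )

    try:
        # date.fromisoformat rejects free-form text such as "next thursday"
        return date.fromisoformat(raw)
    except ValueError:
        raise ValueError(
            f"Parameter '{key}' must be an ISO 8601 date such as 2025-01-30, "
            f"got {raw!r}"
        ) from None
```

Because the error message names the parameter and the expected format, it can be sent back to the model as the tool result, which gives the model a chance to correct the call on its next turn.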

Tool-side failures occur when the tool executes but encounters an error: file not found, database connection refused, API rate limit hit, or timeout exceeded.

Infrastructure failures occur at the system level: the Ollama server crashes, GPU memory is exhausted, the model fails to load, or a network partition separates components.
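One way to make these categories usable in code is to map exceptions onto them. The sketch below is a heuristic under stated assumptions: `FailureCategory`, `ModelServerError`, and `classify_failure` are hypothetical names, and the sketch assumes your own client code raises `ModelServerError` when the model server cannot be reached. A `ValueError` raised deep inside a tool would be misclassified as model-side, so treat the mapping as a starting point.

```python
import json
from enum import Enum


class FailureCategory(Enum):
    MODEL_SIDE = "model_side"
    TOOL_SIDE = "tool_side"
    INFRASTRUCTURE = "infrastructure"


class ModelServerError(Exception):
    """Hypothetical: raised by your client code when the model server is down."""


def classify_failure(error: Exception) -> FailureCategory:
    """Heuristically map an exception to one of the three categories."""
    # Infrastructure: the model server or the host itself is in trouble
    if isinstance(error, (ModelServerError, MemoryError)):
        return FailureCategory.INFRASTRUCTURE

    # Model-side: arguments could not be parsed, were missing, or failed
    # validation (json.JSONDecodeError is a subclass of ValueError)
    if isinstance(error, (ValueError, TypeError)):
        return FailureCategory.MODEL_SIDE

    # Tool-side: the tool ran and failed (missing file, refused connection,
    # timeout, and anything else not matched above)
    return FailureCategory.TOOL_SIDE
```

Keeping the category separate from the recovery decision lets the same classification feed both the recovery logic and the logs.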

Recovery Patterns

from enum import Enum
from typing import Any, Optional
import json

class RecoveryStrategy(Enum):
    RETRY = "retry"
    FALLBACK = "fallback"
    SKIP = "skip"
    ABORT = "abort"

def handle_tool_error(
    error: Exception,
    tool_name: str,
    attempt: int,
    max_retries: int = 3
) -> tuple[RecoveryStrategy, Optional[str]]:
    """Determine recovery strategy based on error type."""
    
    # Non-retryable errors
    if isinstance(error, json.JSONDecodeError):
        return RecoveryStrategy.SKIP, f"Malformed response from {tool_name}"
    
    if isinstance(error, FileNotFoundError):
        return RecoveryStrategy.ABORT, f"Required file missing for {tool_name}"
    
    # Retryable (transient) errors
    if isinstance(error, (TimeoutError, ConnectionError)):
        if attempt < max_retries:
            kind = "Timeout" if isinstance(error, TimeoutError) else "Connection issue"
            return RecoveryStrategy.RETRY, f"{kind} on attempt {attempt + 1}"
        # Retry budget exhausted
        return RecoveryStrategy.FALLBACK, f"Max retries reached for {tool_name}"

    # Any other error is not known to be transient, so do not retry it
    return RecoveryStrategy.FALLBACK, (
        f"Unexpected {type(error).__name__} in {tool_name}"
    )

Graceful Degradation

When a tool fails and no fallback exists, the system should continue operating with degraded capability rather than crashing:

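from typing import Any, Callable

class ToolExecutionError(Exception):
    """Raised when a tool fails and no recovery is possible."""

class ToolRegistry:
    def __init__(self):
        self.tools: dict[str, Callable[..., Any]] = {}
        self.fallbacks: dict[str, Callable[..., Any]] = {}
        self.degraded_mode: bool = False

    def execute_with_recovery(
        self,
        tool_name: str,
        parameters: dict[str, Any],
        max_retries: int = 3
    ) -> Any:
        attempt = 0  # zero-based; handle_tool_error reports attempt + 1
        while True:
            try:
                return self.tools[tool_name](**parameters)
            except Exception as e:
                strategy, message = handle_tool_error(
                    e, tool_name, attempt, max_retries
                )

            if strategy == RecoveryStrategy.RETRY:
                attempt += 1
                continue

            if strategy == RecoveryStrategy.FALLBACK and tool_name in self.fallbacks:
                return self.fallbacks[tool_name](**parameters)

            if strategy in (RecoveryStrategy.SKIP, RecoveryStrategy.FALLBACK):
                # Skipped, or no fallback registered: degrade instead of crashing
                self.degraded_mode = True
                return {"error": message, "degraded": True}

            # RecoveryStrategy.ABORT: surface the failure to the caller
            raise ToolExecutionError(message)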
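Degraded mode only helps if the caller does something sensible with the degraded result. One option is to hand it back to the model as an ordinary tool message so the model can answer without that data. The sketch below assumes a chat-style message with `role` and `content` keys, which is common but not universal; `to_tool_message` and the payload field names are hypothetical, so adapt them to your client.

```python
import json
from typing import Any


def to_tool_message(tool_name: str, result: Any) -> dict[str, str]:
    """Wrap a tool result, or a degraded-mode error dict, as a chat message.

    The {"role": "tool", "content": ...} shape is an assumption about the
    chat API in use, not something the registry requires.
    """
    if isinstance(result, dict) and result.get("degraded"):
        payload = {
            "status": "unavailable",
            "tool": tool_name,
            "reason": result.get("error", "unknown error"),
            "instruction": "Answer without this tool and say the data is unavailable.",
        }
    else:
        payload = {"status": "ok", "tool": tool_name, "result": result}

    # default=str keeps non-JSON types (dates, paths) from raising here
    return {"role": "tool", "content": json.dumps(payload, default=str)}
```

Telling the model explicitly that the tool is unavailable reduces the chance that it invents a plausible-looking result in place of the missing one.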

Logging for Debugging

Every recovery action should emit structured logs for post-mortem analysis:

import structlog

logger = structlog.get_logger()

def log_recovery_event(
    tool_name: str,
    error: Exception,
    strategy: RecoveryStrategy,
    context: dict
):
    logger.warning(
        "tool_recovery_triggered",
        tool=tool_name,
        error_type=type(error).__name__,
        error_message=str(error),
        strategy=strategy.value,
        **context
    )
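Where structlog is not available, similar output can be approximated with the standard library. This sketch swaps in stdlib `logging` plus `json` and is not structlog's API; `build_recovery_event` and `log_recovery_event_stdlib` are hypothetical names, and the strategy is passed as a plain string (for example `strategy.value`).

```python
import json
import logging

logger = logging.getLogger("tool_recovery")


def build_recovery_event(
    tool_name: str,
    error: Exception,
    strategy: str,
    context: dict,
) -> dict:
    """Assemble the same fields as the structlog version into one dict."""
    return {
        # Context goes first so the core fields below win on key collisions
        **context,
        "event": "tool_recovery_triggered",
        "tool": tool_name,
        "error_type": type(error).__name__,
        "error_message": str(error),
        "strategy": strategy,
    }


def log_recovery_event_stdlib(
    tool_name: str,
    error: Exception,
    strategy: str,
    context: dict,
) -> None:
    """Emit the event as a single JSON line at WARNING level."""
    event = build_recovery_event(tool_name, error, strategy, context)
    logger.warning(json.dumps(event, default=str))
```

Building the event dict separately from emitting it makes the log payload easy to assert on in unit tests.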

EXERCISE

Implement a retry mechanism with exponential backoff for your tool executor, logging each attempt with the error type and backoff duration. Test it by temporarily disabling a tool and observing the recovery sequence.