10. Error Recovery
When a function call fails in production, the difference between a resilient system and a broken one comes down to error recovery strategy. Function calls can fail for many reasons: tool execution errors, malformed responses, model-generated invalid parameters, network timeouts, or resource exhaustion on the local model server.
Understanding Error Types
Function calling errors fall into three broad categories:
Model-side failures occur when the model generates malformed JSON, omits required parameters, or supplies parameters that fail validation. For example, the model may pass "next thursday" to a date-parsing function that expects an ISO 8601 string.
Tool-side failures occur when the tool executes but encounters an error: a file is not found, a database connection is refused, an API rate limit is hit, or a timeout is exceeded.
Infrastructure failures occur at the system level: the Ollama server crashes, GPU memory is exhausted, the model fails to load, or a network partition separates components.
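Model-side failures are usually the cheapest to catch, because the arguments can be checked before any tool runs. The following is a minimal sketch of that idea, assuming a hypothetical date tool with a single required `date` parameter; the function name `validate_date_arguments` and the parameter name are illustrative and not part of any library.

```python
import json
from datetime import date


def validate_date_arguments(raw_arguments: str) -> dict:
    """Parse and validate model-generated arguments for a hypothetical date tool.

    Raises ValueError with a readable message that the caller could
    send back to the model so it can correct its next attempt.
    """
    try:
        arguments = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Arguments are not valid JSON: {exc.msg}") from exc

    if not isinstance(arguments, dict):
        raise ValueError("Arguments must be a JSON object")

    if "date" not in arguments:
        raise ValueError("Missing required parameter: date")

    try:
        # date.fromisoformat accepts YYYY-MM-DD strings
        date.fromisoformat(arguments["date"])
    except (TypeError, ValueError) as exc:
        raise ValueError(
            "Parameter 'date' must be an ISO 8601 date (YYYY-MM-DD), "
            f"got {arguments['date']!r}"
        ) from exc

    return arguments
```

Rejecting "next thursday" at this stage keeps the failure in the model-side category, where the usual remedy is to return the message to the model rather than to retry the tool.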
Recovery Patterns
from enum import Enum
from typing import Any, Optional
import json


class RecoveryStrategy(Enum):
    RETRY = "retry"
    FALLBACK = "fallback"
    SKIP = "skip"
    ABORT = "abort"


def handle_tool_error(
    error: Exception,
    tool_name: str,
    attempt: int,
    max_retries: int = 3
) -> tuple[RecoveryStrategy, Optional[str]]:
    """Determine recovery strategy based on error type."""
    # Non-retryable errors
    if isinstance(error, json.JSONDecodeError):
        return RecoveryStrategy.SKIP, f"Malformed response from {tool_name}"
    if isinstance(error, FileNotFoundError):
        return RecoveryStrategy.ABORT, f"Required file missing for {tool_name}"

    # Retryable errors (attempt counts retries already made, starting at 0)
    if attempt < max_retries:
        if isinstance(error, TimeoutError):
            return RecoveryStrategy.RETRY, f"Timeout on attempt {attempt + 1}"
        if isinstance(error, ConnectionError):
            return RecoveryStrategy.RETRY, f"Connection issue on attempt {attempt + 1}"

    # Retries exhausted, or an error type with no specific rule above
    return RecoveryStrategy.FALLBACK, f"Retries exhausted or error not retryable for {tool_name}"
Graceful Degradation
When a tool fails and no fallback exists, the system should continue operating with degraded capability rather than crashing:
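from typing import Any, Callable


class ToolExecutionError(Exception):
    """Raised when a tool fails and recovery is not possible."""


class ToolRegistry:
    def __init__(self):
        self.tools: dict[str, Callable[..., Any]] = {}
        self.fallbacks: dict[str, Callable[..., Any]] = {}
        self.degraded_mode: bool = False

    def execute_with_recovery(
        self,
        tool_name: str,
        parameters: dict[str, Any],
        max_retries: int = 3
    ) -> Any:
        attempt = 0  # number of retries already made
        while True:
            try:
                return self.tools[tool_name](**parameters)
            except Exception as e:
                strategy, message = handle_tool_error(
                    e, tool_name, attempt, max_retries
                )
                if strategy == RecoveryStrategy.RETRY:
                    attempt += 1
                    continue
                if strategy == RecoveryStrategy.FALLBACK:
                    if tool_name in self.fallbacks:
                        return self.fallbacks[tool_name](**parameters)
                    # No fallback registered: degrade instead of crashing
                    strategy = RecoveryStrategy.SKIP
                if strategy == RecoveryStrategy.SKIP:
                    self.degraded_mode = True
                    return {"error": message, "degraded": True}
                # ABORT: recovery is not possible
                raise ToolExecutionError(message) from e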
Logging for Debugging
Every recovery action should emit structured logs for post-mortem analysis:
import structlog

logger = structlog.get_logger()


def log_recovery_event(
    tool_name: str,
    error: Exception,
    strategy: RecoveryStrategy,
    context: dict
) -> None:
    logger.warning(
        "tool_recovery_triggered",
        tool=tool_name,
        error_type=type(error).__name__,
        error_message=str(error),
        strategy=strategy.value,
        **context
    )
Implement a retry mechanism with exponential backoff for your tool executor, logging each attempt with the error type and backoff duration. Test it by temporarily disabling a tool and observing the recovery sequence.
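As a possible starting point for this exercise, here is a minimal sketch of retry with exponential backoff. It makes several assumptions: it uses the standard-library `logging` module instead of structlog so that it runs without extra dependencies, it treats only `TimeoutError` and `ConnectionError` as retryable, and the names `backoff_delays` and `call_with_backoff` are hypothetical. The `sleep` parameter exists so the waiting can be replaced during tests.

```python
import logging
import time
from typing import Any, Callable

logger = logging.getLogger("tool_executor")


def backoff_delays(
    max_retries: int, base_delay: float = 0.5, factor: float = 2.0
) -> list[float]:
    """Delay before each retry: base_delay, base_delay * factor, and so on."""
    return [base_delay * factor ** i for i in range(max_retries)]


def call_with_backoff(
    tool: Callable[..., Any],
    parameters: dict[str, Any],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call tool(**parameters), retrying transient errors with growing delays."""
    delays = backoff_delays(max_retries, base_delay)
    for attempt in range(max_retries + 1):
        try:
            return tool(**parameters)
        except (TimeoutError, ConnectionError) as error:
            if attempt == max_retries:
                raise  # retries exhausted; the caller chooses a fallback
            delay = delays[attempt]
            logger.warning(
                "attempt %d failed with %s; retrying in %.2fs",
                attempt + 1, type(error).__name__, delay,
            )
            sleep(delay)
```

To observe the recovery sequence, one option is to wrap a tool so that it raises `TimeoutError` for its first few calls and pass `sleep=delays_seen.append` with an empty list, which records the backoff durations without actually waiting.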