Error Recovery — Introduction to AI Agents (Chapter 14)

Agents fail. Tools return errors, models generate malformed tool arguments, loops run too long, and API keys expire. Error recovery is what separates production agents from demos.

Error categories

Tool errors: Network timeouts, invalid API keys, malformed responses
Parsing errors: Malformed JSON in model output, wrong argument types
Loop errors: Infinite loops, excessive token usage, endless retries
Logic errors: Model calls the wrong tool or ignores important results
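
One way to make these categories actionable is to map exceptions onto them so the agent can choose a recovery path. The sketch below is illustrative only: `ErrorCategory`, `LoopLimitExceeded`, and `classify_error` are hypothetical names, and the mapping from exception types to categories is an assumption. Logic errors rarely raise exceptions, so they are listed in the enum but have to be detected by inspecting the transcript instead.

```python
import json
from enum import Enum


class ErrorCategory(Enum):
    TOOL = "tool"
    PARSING = "parsing"
    LOOP = "loop"
    LOGIC = "logic"  # Not detectable from exceptions; needs transcript checks


class LoopLimitExceeded(RuntimeError):
    """Hypothetical exception an agent loop could raise on a turn or token budget."""


def classify_error(exc: Exception) -> ErrorCategory:
    """Map an exception to one of the categories listed above."""
    if isinstance(exc, LoopLimitExceeded):
        return ErrorCategory.LOOP
    if isinstance(exc, (json.JSONDecodeError, TypeError)):
        return ErrorCategory.PARSING
    # Anything else raised while running a tool is treated as a tool error
    return ErrorCategory.TOOL


try:
    json.loads('{"city": "Paris"')  # Truncated JSON, as a model might emit
except json.JSONDecodeError as exc:
    print(classify_error(exc))  # ErrorCategory.PARSING

print(classify_error(TimeoutError("tool timed out")))  # ErrorCategory.TOOL
```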

Recovery strategies

import time

# `Tool` is assumed to be the agent's tool class, exposing invoke(**kwargs)
def safe_tool_call(tool: "Tool", max_retries: int = 2, **kwargs) -> str:
    """Call a tool, retrying with exponential backoff; return an error string on failure."""
    failure = "Max retries exceeded"  # Only returned if max_retries is negative
    for attempt in range(max_retries + 1):
        try:
            result = tool.invoke(**kwargs)
            # Tools are assumed to report soft failures as text containing "Error:"
            if "Error:" not in str(result):
                return result
            failure = f"Failed after {max_retries} retries: {result}"
        except Exception as e:
            failure = f"Unrecoverable error: {e}"

        if attempt < max_retries:
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...

    return failure
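
To see what the `2 ** attempt` backoff costs in the worst case, it helps to list the delays. The sketch below uses a hypothetical helper, `backoff_delays`, that reproduces the schedule for the case where every attempt raises an exception. The `cap` parameter is an addition for illustration and is not part of the wrapper above.

```python
from typing import List, Optional


def backoff_delays(max_retries: int, base: float = 2.0,
                   cap: Optional[float] = None) -> List[float]:
    """Seconds slept before each retry when every attempt fails."""
    delays = []
    for attempt in range(max_retries):  # No sleep follows the final attempt
        delay = base ** attempt
        if cap is not None:
            delay = min(delay, cap)  # Keep long retry chains bounded
        delays.append(delay)
    return delays


print(backoff_delays(2))           # [1.0, 2.0]
print(sum(backoff_delays(5)))      # 31.0
print(backoff_delays(5, cap=8.0))  # [1.0, 2.0, 4.0, 8.0, 8.0]
```

With the default `max_retries=2`, a tool that keeps raising adds about 3 seconds of waiting before the wrapper gives up. At five retries the wait grows to 31 seconds, which is why a cap is often worth considering.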

Parrot and re-plan recovery

When the model generates a tool call that cannot be executed, rather than dropping it silently, feed back a structured error and ask the model to re-plan:

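def resilient_agent_loop(task: str, tools: list, model):
    # Lookup table for dispatching calls; assumes each tool exposes a `name` attribute
    tools_by_name = {t.name: t for t in tools}
    messages = [{"role": "user", "content": task}]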
    
    for turn in range(10):
        response = model.chat(messages, tools=[t.to_openai_schema() for t in tools])
        
        if response.message.tool_calls:
            for call in response.message.tool_calls:
                tool = tools_by_name.get(call.function.name)
                if not tool:
                    messages.append({
                        "role": "assistant",
                        "content": f"I intended to call '{call.function.name}'"
                    })
                    messages.append({
                        "role": "user",
                        "content": f"That tool is not available. Available tools: {list(tools_by_name.keys())}. Please re-think your approach and call a different tool."
                    })
                    continue
                
                result = safe_tool_call(tool, **call.function.arguments)
                messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
        
        else:
            # Model responded without a tool call: check whether the task is done.
            # is_task_complete is assumed to be a helper defined elsewhere.
            if is_task_complete(task, response.message.content):
                return response.message.content
            messages.append({"role": "assistant", "content": response.message.content})
            messages.append({"role": "user", "content": "Please continue with the task."})
    
    return "Max turns reached"
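
The same feedback idea extends to malformed arguments. The sketch below shows one possible shape for the structured error: a hypothetical `validate_tool_call` helper that checks the tool name and parses the arguments before anything is executed. It assumes the arguments arrive as a raw JSON string, which is not true of every client library, and the error field names are made up for illustration.

```python
import json
from typing import Any, Dict, List, Optional, Tuple


def validate_tool_call(
    name: str, raw_arguments: str, available: List[str]
) -> Tuple[Optional[Dict[str, Any]], Optional[str]]:
    """Return (arguments, None) if the call can run, else (None, error_json)."""
    if name not in available:
        error = {"error": "unknown_tool", "requested": name, "available": available}
        return None, json.dumps(error)
    try:
        arguments = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        error = {"error": "malformed_arguments", "tool": name, "detail": exc.msg}
        return None, json.dumps(error)
    if not isinstance(arguments, dict):
        # Valid JSON, but keyword arguments require an object
        error = {"error": "arguments_not_object", "tool": name}
        return None, json.dumps(error)
    return arguments, None


args, error = validate_tool_call("search", '{"query": "weather"}', ["search"])
print(args, error)  # {'query': 'weather'} None

args, error = validate_tool_call("serach", "{}", ["search"])
print(error)  # {"error": "unknown_tool", "requested": "serach", "available": ["search"]}
```

The error string can be appended to the message history in place of a tool result, so the model sees exactly what went wrong and which tools it can call instead.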

Circuit breakers

A circuit breaker halts execution after too many consecutive failures, then allows calls again once a cooldown period has elapsed:

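import time


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures; resets after `timeout_seconds`."""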
    def __init__(self, failure_threshold: int = 3, timeout_seconds: int = 60):
        self.failures = 0
        self.threshold = failure_threshold
        self.timeout = timeout_seconds
        self.last_failure_time = 0
    
    def record_failure(self):
        self.failures += 1
        self.last_failure_time = time.time()
    
    def record_success(self):
        self.failures = 0
    
    def is_open(self) -> bool:
        if self.failures >= self.threshold:
            elapsed = time.time() - self.last_failure_time
            if elapsed < self.timeout:
                return True
            else:
                self.failures = 0  # Reset after timeout
        return False
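
The breaker's state transitions are easier to follow with a clock that can be advanced by hand. The sketch below is a variant of the class above, not a replacement. The `ClockedCircuitBreaker` name and the injectable `clock` parameter are assumptions added for testability; the logic is otherwise the same.

```python
import time
from typing import Callable


class ClockedCircuitBreaker:
    """Variant of the breaker above that reads time from an injectable clock."""

    def __init__(self, failure_threshold: int = 3, timeout_seconds: float = 60,
                 clock: Callable[[], float] = time.monotonic):
        self.failures = 0
        self.threshold = failure_threshold
        self.timeout = timeout_seconds
        self.clock = clock
        self.last_failure_time = 0.0

    def record_failure(self) -> None:
        self.failures += 1
        self.last_failure_time = self.clock()

    def record_success(self) -> None:
        self.failures = 0

    def is_open(self) -> bool:
        if self.failures >= self.threshold:
            if self.clock() - self.last_failure_time < self.timeout:
                return True
            self.failures = 0  # Cooldown elapsed: allow calls again
        return False


now = [0.0]  # Fake clock so the trace runs instantly
breaker = ClockedCircuitBreaker(failure_threshold=3, timeout_seconds=60,
                                clock=lambda: now[0])

for _ in range(3):
    breaker.record_failure()
print(breaker.is_open())  # True: three failures and no time has passed

now[0] = 61.0
print(breaker.is_open())  # False: the 60-second cooldown has elapsed
print(breaker.failures)   # 0
```

In an agent loop, the natural place for the check is just before each tool call: skip the call and report the outage to the model while `is_open()` returns True, and call `record_success()` or `record_failure()` after each attempt.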