14. Error Recovery

Chapter 14 of 16 · 20 min

Agents fail. Tools return errors, models generate malformed tool arguments, loops run too long, and API keys expire. Error recovery is what separates production agents from demos.

Error categories

  1. Tool errors: Network timeouts, invalid API keys, malformed responses
  2. Parsing errors: Model outputs malformed JSON, wrong argument types
  3. Loop errors: Infinite loop, excessive token usage, stuck in retry
  4. Logic errors: Model calls wrong tool, ignores important results
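
One way to make these categories actionable is to tag each failure as it happens, so the agent can pick a matching recovery strategy. The sketch below is illustrative only: `ErrorCategory`, `LoopLimitExceeded`, and `classify_error` are hypothetical names, and the mapping from exception types to categories is an assumption, not a fixed rule. Logic errors rarely raise exceptions at all, so here they are simply the fallback bucket.

```python
from enum import Enum


class ErrorCategory(Enum):
    TOOL = "tool"
    PARSING = "parsing"
    LOOP = "loop"
    LOGIC = "logic"


class LoopLimitExceeded(Exception):
    """Hypothetical exception raised when an agent loop hits its turn or token budget."""


def classify_error(exc: Exception) -> ErrorCategory:
    """Map an exception to one of the four categories (illustrative mapping only)."""
    if isinstance(exc, LoopLimitExceeded):
        return ErrorCategory.LOOP
    # json.JSONDecodeError subclasses ValueError, so malformed JSON lands here too.
    if isinstance(exc, (ValueError, TypeError)):
        return ErrorCategory.PARSING
    # TimeoutError and ConnectionError are both OSError subclasses.
    if isinstance(exc, OSError):
        return ErrorCategory.TOOL
    # Assumption: anything else is treated as a logic problem that needs review.
    return ErrorCategory.LOGIC
```

Tool and parsing errors are usually worth retrying or feeding back to the model; loop and logic errors usually call for stopping or escalating.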

Recovery strategies

import time

def safe_tool_call(tool: Tool, max_retries: int = 2, **kwargs) -> str:
    """Call a tool with automatic retries; return an error string instead of raising."""
    failure = "Max retries exceeded"
    for attempt in range(max_retries + 1):
        try:
            result = tool.invoke(**kwargs)
            # Assumes tools report failures as strings containing "Error:"
            if "Error:" not in str(result):
                return result
            failure = f"Failed after {max_retries} retries: {result}"
        except Exception as e:
            failure = f"Unrecoverable error: {e}"

        # Back off before the next attempt, for error results and exceptions alike
        if attempt < max_retries:
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...

    return failure
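
The wrapper above sleeps for `2 ** attempt` seconds between attempts. In practice it can help to cap the delay and add a little random jitter so that many agents do not retry in lockstep. The following is a minimal sketch of such a schedule; `backoff_delays` and its parameters (`base`, `cap`, `jitter`, `rng`) are hypothetical names introduced here for illustration.

```python
import random
from typing import List, Optional


def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 30.0,
                   jitter: float = 0.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return the sleep duration before each retry: base * 2**attempt, capped."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * 2 ** attempt)
        # Optional jitter spreads out retries from many callers hitting the same API.
        delay += rng.uniform(0, jitter)
        delays.append(delay)
    return delays


print(backoff_delays(4))  # [1.0, 2.0, 4.0, 8.0]
```

With the defaults this reproduces the 1s, 2s, 4s pattern; passing `jitter=0.5` would add up to half a second of random delay to each step.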

Parrot and re-plan recovery

When the model generates a tool call that cannot be executed, rather than dropping it silently, feed back a structured error and ask the model to re-plan:

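def resilient_agent_loop(task: str, tools: list, model):
    # Index tools by name for lookup; assumes each tool exposes a `name` attribute
    tools_by_name = {t.name: t for t in tools}
    messages = [{"role": "user", "content": task}]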
    
    for turn in range(10):
        response = model.chat(messages, tools=[t.to_openai_schema() for t in tools])
        
        if response.message.tool_calls:
            for call in response.message.tool_calls:
                tool = tools_by_name.get(call.function.name)
                if not tool:
                    messages.append({
                        "role": "assistant",
                        "content": f"I intended to call '{call.function.name}'"
                    })
                    messages.append({
                        "role": "user",
                        "content": f"That tool is not available. Available tools: {list(tools_by_name.keys())}. Please re-think your approach and call a different tool."
                    })
                    continue
                
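                # Assumes call.function.arguments is already a dict; if your client
                # returns a JSON string, parse and validate it before unpacking
                result = safe_tool_call(tool, **call.function.arguments)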
                messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
        
        else:
            # Model responded without tool call - check if task is done
            if is_task_complete(task, response.message.content):
                return response.message.content
            messages.append({"role": "assistant", "content": response.message.content})
            messages.append({"role": "user", "content": "Please continue with the task."})
    
    return "Max turns reached"
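
The same feed-back idea applies to parsing errors: when the model emits malformed arguments, a structured error message gives it something concrete to fix. Below is a rough sketch, assuming arguments may arrive either as a JSON string or as a dict. `parse_tool_arguments` and its `required` parameter are hypothetical helpers, not part of any client library.

```python
import json
from typing import Any, Dict, Optional, Tuple


def parse_tool_arguments(
    raw: Any, required: Tuple[str, ...] = ()
) -> Tuple[Optional[Dict[str, Any]], Optional[str]]:
    """Return (arguments, None) on success or (None, error_message) on failure."""
    if isinstance(raw, str):
        try:
            raw = json.loads(raw)
        except json.JSONDecodeError as e:
            return None, f"Error: arguments are not valid JSON ({e.msg} at position {e.pos})."
    if not isinstance(raw, dict):
        return None, f"Error: arguments must be a JSON object, got {type(raw).__name__}."
    # Report every missing field at once so the model can fix them in one turn.
    missing = [name for name in required if name not in raw]
    if missing:
        return None, f"Error: missing required arguments: {missing}."
    return raw, None
```

If the second element is not `None`, it can be appended to the conversation as the tool result so the model can correct its call on the next turn.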

Circuit breakers

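Implement circuit breakers to halt execution after repeated failures. The breaker below opens after a set number of consecutive failures and stays open until a timeout has elapsed: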

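import time

class CircuitBreaker:
    """Blocks calls after `failure_threshold` consecutive failures until `timeout_seconds` pass."""
    def __init__(self, failure_threshold: int = 3, timeout_seconds: int = 60):
        self.failures = 0
        self.threshold = failure_threshold
        self.timeout = timeout_seconds
        self.last_failure_time = 0.0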
    
    def record_failure(self):
        self.failures += 1
        self.last_failure_time = time.time()
    
    def record_success(self):
        self.failures = 0
    
    def is_open(self) -> bool:
        if self.failures >= self.threshold:
            elapsed = time.time() - self.last_failure_time
            if elapsed < self.timeout:
                return True
            else:
                self.failures = 0  # Reset after timeout
        return False
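
A breaker only helps if every call goes through it. The following sketch shows one possible way to guard a call; `call_with_breaker` is a hypothetical helper, and `breaker` is assumed to be any object exposing `is_open`, `record_failure`, and `record_success`, such as the `CircuitBreaker` above.

```python
from typing import Any, Callable


def call_with_breaker(breaker: Any, fn: Callable[..., str],
                      *args: Any, **kwargs: Any) -> str:
    """Run fn unless the breaker is open, and record the outcome."""
    if breaker.is_open():
        # Fail fast instead of hammering a dependency that is already struggling.
        return "Error: circuit open, call skipped"
    try:
        result = fn(*args, **kwargs)
    except Exception as e:
        breaker.record_failure()
        return f"Error: {e}"
    breaker.record_success()
    return result
```

Using one breaker per tool, rather than one for the whole agent, keeps a single failing dependency from blocking tools that still work.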

EXERCISE

Introduce deliberate errors into the calculator tool (divide by zero, invalid schema, random timeouts). Verify that the resilient agent loop recovers correctly, retries the right number of times, and returns a graceful error message instead of crashing.
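
As a possible starting point, here is a sketch of a fault-injecting division function. `flaky_divide`, `timeout_rate`, and `rng` are hypothetical names, and the function stands in for whatever calculator tool you built earlier. Passing a seeded `random.Random` makes the injected timeouts reproducible in tests.

```python
import random
from typing import Optional


def flaky_divide(a: float, b: float, timeout_rate: float = 0.3,
                 rng: Optional[random.Random] = None) -> str:
    """Hypothetical calculator operation with injected faults for testing recovery."""
    rng = rng or random.Random()
    if rng.random() < timeout_rate:
        # Simulated network timeout, raised as an exception.
        raise TimeoutError("simulated timeout")
    if b == 0:
        # Returned as a string, matching the "Error:" convention used in this chapter.
        return "Error: division by zero"
    return str(a / b)
```

Note that a timeout may succeed on retry while division by zero never will, so it is worth checking how your loop treats each case.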