How to set up agent error recovery and retry logic
What this does
Robust agent systems require failure recovery beyond basic exception handling. This guide implements exponential backoff retry, a circuit breaker pattern, and dead-letter handling to make agents resilient to transient failures.
Steps
Step 1: Install dependencies
pip install tenacity
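tenacity provides well-tested retry decorators, but it is optional for this guide: the code in the following steps implements retry by hand using only the standard library (time, random, dataclasses, typing, and collections).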
Step 2: Define retry configuration and circuit breaker
import time
import random
from typing import Callable, Any
from dataclasses import dataclass


@dataclass
class RetryConfig:
    """Exponential backoff configuration."""
    max_attempts: int = 3
    base_delay: float = 1.0
    max_delay: float = 30.0
    jitter: float = 0.5

    def get_delay(self, attempt: int) -> float:
        """Calculate delay for a given attempt number."""
        delay = min(self.base_delay * (2 ** attempt), self.max_delay)
        jitter_range = delay * self.jitter
        return delay + random.uniform(-jitter_range, jitter_range)


class CircuitBreaker:
    """Circuit breaker to prevent cascading failures."""

    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.last_failure_time = None
        self.state = "closed"  # closed, open, half-open

    def record_failure(self):
        self.failures += 1
        self.last_failure_time = time.time()
        if self.failures >= self.failure_threshold:
            self.state = "open"
            print(f"[CircuitBreaker] Opened after {self.failures} failures")

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def is_available(self) -> bool:
        if self.state == "closed":
            return True
        if self.state == "open":
            if time.time() - self.last_failure_time >= self.recovery_timeout:
                self.state = "half-open"
                return True
            return False
        # half-open: allow one test request
        return True
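To make the backoff arithmetic concrete, the sketch below restates the formula from `RetryConfig.get_delay` as a standalone function and prints the resulting schedule. The name `backoff_delay` is hypothetical and exists only for this illustration. One detail worth noticing: jitter is applied after the cap, so a jittered delay can exceed `max_delay` by up to `jitter * max_delay`.

```python
import random


def backoff_delay(attempt, base_delay=1.0, max_delay=30.0, jitter=0.5):
    """Same formula as RetryConfig.get_delay, written as a plain function."""
    delay = min(base_delay * (2 ** attempt), max_delay)
    jitter_range = delay * jitter
    return delay + random.uniform(-jitter_range, jitter_range)


# With jitter disabled the schedule is deterministic: it doubles, then hits the cap.
schedule = [backoff_delay(n, jitter=0.0) for n in range(7)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]

# With the default jitter of 0.5, each delay lies within +/-50% of the capped value,
# so attempt 5 (capped at 30s) can sleep anywhere between 15s and 45s.
samples = [backoff_delay(5) for _ in range(1000)]
print(min(samples) >= 15.0, max(samples) <= 45.0)  # True True
```

If `max_delay` must be a hard upper bound, one option is to clamp the final value with `min(..., max_delay)` after adding jitter.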
Step 3: Implement the retry decorator
def with_retry(agent_fn: Callable, config: RetryConfig, breaker: CircuitBreaker):
    """Decorator applying retry and circuit breaker logic."""
    def wrapper(*args, **kwargs):
        if not breaker.is_available():
            raise RuntimeError("[CircuitBreaker] Circuit is open. Request rejected.")
        last_error = None
        for attempt in range(config.max_attempts):
            try:
                result = agent_fn(*args, **kwargs)
                breaker.record_success()
                return result
            except Exception as e:
                last_error = e
                breaker.record_failure()
                if attempt < config.max_attempts - 1:
                    delay = config.get_delay(attempt)
                    print(f"[Retry] Attempt {attempt + 1} failed: {e}. Retrying in {delay:.2f}s")
                    time.sleep(delay)
                else:
                    print(f"[Retry] All {config.max_attempts} attempts exhausted.")
        raise last_error
    return wrapper
Step 4: Apply retry logic to an agent function
import random


def call_llm(prompt: str) -> str:
    """Simulated LLM call that fails intermittently."""
    if random.random() < 0.3:
        raise ConnectionError("Simulated transient network failure")
    return f"Response to: {prompt}"


config = RetryConfig(max_attempts=3, base_delay=0.5, max_delay=5.0)
breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=10.0)
retrying_llm_call = with_retry(call_llm, config, breaker)

# Test the agent
for i in range(10):
    try:
        result = retrying_llm_call(f"task_{i}")
        print(f"Success: {result}")
    except RuntimeError as e:
        print(f"Circuit open: {e}")
        time.sleep(2)
    except ConnectionError as e:
        print(f"Retry failed: {e}")
Step 5: Implement dead-letter queue for persistent failures
from collections import deque


class DeadLetterQueue:
    """Queue for tasks that fail after all retry attempts."""

    def __init__(self, max_size: int = 100):
        self.queue = deque(maxlen=max_size)

    def add(self, task: Any, error: Exception):
        self.queue.append({
            "task": task,
            "error": str(error),
            "timestamp": time.time()
        })
        print(f"[DLQ] Task added. Queue size: {len(self.queue)}")

    def get_failed(self) -> list:
        return list(self.queue)


dlq = DeadLetterQueue()


def with_dlq(agent_fn: Callable, config: RetryConfig, breaker: CircuitBreaker, dlq: DeadLetterQueue):
    """Extended decorator that routes persistently failed tasks to DLQ."""
    def wrapper(*args, **kwargs):
        try:
            return with_retry(agent_fn, config, breaker)(*args, **kwargs)
        except Exception as e:
            dlq.add({"args": args, "kwargs": kwargs}, e)
            return None
    return wrapper
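Step 5 defines the queue but does not show it in use, so here is a small self-contained sketch of the same idea together with a depth check that a monitoring job could call. To stay runnable on its own it uses a plain `deque` holding the same record shape as `DeadLetterQueue.add` and skips the backoff sleeps. The names `always_fails`, `run_with_dlq`, and `dlq_needs_alert`, and the alert threshold of 2, are hypothetical.

```python
import time
from collections import deque

dead_letters = deque(maxlen=100)  # same record shape as DeadLetterQueue.add


def always_fails(prompt: str) -> str:
    """Hypothetical agent function that never succeeds."""
    raise ConnectionError(f"cannot reach model for {prompt!r}")


def run_with_dlq(fn, *args, max_attempts: int = 3, **kwargs):
    """Try fn up to max_attempts times; on exhaustion, record the task and return None."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return fn(*args, **kwargs)
        except Exception as e:  # broad on purpose for this demo
            last_error = e
    dead_letters.append({
        "task": {"args": args, "kwargs": kwargs},
        "error": str(last_error),
        "timestamp": time.time(),
    })
    return None


def dlq_needs_alert(queue, threshold: int) -> bool:
    """Return True when the queue depth exceeds the alert threshold."""
    return len(queue) > threshold


results = [run_with_dlq(always_fails, f"task_{i}") for i in range(3)]
print(results)                            # [None, None, None]
print(len(dead_letters))                  # 3
print(dead_letters[0]["error"])           # cannot reach model for 'task_0'
print(dlq_needs_alert(dead_letters, 2))   # True
```

Because a dead-lettered task returns `None`, callers need to treat `None` as "failed and queued" rather than as a valid response.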
Verification
Run the test loop and verify:
- Successful calls return a response string starting with "Response to:".
- Transient failures are retried automatically with increasing delays.
- After 3 consecutive failures, the circuit breaker opens and subsequent calls raise RuntimeError with "Circuit is open".
- After the recovery timeout (10 seconds), the circuit enters half-open state and allows one test call.
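- Failed tasks appear in the DLQ after exhausting all retries, provided the function is wrapped with with_dlq from Step 5 (the Step 4 test loop uses with_retry only).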
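The random test loop does not hit every state on every run, so the circuit breaker checks above can also be scripted deterministically. The sketch below uses a condensed copy of the Step 2 breaker (repeated here only so the snippet runs on its own, with the print call removed) and a 0.2 second recovery timeout, an assumption chosen to keep the run short.

```python
import time


class CircuitBreaker:
    """Condensed copy of the Step 2 breaker, without the print call."""

    def __init__(self, failure_threshold=3, recovery_timeout=60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.last_failure_time = None
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        self.last_failure_time = time.time()
        if self.failures >= self.failure_threshold:
            self.state = "open"

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def is_available(self):
        if self.state == "open":
            if time.time() - self.last_failure_time >= self.recovery_timeout:
                self.state = "half-open"
                return True
            return False
        return True  # closed or half-open


breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=0.2)

# Two failures keep the circuit closed; the third opens it.
breaker.record_failure()
breaker.record_failure()
assert breaker.state == "closed" and breaker.is_available()
breaker.record_failure()
assert breaker.state == "open" and not breaker.is_available()

# After the recovery timeout, the next availability check moves to half-open.
time.sleep(0.25)
assert breaker.is_available() and breaker.state == "half-open"

# A failed probe re-opens the circuit; a successful one closes it and resets the count.
breaker.record_failure()
assert breaker.state == "open"
time.sleep(0.25)
assert breaker.is_available()
breaker.record_success()
assert breaker.state == "closed" and breaker.failures == 0
print("circuit breaker transitions verified")
```

Note that the breaker counts failed attempts, not failed calls, so with `max_attempts=3` and `failure_threshold=3` a single call that exhausts its retries is enough to open the circuit.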
Common failures
- Retrying non-transient errors - Retrying a 400 Bad Request response wastes API quota and delays discovery of the real problem. Inspect the exception type and only retry on ConnectionError, Timeout, or 5xx HTTP codes.
- Circuit breaker opening too aggressively - A low failure_threshold combined with a short recovery_timeout can cause the circuit to oscillate between open and closed states during partial outages. Set the threshold to at least 3-5 failures and the timeout to 60+ seconds for most LLM APIs.
- Missing DLQ monitoring - Dead-letter queues that grow without inspection create silent failures in production. Add a scheduled task or webhook that alerts when DLQ depth exceeds a threshold.
- Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
- Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
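One way to act on the advice about non-transient errors is to classify each exception before deciding to retry. The sketch below is an illustration under stated assumptions: `HTTPStatusError`, `is_retryable`, and `call_with_selective_retry` are hypothetical names, and a real HTTP or LLM client will expose status codes through its own exception types. Backoff sleeps are left out to keep it short.

```python
class HTTPStatusError(Exception):
    """Hypothetical exception carrying an HTTP status code."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def is_retryable(error: Exception) -> bool:
    """Treat network errors, timeouts, and 5xx responses as transient."""
    if isinstance(error, (ConnectionError, TimeoutError)):
        return True
    if isinstance(error, HTTPStatusError):
        return 500 <= error.status_code < 600
    return False


def call_with_selective_retry(fn, max_attempts: int = 3):
    """Retry fn only on transient errors; re-raise everything else immediately."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as e:
            if not is_retryable(e) or attempt == max_attempts - 1:
                raise


attempts = {"bad_request": 0, "server_error": 0}


def bad_request():
    attempts["bad_request"] += 1
    raise HTTPStatusError(400)


def flaky_server():
    attempts["server_error"] += 1
    if attempts["server_error"] < 3:
        raise HTTPStatusError(503)
    return "ok"


try:
    call_with_selective_retry(bad_request)
except HTTPStatusError as e:
    print(e, "after", attempts["bad_request"], "attempt")  # HTTP 400 after 1 attempt

print(call_with_selective_retry(flaky_server), "after", attempts["server_error"], "attempts")  # ok after 3 attempts
```

The same predicate could be passed into `with_retry` from Step 3 so that its `except Exception` branch re-raises non-transient errors without recording a breaker failure or sleeping.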
Related guides
- Implement Parallel Agent Execution - Combine retry logic with parallel fan-out so that failed agent tasks are rescheduled alongside other agents without blocking the entire pipeline.
- Implement Streaming Responses in AI APIs - Stream responses with retry logic so partial outputs are preserved even when a connection drops mid-stream.