What this does

Robust agent systems require failure recovery beyond basic exception handling. This guide implements exponential backoff retry, a circuit breaker pattern, and dead-letter handling to make agents resilient to transient failures.

Steps

Step 1: Install dependencies

pip install tenacity

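tenacity provides ready-made retry decorators. The code in this guide implements the same ideas by hand using only the standard library (time, random, dataclasses, collections), so the install is optional for following along.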

Step 2: Define retry configuration and circuit breaker

import time
import random
from typing import Callable, Any
from dataclasses import dataclass

@dataclass
class RetryConfig:
    """Exponential backoff configuration."""
    max_attempts: int = 3
    base_delay: float = 1.0
    max_delay: float = 30.0
    jitter: float = 0.5

    def get_delay(self, attempt: int) -> float:
        """Calculate delay for a given attempt number."""
        delay = min(self.base_delay * (2 ** attempt), self.max_delay)
        jitter_range = delay * self.jitter
        return delay + random.uniform(-jitter_range, jitter_range)

class CircuitBreaker:
    """Circuit breaker to prevent cascading failures."""
    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.last_failure_time = None
        self.state = "closed"  # closed, open, half-open

    def record_failure(self):
        self.failures += 1
        self.last_failure_time = time.time()
        if self.failures >= self.failure_threshold:
            self.state = "open"
            print(f"[CircuitBreaker] Opened after {self.failures} failures")

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def is_available(self) -> bool:
        if self.state == "closed":
            return True
        if self.state == "open":
            if time.time() - self.last_failure_time >= self.recovery_timeout:
                self.state = "half-open"
                return True
            return False
        # half-open: allow one test request
        return True
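
To make the backoff arithmetic concrete, the sketch below recomputes the schedule that get_delay produces for the defaults above (base_delay=1.0, max_delay=30.0) with jitter left out, then shows the range that jitter=0.5 allows around a given delay. backoff_delay and jitter_bounds are illustrative helper names introduced here, not part of the guide's classes.

```python
def backoff_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 30.0) -> float:
    """Un-jittered delay: base_delay * 2**attempt, capped at max_delay."""
    return min(base_delay * (2 ** attempt), max_delay)

def jitter_bounds(delay: float, jitter: float = 0.5) -> tuple[float, float]:
    """Lowest and highest value that delay +/- (delay * jitter) can take."""
    return (delay * (1 - jitter), delay * (1 + jitter))

# Attempts are numbered from 0, matching the retry loop in this guide.
schedule = [backoff_delay(attempt) for attempt in range(6)]
print(schedule)             # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
print(jitter_bounds(30.0))  # (15.0, 45.0)
```

Because get_delay adds jitter after applying the cap, an individual sleep can exceed max_delay, by up to 50% with these defaults. If a hard ceiling matters, clamping the jittered value is one option.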

Step 3: Implement the retry decorator

def with_retry(agent_fn: Callable, config: RetryConfig, breaker: CircuitBreaker):
    """Wrap agent_fn with retry and circuit breaker logic."""
    def wrapper(*args, **kwargs):
        # Reject immediately if the circuit is open.
        if not breaker.is_available():
            raise RuntimeError("[CircuitBreaker] Circuit is open. Request rejected.")

        last_error = None
        for attempt in range(config.max_attempts):
            try:
                result = agent_fn(*args, **kwargs)
                breaker.record_success()
                return result
            except Exception as e:
                last_error = e
                breaker.record_failure()
                if attempt == config.max_attempts - 1:
                    print(f"[Retry] All {config.max_attempts} attempts exhausted.")
                elif not breaker.is_available():
                    # The circuit opened mid-loop: stop retrying instead of
                    # continuing to call a service the breaker has given up on.
                    print(f"[Retry] Attempt {attempt + 1} failed: {e}. Circuit open, giving up.")
                    break
                else:
                    delay = config.get_delay(attempt)
                    print(f"[Retry] Attempt {attempt + 1} failed: {e}. Retrying in {delay:.2f}s")
                    time.sleep(delay)

        # Re-raise the most recent error once retries stop.
        raise last_error
    return wrapper
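
with_retry wraps a function after it has been defined. Where the @decorator form is preferred, a decorator factory along the following lines is one option. This is a simplified sketch rather than a drop-in replacement: it leaves out the circuit breaker and jitter, and the names retryable and sleep_fn are assumptions introduced here. Passing sleep_fn makes the delays observable without actually waiting.

```python
import functools
import time
from typing import Any, Callable

def retryable(max_attempts: int = 3, base_delay: float = 1.0,
              sleep_fn: Callable[[float], Any] = time.sleep):
    """Decorator factory: retry the wrapped function with exponential backoff."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the last error
                    sleep_fn(base_delay * (2 ** attempt))
        return wrapper
    return decorator

# Usage: a function that fails twice, then succeeds.
calls = {"count": 0}
delays = []  # records the requested delays instead of sleeping

@retryable(max_attempts=3, base_delay=0.5, sleep_fn=delays.append)
def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient")
    return "ok"

outcome = flaky()
print(outcome)         # ok
print(delays)          # [0.5, 1.0]
print(flaky.__name__)  # flaky
```

functools.wraps matters mostly for logging and debugging: without it, every wrapped agent function would report its name as wrapper.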

Step 4: Apply retry logic to an agent function

import random

def call_llm(prompt: str) -> str:
    """Simulated LLM call that fails intermittently."""
    if random.random() < 0.3:
        raise ConnectionError("Simulated transient network failure")
    return f"Response to: {prompt}"

config = RetryConfig(max_attempts=3, base_delay=0.5, max_delay=5.0)
breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=10.0)

retrying_llm_call = with_retry(call_llm, config, breaker)

# Test the agent
for i in range(10):
    try:
        result = retrying_llm_call(f"task_{i}")
        print(f"Success: {result}")
    except RuntimeError as e:
        print(f"Circuit open: {e}")
        time.sleep(2)
    except ConnectionError as e:
        print(f"Retry failed: {e}")

Step 5: Implement dead-letter queue for persistent failures

from collections import deque

class DeadLetterQueue:
    """Queue for tasks that fail after all retry attempts."""
    def __init__(self, max_size: int = 100):
        self.queue = deque(maxlen=max_size)

    def add(self, task: Any, error: Exception):
        self.queue.append({
            "task": task,
            "error": str(error),
            "timestamp": time.time()
        })
        print(f"[DLQ] Task added. Queue size: {len(self.queue)}")

    def get_failed(self) -> list:
        return list(self.queue)

dlq = DeadLetterQueue()

def with_dlq(agent_fn: Callable, config: RetryConfig, breaker: CircuitBreaker, dlq: DeadLetterQueue):
    """Extended wrapper that routes persistently failed tasks to the DLQ."""
    # Build the retrying function once instead of on every call.
    retrying_fn = with_retry(agent_fn, config, breaker)

    def wrapper(*args, **kwargs):
        try:
            return retrying_fn(*args, **kwargs)
        except Exception as e:
            # Covers exhausted retries and open-circuit rejections alike.
            dlq.add({"args": args, "kwargs": kwargs}, e)
            return None
    return wrapper
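
with_dlq is defined above but not exercised. The sketch below shows, in isolation, what failed tasks leave behind in the queue and one way to act on them: a depth check that could feed an alert, and a replay pass that retries the stored tasks. It uses a plain deque with the same entry shape as DeadLetterQueue. The names add_dead_letter, dlq_needs_attention and replay, the threshold value, and the lambda handler are all hypothetical stand-ins.

```python
import time
from collections import deque
from typing import Callable

dead_letters: deque = deque(maxlen=100)

def add_dead_letter(task: dict, error: Exception) -> None:
    """Store a failed task using the same fields as DeadLetterQueue.add."""
    dead_letters.append({"task": task, "error": str(error), "timestamp": time.time()})

def dlq_needs_attention(threshold: int = 2) -> bool:
    """True once the queue depth reaches the (hypothetical) alert threshold."""
    return len(dead_letters) >= threshold

def replay(handler: Callable[..., str]) -> list:
    """Retry every stored task once; tasks that fail again go back on the queue."""
    results = []
    for _ in range(len(dead_letters)):
        entry = dead_letters.popleft()
        task = entry["task"]
        try:
            results.append(handler(*task["args"], **task["kwargs"]))
        except Exception as exc:
            add_dead_letter(task, exc)
    return results

# Two tasks fail while the upstream service is down...
for name in ("task_0", "task_1"):
    add_dead_letter({"args": (name,), "kwargs": {}}, ConnectionError("service down"))
needs_alert = dlq_needs_attention()
print(needs_alert)        # True

# ...and are replayed once it recovers.
replayed = replay(lambda prompt: f"Response to: {prompt}")
print(replayed)           # ['Response to: task_0', 'Response to: task_1']
print(len(dead_letters))  # 0
```

Storing args and kwargs, as with_dlq does, is what makes the replay possible: the queue entry carries everything needed to call the agent function again.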

Verification

Run the test loop and verify:

Successful calls return a response string starting with "Response to:".
Transient failures are retried automatically with increasing delays.
After 3 consecutive failures, the circuit breaker opens and subsequent calls raise RuntimeError with "Circuit is open".
After the recovery timeout (10 seconds), the circuit enters half-open state and allows one test call.
Failed tasks appear in the DLQ after exhausting all retries.
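
Waiting out a real recovery timeout makes the circuit breaker checks slow to confirm by hand. One way to check the state transitions quickly is to drive a breaker with a fake clock. The sketch below is a compact re-implementation of the Step 2 state machine for that purpose; the clock parameter and the FakeClock and ClockedBreaker classes are assumptions added here, not part of the guide's CircuitBreaker.

```python
import time
from typing import Callable, Optional

class FakeClock:
    """Manually advanced clock so the check does not have to sleep."""
    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

class ClockedBreaker:
    """Same state machine as the Step 2 breaker, with an injectable clock."""
    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 10.0,
                 clock: Callable[[], float] = time.time) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.last_failure_time: Optional[float] = None
        self.state = "closed"

    def record_failure(self) -> None:
        self.failures += 1
        self.last_failure_time = self.clock()
        if self.failures >= self.failure_threshold:
            self.state = "open"

    def record_success(self) -> None:
        self.failures = 0
        self.state = "closed"

    def is_available(self) -> bool:
        if self.state == "open":
            if self.clock() - self.last_failure_time >= self.recovery_timeout:
                self.state = "half-open"
                return True
            return False
        return True  # closed or half-open

clock = FakeClock()
breaker = ClockedBreaker(failure_threshold=3, recovery_timeout=10.0, clock=clock)
states = []

for _ in range(3):
    breaker.record_failure()
states.append((breaker.state, breaker.is_available()))  # open, rejects calls

clock.now += 10.0                                       # recovery timeout elapses
states.append((breaker.is_available(), breaker.state))  # allowed, half-open

breaker.record_success()                                # trial call succeeded
states.append(breaker.state)

print(states)  # [('open', False), (True, 'half-open'), 'closed']
```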

Common failures

Retrying non-transient errors. Retrying a 400 Bad Request response wastes API quota and delays discovery of the real problem. Inspect the exception type and retry only on ConnectionError, timeout errors, or 5xx HTTP responses.
Circuit breaker opening too aggressively. A low failure_threshold combined with a short recovery_timeout can cause the circuit to oscillate between open and closed states during partial outages. For most LLM APIs, a threshold of 3 to 5 failures and a timeout of 60 seconds or more is a reasonable starting point.
Missing DLQ monitoring. Dead-letter queues that grow without inspection create silent failures in production. Add a scheduled task or webhook that alerts when DLQ depth exceeds a threshold.
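
The point about non-transient errors can be made concrete with a predicate that decides whether an exception is worth retrying. The sketch below assumes a hypothetical HTTPStatusError that carries a status_code attribute; real HTTP clients expose status codes differently, so the check would need adapting. Backoff delays are left out to keep the example short.

```python
class HTTPStatusError(Exception):
    """Hypothetical exception type carrying an HTTP status code."""
    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code

def is_transient(error: Exception) -> bool:
    """Return True for failures that a retry has a chance of fixing."""
    if isinstance(error, (ConnectionError, TimeoutError)):
        return True
    if isinstance(error, HTTPStatusError):
        return 500 <= error.status_code < 600  # server-side errors only
    return False

def call_with_selective_retry(fn, max_attempts: int = 3):
    """Retry fn only on transient errors; re-raise anything else at once."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as error:
            if not is_transient(error) or attempt == max_attempts - 1:
                raise

print(is_transient(ConnectionError("reset")))  # True
print(is_transient(HTTPStatusError(503)))      # True
print(is_transient(HTTPStatusError(400)))      # False

# A 400 response is raised after a single attempt instead of being retried.
attempts = []

def bad_request():
    attempts.append("call")
    raise HTTPStatusError(400)

try:
    call_with_selective_retry(bad_request)
except HTTPStatusError:
    pass
print(len(attempts))  # 1
```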

Version mismatch. The installed package or runtime differs from the one this guide assumes. Check the installed version first, then rerun the smallest verification command.
Local environment drift. A different service, virtual environment, model, or path is in use than the one intended. Print the active binary path and configuration before changing the guide steps.
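
For these two environment checks, a short script that prints the interpreter path, the Python version, and the installed tenacity version is often enough to spot a mismatch. The sketch below uses only the standard library; environment_report is an illustrative name, and the virtual environment check is a common heuristic rather than a guarantee.

```python
import platform
import sys
from importlib import metadata

def environment_report(package: str = "tenacity") -> dict:
    """Collect the interpreter path, Python version, and a package version."""
    try:
        package_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        package_version = None  # not installed in this environment
    return {
        "executable": sys.executable,
        "python": platform.python_version(),
        # Heuristic: inside a venv, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        package: package_version,
    }

report = environment_report()
for key, value in report.items():
    print(f"{key}: {value}")
```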

Related guides

Implement Parallel Agent Execution - Combine retry logic with parallel fan-out so that failed agent tasks are rescheduled alongside other agents without blocking the entire pipeline.
Implement Streaming Responses in AI APIs - Stream responses with retry logic so partial outputs are preserved even when a connection drops mid-stream.