What this does

Rate limiting prevents the agent from overwhelming external APIs (search engines, databases, LLM endpoints) by capping the number of calls within a time window. This protects both the service and your budget.

Steps

Install a rate limiting library.

pip install limiter

Apply a per-tool rate limit. Use a token bucket or sliding window algorithm.

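from limiter import Limiter

# 10 calls per minute for web search.
# Assumption: the `limiter` package's token bucket takes `rate` in tokens per
# second and `capacity` as the burst size; confirm against the installed version.
search_limiter = Limiter(rate=10 / 60, capacity=10)

# `tool` is your agent framework's tool decorator; `actual_search` is your search client.
@tool
def web_search(query: str) -> str:
    """Search the web (rate limited: 10/min)."""
    with search_limiter:
        return actual_search(query)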
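
If you would rather not depend on a library, the token bucket idea mentioned above can be sketched in a few lines. This is a minimal illustration, not the `limiter` package's implementation: the class name `TokenBucket`, the `try_acquire` method, and the injectable `clock` parameter are all hypothetical names chosen for this example.

```python
import time


class TokenBucket:
    """Minimal token bucket: refills `rate` tokens per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock        # injectable so the bucket can be tested without sleeping
        self.tokens = capacity    # start full, which allows an initial burst
        self.updated = clock()

    def _refill(self) -> None:
        # Add the tokens earned since the last update, never exceeding capacity.
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False instead of blocking."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# 10 calls per minute: one token refilled every 6 seconds, burst of up to 10.
bucket = TokenBucket(rate=10 / 60, capacity=10)
```

Lowering `capacity` toward 1 trades burst tolerance for a steadier request rate.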

Implement a global rate limiter for all external calls.

import time
from collections import deque

class GlobalRateLimiter:
    def __init__(self, max_calls: int = 30, window_seconds: int = 60):
        self.max_calls = max_calls
        self.window = window_seconds
        self.timestamps = deque()

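    def _purge(self, now: float) -> None:
        # Drop timestamps that have aged out of the window
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()

    def wait_if_needed(self):
        now = time.time()
        self._purge(now)

        if len(self.timestamps) >= self.max_calls:
            # Sleep until the oldest call leaves the window, then purge again
            # so the deque never holds more than max_calls entries
            sleep_time = self.timestamps[0] + self.window - now + 0.1
            time.sleep(max(0, sleep_time))
            now = time.time()
            self._purge(now)

        self.timestamps.append(now)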

limiter = GlobalRateLimiter(max_calls=30, window_seconds=60)
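
The limiter above is written for a single thread. If the agent runs tools in parallel threads, two callers could pass the length check at the same moment. One possible variation, sketched under that assumption, guards the window with a lock. The name `ThreadSafeRateLimiter` and the `clock` and `sleep` parameters are hypothetical and exist only to keep the example testable.

```python
import threading
import time
from collections import deque


class ThreadSafeRateLimiter:
    """Sliding-window limiter guarded by a lock (illustrative sketch)."""

    def __init__(self, max_calls: int = 30, window_seconds: float = 60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.window = window_seconds
        self.timestamps = deque()
        self._lock = threading.Lock()
        self._clock = clock
        self._sleep = sleep

    def wait_if_needed(self) -> float:
        """Block until a slot is free; return the seconds spent waiting."""
        waited = 0.0
        with self._lock:
            while True:
                now = self._clock()
                # Drop calls that have left the window.
                while self.timestamps and self.timestamps[0] <= now - self.window:
                    self.timestamps.popleft()
                if len(self.timestamps) < self.max_calls:
                    self.timestamps.append(now)
                    return waited
                # Wait until the oldest recorded call expires, then re-check.
                delay = self.timestamps[0] + self.window - now
                self._sleep(delay)
                waited += delay
```

Holding the lock while sleeping makes waiting callers queue behind each other, which is usually what a global limiter wants. The returned wait time can be logged to see how often the agent is throttled.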

Track rate limits per external service. Different APIs have different limits.

class ServiceRateLimiter:
    def __init__(self):
        self.limiters = {}

    def get_limiter(self, service: str, max_calls: int, window: int):
        if service not in self.limiters:
            self.limiters[service] = GlobalRateLimiter(max_calls, window)
        return self.limiters[service]

rate_limiter = ServiceRateLimiter()

# Different services have different limits
search_rate = rate_limiter.get_limiter("tavily", max_calls=10, window=60)
db_rate = rate_limiter.get_limiter("database", max_calls=100, window=60)
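
To avoid calling `wait_if_needed()` by hand in every tool, the per-service limiter can be attached with a decorator. The sketch below assumes only that the limiter object has a `wait_if_needed()` method. `rate_limited` and `CountingLimiter` are hypothetical names, and `CountingLimiter` is a stand-in so the example runs on its own.

```python
import functools


def rate_limited(limiter):
    """Decorator: call limiter.wait_if_needed() before each invocation."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            limiter.wait_if_needed()
            return func(*args, **kwargs)
        return wrapper
    return decorator


class CountingLimiter:
    """Stand-in limiter that only counts calls, so this sketch is self-contained."""

    def __init__(self):
        self.calls = 0

    def wait_if_needed(self):
        self.calls += 1


# In the guide this would be rate_limiter.get_limiter("tavily", max_calls=10, window=60).
search_rate = CountingLimiter()


@rate_limited(search_rate)
def search(query: str) -> str:
    return f"results for {query}"  # placeholder for the real API call
```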

Handle rate limit errors from the API. If the API returns 429, back off.

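import random
import time

def call_with_backoff(func, max_retries=3):
    last_error = None
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            # Matching on the message is a fallback; prefer the client's own
            # rate limit exception or status code if it exposes one.
            if "429" not in str(e) and "rate limit" not in str(e).lower():
                raise
            last_error = e
            if attempt < max_retries - 1:
                # Exponential backoff with jitter: about 1s, 2s, 4s, ...
                wait = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait)
    raise RuntimeError("Rate limit retries exhausted") from last_error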
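
Many APIs send a `Retry-After` header with a 429 response. When the client library exposes that value, it is usually a better wait time than a guessed backoff. The sketch below assumes the hint is given in seconds and arrives on a hypothetical `RateLimited` exception; real clients name and shape this differently, so adapt the `except` clause to yours. `backoff_delay`, `call_with_retry_after`, and the 30-second cap are likewise illustrative.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical exception carrying the server's Retry-After hint (seconds)."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after


def backoff_delay(attempt: int, retry_after=None, base: float = 1.0, cap: float = 30.0) -> float:
    """Prefer the server's hint; otherwise exponential backoff with jitter, capped."""
    if retry_after is not None:
        return min(cap, max(0.0, float(retry_after)))
    return min(cap, base * (2 ** attempt) + random.uniform(0, 1))


def call_with_retry_after(func, max_retries: int = 3, sleep=time.sleep):
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimited as exc:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the original error
            sleep(backoff_delay(attempt, exc.retry_after))
    raise RuntimeError("max_retries must be at least 1")
```

The cap keeps a single bad hint from stalling the agent for minutes at a time.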

Return rate limit status to the agent. Let the agent know if it's being throttled.

@tool
def check_rate_limit_status() -> str:
    """Check current rate limit usage for all services."""
    now = time.time()
    status = []
    for service, limiter in rate_limiter.limiters.items():
        # Count only calls still inside the window; expired entries may linger in the deque
        used = sum(1 for t in limiter.timestamps if t > now - limiter.window)
        remaining = max(0, limiter.max_calls - used)
        status.append(f"{service}: {remaining}/{limiter.max_calls} calls remaining")
    return "\n".join(status)
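
A remaining-call count tells the agent that it is throttled but not for how long. As a possible extension, the wait time can be derived from the same sliding-window timestamps. The helper names `seconds_until_free` and `throttle_message` are hypothetical, and the message format is only a suggestion.

```python
def seconds_until_free(timestamps, max_calls: int, window: float, now: float) -> float:
    """Seconds until a sliding-window limiter would admit another call (0.0 if now)."""
    live = sorted(t for t in timestamps if t > now - window)
    if len(live) < max_calls:
        return 0.0
    # A slot opens once enough of the oldest live entries have expired.
    blocking = live[len(live) - max_calls]
    return blocking + window - now


def throttle_message(service: str, wait: float) -> str:
    """Format a short status line the agent can read."""
    if wait <= 0:
        return f"{service}: ready"
    return f"{service}: throttled, retry in {wait:.1f}s"
```

For example, three calls recorded at 0, 10, and 20 seconds with a limit of 3 per 60 seconds leave the service throttled for another 30 seconds when checked at the 30-second mark.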

Verification

python -c "
from collections import deque
import time
d = deque()
for _ in range(3):
    d.append(time.time())
    time.sleep(0.01)
# Only keep entries from last 1 second
while d and d[0] < time.time() - 1:
    d.popleft()
print(len(d))
# Expected: 3 (all within 1 second)
"

Common failures

Distributed agent instances. A per-process limiter cannot see calls made by other processes, so several agent instances sharing one API can exceed its limit together. Use a Redis-backed rate limiter for distributed systems.
Limiter blocks urgent calls. When a user is waiting for a response, time spent in the limiter adds directly to their latency. Prioritize user-facing calls over background ones.
Burst exhausts the budget. A burst of tool calls at the start of the window uses up the allowance, and every later call in that window has to wait. Smooth requests to a steady rate instead of allowing the full burst.
Version mismatch. The installed package or runtime differs from the command shown. Check the version first and rerun the smallest verification command.
Local environment drift. Another service, virtual environment, model, or path is being used. Print the active binary path and configuration before changing the guide steps.
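
For the distributed case, one common approach is a shared counter per time window. The sketch below is a fixed-window counter, which is simpler than the sliding window used earlier in this guide and can admit up to twice the limit across a window boundary. It assumes a client exposing `incr(key)` and `expire(key, seconds)`, as redis-py's `Redis` class does. The function name `allow_call`, the key format, and `FakeClient` are hypothetical, with `FakeClient` included only so the example runs without a Redis server.

```python
import time


def allow_call(client, service: str, max_calls: int, window_seconds: int, now=None) -> bool:
    """Fixed-window counter shared through a Redis-like client."""
    now = time.time() if now is None else now
    window_id = int(now // window_seconds)
    key = f"ratelimit:{service}:{window_id}"  # illustrative key format
    count = client.incr(key)                  # atomic on a real Redis server
    if count == 1:
        # First call in this window: let the key clean itself up later.
        client.expire(key, window_seconds * 2)
    return count <= max_calls


class FakeClient:
    """In-memory stand-in so the sketch runs without a Redis server."""

    def __init__(self):
        self.counts = {}

    def incr(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

    def expire(self, key, seconds):
        return True  # expiry is not simulated here
```

With a real server, every agent instance would pass the same connection settings so that all of them increment the same keys.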

Related guides

How to Handle Function Call Errors and Retries
How to Manage Agent Context Window Limits