HOW-TO · RAG
How to Add Rate Limiting to Agent Tool Calls
Target environment
Ubuntu 24.04 · Ollama 0.4.x
Prerequisites
Agent with tool calls, rate limiting library, Python 3.10+
What this does
Rate limiting prevents the agent from overwhelming external APIs (search engines, databases, LLM endpoints) by capping the number of calls within a time window. This protects both the service and your budget.
Steps
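- Install a rate limiting library. On Ubuntu 24.04, run this inside a virtual environment, because the system Python is marked as externally managed and rejects direct pip installs.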
pip install limiter
- Apply a per-tool rate limit. Use a token bucket or sliding window algorithm.
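from limiter import Limiter

# Assumes `tool` is your agent framework's tool decorator and
# `actual_search` is your existing search function.

# 10 calls per minute for web search. In the limiter package, `rate` is the
# number of tokens refilled per second and `capacity` is the maximum burst;
# confirm the parameter names against your installed version.
search_limiter = Limiter(rate=10 / 60, capacity=10)

@tool
def web_search(query: str) -> str:
    """Search the web (rate limited: 10/min)."""
    with search_limiter:
        return actual_search(query)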
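For reference, the token bucket mentioned in this step can be sketched in a few lines. This is a minimal illustration, not the limiter package's implementation; `TokenBucket`, `try_acquire`, and the injectable `clock` parameter are hypothetical names chosen for the example.

```python
import time


class TokenBucket:
    """Hypothetical token bucket: refills `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the bucket can be tested without sleeping
        self.tokens = capacity      # start full, which allows an initial burst
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False instead of blocking."""
        now = self.clock()
        # Refill in proportion to elapsed time, never exceeding capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# 10 calls per minute on average, with bursts of up to 10
bucket = TokenBucket(rate=10 / 60, capacity=10)
```

A smaller `capacity` trades burst tolerance for smoother traffic; `capacity=1` spaces calls roughly evenly.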
- Implement a global rate limiter for all external calls.
import threading
import time
from collections import deque

class GlobalRateLimiter:
    """Sliding-window limiter: at most max_calls per window_seconds."""

    def __init__(self, max_calls: int = 30, window_seconds: int = 60):
        self.max_calls = max_calls
        self.window = window_seconds
        self.timestamps = deque()
        self.lock = threading.Lock()  # tools may run in parallel threads

    def wait_if_needed(self):
        with self.lock:
            while True:
                now = time.time()
                # Remove timestamps that have left the window
                while self.timestamps and self.timestamps[0] <= now - self.window:
                    self.timestamps.popleft()
                if len(self.timestamps) < self.max_calls:
                    break
                # Sleep until the oldest call expires, then re-check
                time.sleep(max(0, self.timestamps[0] + self.window - now))
            self.timestamps.append(time.time())

limiter = GlobalRateLimiter(max_calls=30, window_seconds=60)
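One way to apply the global limiter without editing every tool body is a small decorator. The sketch below assumes only that the limiter object exposes `wait_if_needed()`; `rate_limited`, `_CountingLimiter`, and `fetch_page` are hypothetical names, and the counting stand-in exists just to keep the example runnable on its own.

```python
import functools


def rate_limited(limiter):
    """Wrap a function so each call first waits on limiter.wait_if_needed()."""
    def decorator(func):
        @functools.wraps(func)  # preserve the name and docstring the agent sees
        def wrapper(*args, **kwargs):
            limiter.wait_if_needed()
            return func(*args, **kwargs)
        return wrapper
    return decorator


class _CountingLimiter:
    """Stand-in with the same wait_if_needed() interface; it only counts calls."""

    def __init__(self):
        self.calls = 0

    def wait_if_needed(self):
        self.calls += 1


demo_limiter = _CountingLimiter()


@rate_limited(demo_limiter)
def fetch_page(url: str) -> str:
    """Hypothetical external call."""
    return f"fetched {url}"
```

In practice you would pass the `limiter` instance from this step instead of the stand-in.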
- Track rate limits per external service. Different APIs have different limits.
class ServiceRateLimiter:
    def __init__(self):
        self.limiters = {}

    def get_limiter(self, service: str, max_calls: int, window: int):
        if service not in self.limiters:
            self.limiters[service] = GlobalRateLimiter(max_calls, window)
        return self.limiters[service]

rate_limiter = ServiceRateLimiter()

# Different services have different limits
search_rate = rate_limiter.get_limiter("tavily", max_calls=10, window=60)
db_rate = rate_limiter.get_limiter("database", max_calls=100, window=60)
- Handle rate limit errors from the API. If the API returns 429, back off.
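import random
import time

def call_with_backoff(func, max_retries=3):
    """Retry func() with exponential backoff when it raises a rate limit error."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            # Crude detection by message; prefer your client's status code or exception type
            if "429" not in str(e) and "rate limit" not in str(e).lower():
                raise
            last_error = e
            if attempt < max_retries - 1:
                # Exponential backoff with jitter: 1-2 s, then 2-3 s, then 4-5 s
                wait = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait)
    raise RuntimeError("Rate limit retries exhausted") from last_error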
- Return rate limit status to the agent. Let the agent know if it's being throttled.
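@tool
def check_rate_limit_status() -> str:
    """Check current rate limit usage for all services."""
    now = time.time()
    status = []
    for service, limiter in rate_limiter.limiters.items():
        # Count only calls still inside the window; expired entries may not be purged yet
        used = sum(1 for t in list(limiter.timestamps) if t > now - limiter.window)
        remaining = max(0, limiter.max_calls - used)
        status.append(f"{service}: {remaining}/{limiter.max_calls} calls remaining")
    return "\n".join(status)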
Verification
python -c "
from collections import deque
import time
d = deque()
for _ in range(3):
    d.append(time.time())
    time.sleep(0.01)
# Only keep entries from last 1 second
while d and d[0] < time.time() - 1:
    d.popleft()
print(len(d))
# Expected: 3 (all within 1 second)
"
Common failures
- Distributed agent instances. Rate limiting per process doesn't work when multiple agent instances call the same API. Use a Redis-backed rate limiter for distributed systems.
- Limiter blocks urgent calls. When a user is waiting on a response, a blocking limiter adds to the latency. Reserve part of the budget for user-facing calls, or queue background calls at lower priority.
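- Burst exhausts the window. A burst of tool calls at the start of the window uses up the budget, so every later call waits until the window rolls over. Smooth requests by enforcing a minimum interval between calls, or use a token bucket with a small capacity.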
- Version mismatch. The installed package or runtime differs from the command shown. Check the version first, then rerun the smallest verification command.
- Local environment drift. Another service, virtual environment, model, or path is being used. Print the active binary path and configuration before changing the guide steps.
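For the distributed case above, a common pattern is a fixed-window counter kept in Redis so that all agent instances share one budget. The sketch below is one possible shape, not a production implementation: it assumes a client exposing `incr(key)` and `expire(key, seconds)`, as redis-py's `Redis` client does, and `SharedWindowLimiter` and `InMemoryClient` are hypothetical names. The in-memory stand-in is only for trying the logic locally.

```python
import time


class SharedWindowLimiter:
    """Fixed-window counter shared through a Redis-like client."""

    def __init__(self, client, service: str, max_calls: int,
                 window_seconds: int, clock=time.time):
        self.client = client
        self.service = service
        self.max_calls = max_calls
        self.window = window_seconds
        self.clock = clock

    def try_acquire(self) -> bool:
        # All instances compute the same key for the current window
        window_id = int(self.clock() // self.window)
        key = f"ratelimit:{self.service}:{window_id}"
        count = self.client.incr(key)  # atomic increment on a real Redis server
        if count == 1:
            # First call in this window: let the key expire after the window has passed
            self.client.expire(key, self.window * 2)
        return count <= self.max_calls


class InMemoryClient:
    """Stand-in for local experiments; replace with a real Redis client."""

    def __init__(self):
        self.data = {}

    def incr(self, key):
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]

    def expire(self, key, seconds):
        pass  # expiry is not simulated in this stand-in
```

A fixed window still permits a burst at each window boundary; a sliding window or token bucket stored in Redis avoids that but needs more bookkeeping.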
Related guides
- How to Handle Function Call Errors and Retries
- How to Manage Agent Context Window Limits