RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
HOW-TO · RAG

How to Add Rate Limiting to Agent Tool Calls

intermediate · 15 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Agent with tool calls, rate limiting library, Python 3.10+

What this does

Rate limiting prevents the agent from overwhelming external APIs (search engines, databases, LLM endpoints) by capping the number of calls within a time window. This protects both the service and your budget.

Steps

  • Install a rate limiting library.
pip install limiter
  • Apply a per-tool rate limit. Use a token bucket or sliding window algorithm.
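from limiter import Limiter

# `tool` is your agent framework's tool decorator and `actual_search` is your
# existing search function; neither is defined in this guide.

# About 10 calls per minute for web search. This assumes the token-bucket
# signature of the `limiter` package (rate = tokens added per second,
# capacity = largest allowed burst). Check the README of the installed version.
search_limiter = Limiter(rate=10 / 60, capacity=1)

@tool
def web_search(query: str) -> str:
    """Search the web (rate limited: 10/min)."""
    # Blocks here until a token is available
    with search_limiter:
        return actual_search(query)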
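If you would rather not depend on a third-party package, the token bucket mentioned in this step can be sketched with the standard library alone. The `TokenBucket` class and its method names below are illustrative assumptions, not part of any library: tokens refill continuously at `rate` per second, and `capacity` bounds how large a burst can be.

```python
import threading
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens/second, holds at most `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.updated = time.monotonic()
        self._lock = threading.Lock()

    def _refill(self) -> None:
        # Add the tokens earned since the last update, never exceeding capacity
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        with self._lock:
            self._refill()
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    def acquire(self) -> None:
        """Block until one token is available, then take it."""
        while True:
            with self._lock:
                self._refill()
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait = (1 - self.tokens) / self.rate  # time until one full token
            time.sleep(wait)  # sleep outside the lock so other threads can check


# 10 calls per minute on average, with bursts of at most 3
bucket = TokenBucket(rate=10 / 60, capacity=3)
```

Calling `bucket.acquire()` at the top of a tool function gives the blocking behavior; `try_acquire()` suits cases where the tool should report "throttled" to the agent instead of waiting.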
  • Implement a global rate limiter for all external calls.
import threading
import time
from collections import deque

class GlobalRateLimiter:
    """Sliding-window limiter: at most max_calls per window_seconds."""

    def __init__(self, max_calls: int = 30, window_seconds: int = 60):
        self.max_calls = max_calls
        self.window = window_seconds
        self.timestamps = deque()
        self._lock = threading.Lock()  # keeps check-and-append atomic across threads

    def wait_if_needed(self):
        with self._lock:
            while True:
                now = time.time()
                # Drop timestamps that have left the window
                while self.timestamps and self.timestamps[0] <= now - self.window:
                    self.timestamps.popleft()
                if len(self.timestamps) < self.max_calls:
                    break
                # Window is full: sleep until the oldest call expires, then re-check
                time.sleep(max(0, self.timestamps[0] + self.window - now))
            self.timestamps.append(time.time())

limiter = GlobalRateLimiter(max_calls=30, window_seconds=60)
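A global limiter only helps if every external call passes through `wait_if_needed()`. One way to make that harder to forget is a small decorator. The `rate_limited` name is a hypothetical helper, and `CountingLimiter` and `fetch_page` are stand-ins so the sketch runs on its own; in practice you would pass the `GlobalRateLimiter` instance instead.

```python
import functools


def rate_limited(limiter):
    """Decorator: call limiter.wait_if_needed() before each invocation."""
    def decorator(func):
        @functools.wraps(func)  # keep the name and docstring for the agent framework
        def wrapper(*args, **kwargs):
            limiter.wait_if_needed()  # may block until a slot is free
            return func(*args, **kwargs)
        return wrapper
    return decorator


class CountingLimiter:
    """Stand-in for illustration only; it counts calls and never blocks."""

    def __init__(self):
        self.calls = 0

    def wait_if_needed(self):
        self.calls += 1


demo_limiter = CountingLimiter()


@rate_limited(demo_limiter)
def fetch_page(url: str) -> str:
    """Hypothetical tool body; a real one would perform the HTTP request."""
    return f"fetched {url}"


print(fetch_page("https://example.com"))
print(demo_limiter.calls)
```

If your framework's `@tool` decorator reads the function's docstring, it probably needs to sit above `@rate_limited(...)` so that it wraps the already-limited function.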
  • Track rate limits per external service. Different APIs have different limits.
class ServiceRateLimiter:
    def __init__(self):
        self.limiters = {}

    def get_limiter(self, service: str, max_calls: int, window: int):
        if service not in self.limiters:
            self.limiters[service] = GlobalRateLimiter(max_calls, window)
        return self.limiters[service]

rate_limiter = ServiceRateLimiter()

# Different services have different limits
search_rate = rate_limiter.get_limiter("tavily", max_calls=10, window=60)
db_rate = rate_limiter.get_limiter("database", max_calls=100, window=60)
  • Handle rate limit errors from the API. If the API returns 429, back off.
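import random
import time

def call_with_backoff(func, max_retries=3):
    """Retry func on rate-limit errors with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            # Heuristic match on the message; prefer your client library's
            # own rate-limit exception type if it has one.
            is_rate_limit = "429" in str(e) or "rate limit" in str(e).lower()
            if not is_rate_limit:
                raise
            if attempt == max_retries - 1:
                # Do not sleep after the final attempt; keep the original error chained
                raise RuntimeError("Rate limit retries exhausted") from e
            # 1s, 2s, 4s, ... plus up to 1s of random jitter
            time.sleep((2 ** attempt) + random.uniform(0, 1))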
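Some APIs send a `Retry-After` value with a 429 response, and honoring it is usually better than guessing. The sketch below assumes your HTTP wrapper raises a typed error carrying that value. `RateLimitError`, `backoff_delay`, and `call_with_typed_backoff` are hypothetical names introduced for illustration, and the `sleep` parameter exists only so the behavior can be checked without real waiting.

```python
import random
import time
from typing import Optional


class RateLimitError(Exception):
    """Hypothetical error that a client wrapper raises on HTTP 429."""

    def __init__(self, message: str, retry_after: Optional[float] = None):
        super().__init__(message)
        self.retry_after = retry_after  # seconds, if the server supplied a value


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 1.0, cap: float = 30.0) -> float:
    """Seconds to wait before retrying after failed attempt `attempt` (0-based)."""
    if retry_after is not None:
        return min(cap, retry_after)  # trust the server, but never wait past the cap
    # Exponential growth, capped, with full jitter to avoid synchronized retries
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_typed_backoff(func, max_retries: int = 3, sleep=time.sleep):
    """Retry only on RateLimitError; every other exception propagates at once."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError as err:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last rate-limit error
            sleep(backoff_delay(attempt, err.retry_after))
```

Catching a specific exception type avoids retrying unrelated failures whose message happens to contain "429".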
  • Return rate limit status to the agent. Let the agent know if it's being throttled.
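import time

@tool
def check_rate_limit_status() -> str:
    """Check current rate limit usage for all services."""
    now = time.time()
    status = []
    for service, limiter in rate_limiter.limiters.items():
        # Old timestamps are only pruned on the next call, so count just the
        # ones still inside the window instead of using len() directly.
        used = sum(1 for t in list(limiter.timestamps) if t > now - limiter.window)
        remaining = max(0, limiter.max_calls - used)
        status.append(f"{service}: {remaining}/{limiter.max_calls} calls remaining")
    return "\n".join(status)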

Verification

python -c "
from collections import deque
import time
d = deque()
for _ in range(3):
    d.append(time.time())
    time.sleep(0.01)
# Only keep entries from last 1 second
while d and d[0] < time.time() - 1:
    d.popleft()
print(len(d))
# Expected: 3 (all within 1 second)
"
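The command above only exercises deque pruning. To check the throttling behavior itself without waiting a real minute, one option is to pass the clock in. The sketch below restates the sliding-window idea as a standalone function with a hypothetical `FakeClock` whose `sleep()` advances time instantly; it is a test aid under those assumptions, not the limiter class from the steps.

```python
from collections import deque


class FakeClock:
    """Manual clock: sleep() advances time instantly instead of blocking."""

    def __init__(self):
        self.now = 0.0
        self.slept = []

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.slept.append(seconds)
        self.now += seconds


def wait_if_needed(timestamps: deque, max_calls: int, window: float, clock) -> None:
    """Sliding-window check with the clock injected for testing."""
    while True:
        now = clock.time()
        # Drop calls that have left the window
        while timestamps and timestamps[0] <= now - window:
            timestamps.popleft()
        if len(timestamps) < max_calls:
            break
        # Full window: wait until the oldest call expires, then re-check
        clock.sleep(timestamps[0] + window - now)
    timestamps.append(clock.time())


clock = FakeClock()
stamps = deque()
for _ in range(3):
    wait_if_needed(stamps, max_calls=2, window=60, clock=clock)

print(clock.slept)  # [60.0]: only the third call had to wait
```

The first two calls pass immediately; the third exceeds the limit of 2 and waits the full 60 simulated seconds.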

Common failures

  • Distributed agent instances. A per-process limiter only counts its own calls, so N agent instances calling the same API can send up to N times the intended rate. Use a shared, Redis-backed rate limiter for distributed systems.
  • Limiter blocks urgent calls. If a user waits for a response, rate limiting delays it further. Prioritize user-facing calls over background ones.
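  • Burst at the start of the window. A burst of tool calls at the start of the window exhausts the budget, and every later call stalls until the window rolls over. Spread requests at a steady rate, for example with a token bucket that has a small burst capacity.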
  • Version mismatch. The installed package or runtime differs from the one the commands assume. Check the version first, then rerun the smallest verification command.
  • Local environment drift. A different service, virtual environment, model, or path is in use than you expect. Print the active binary path and configuration before changing the guide steps.
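For the distributed case in the first item, a shared counter in Redis is one common approach. The sketch below is a fixed-window counter that relies only on a client exposing `incr` and `expire`, which redis-py's `Redis` client provides. The `allow_call` function, the key scheme, and the `InMemoryClient` stand-in are assumptions added so the example runs without a server.

```python
import time
from typing import Optional


def allow_call(client, service: str, max_calls: int, window_seconds: int,
               now: Optional[float] = None) -> bool:
    """Fixed-window counter shared by all processes. True if the call may proceed."""
    now = time.time() if now is None else now
    window_id = int(now // window_seconds)
    key = f"ratelimit:{service}:{window_id}"  # hypothetical key scheme
    count = client.incr(key)  # atomic increment; creates the key at 1 if missing
    if count == 1:
        # First hit in this window: let the key expire after the window has passed
        client.expire(key, window_seconds * 2)
    return count <= max_calls


class InMemoryClient:
    """Stand-in with the two methods used above; expiry is not simulated."""

    def __init__(self):
        self.counts = {}

    def incr(self, key: str) -> int:
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

    def expire(self, key: str, seconds: int) -> bool:
        return True


client = InMemoryClient()  # in production: a redis-py client connected to your server
results = [allow_call(client, "tavily", 2, 60, now=120.0) for _ in range(3)]
print(results)  # [True, True, False]
```

A fixed window is simple but can admit up to twice the limit across a window boundary, and the separate `incr` and `expire` calls are not atomic as a pair. If either matters for your API, a sliding window or a server-side script is the usual next step.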

Related guides

  • How to Handle Function Call Errors and Retries
  • How to Manage Agent Context Window Limits