RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Local AI APIs and Integration

07. Rate Limiting

Chapter 7 of 18 · 20 min
KEY INSIGHT

Rate limiting protects your inference infrastructure from overload. Without it, a single client can monopolize GPU memory and compute, degrading service for everyone else. Token bucket and sliding window algorithms both balance fairness against burst handling; this chapter implements a token bucket.

### Token Bucket Implementation

```python
import time
from collections import defaultdict


class RateLimiter:
    def __init__(self, rate: int, per_seconds: int):
        self.rate = rate                # bucket capacity, and tokens added per period
        self.per_seconds = per_seconds  # length of the refill period in seconds
        # Each key starts with a full bucket
        self.buckets = defaultdict(
            lambda: {"tokens": rate, "last_refill": time.monotonic()}
        )

    def allow_request(self, key: str) -> bool:
        bucket = self.buckets[key]
        now = time.monotonic()  # monotonic clock: unaffected by system clock changes

        # Refill in proportion to elapsed time, capped at capacity
        elapsed = now - bucket["last_refill"]
        refill = (elapsed / self.per_seconds) * self.rate
        bucket["tokens"] = min(self.rate, bucket["tokens"] + refill)
        bucket["last_refill"] = now

        # Spend one token if one is available
        if bucket["tokens"] >= 1:
            bucket["tokens"] -= 1
            return True
        return False
```

The limiter keeps one bucket per client, refilled at a fixed rate. Each request consumes one token. When fewer than one token remains, the request is rejected.

### Integration with FastAPI

```python
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
limiter = RateLimiter(rate=60, per_seconds=60)  # 60 requests per minute


@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    # Identify the client by API key, falling back to the remote address
    client_id = request.headers.get("X-API-Key", request.client.host)
    if not limiter.allow_request(client_id):
        return JSONResponse(
            status_code=429,
            content={"error": {"message": "Rate limit exceeded", "type": "rate_limit_error"}},
        )
    response = await call_next(request)
    return response
```

The middleware intercepts every request before it reaches the endpoint. Rejected requests receive a 429 response whose body follows the OpenAI error format.

### Response Headers

Inform clients about their remaining quota using response headers. In the snippet below, the reset value is taken to be the Unix time at which the bucket is full again; conventions for this header differ between APIs.

```python
# After `response = await call_next(request)` in the middleware:
bucket = limiter.buckets[client_id]
reset_timestamp = int(time.time() + (limiter.rate - bucket["tokens"]) * limiter.per_seconds / limiter.rate)
response.headers["X-RateLimit-Limit"] = str(limiter.rate)
response.headers["X-RateLimit-Remaining"] = str(int(bucket["tokens"]))
response.headers["X-RateLimit-Reset"] = str(reset_timestamp)
```

These headers allow clients to implement their own backoff strategies rather than blindly retrying.

### Failure Modes

Rate limit state held in memory is local to one process. With multiple workers, each worker enforces its own limit, so a client can exceed the intended quota. Use Redis for distributed rate limiting across worker processes. Setting limits too low rejects legitimate usage spikes.
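The multi-worker failure mode can be illustrated with a counter kept in a shared store. The sketch below is a fixed-window counter rather than a token bucket, chosen because it needs only the Redis `INCR` and `EXPIRE` commands. The class name `SharedWindowLimiter`, the `ratelimit:` key prefix, and the `clock` parameter are hypothetical, and `client` is assumed to behave like redis-py's `redis.Redis` (its `incr` and `expire` methods).

```python
import time


class SharedWindowLimiter:
    """Fixed-window counter in a shared store (hypothetical helper).

    `client` is assumed to expose `incr(name)` returning the new integer
    value and `expire(name, seconds)`, as redis-py's `redis.Redis` does.
    """

    def __init__(self, client, limit: int, window_seconds: int, clock=time.time):
        self.client = client
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the window can be tested without sleeping

    def allow_request(self, key: str) -> bool:
        # One counter per client per window; every worker uses the same key
        window = int(self.clock() // self.window_seconds)
        name = f"ratelimit:{key}:{window}"
        count = self.client.incr(name)
        if count == 1:
            # First hit in this window: let the store drop the counter later.
            # INCR and EXPIRE are separate calls here, so this is not atomic.
            self.client.expire(name, self.window_seconds)
        return count <= self.limit
```

Against a real server, this might be constructed as `SharedWindowLimiter(redis.Redis(), limit=60, window_seconds=60)`. A fixed window is simpler than a token bucket, but it can admit up to twice the limit in a short span that straddles a window boundary.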

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
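One way to capture the details this checkpoint asks for is a small script run alongside the exercise. The function name `environment_record` and its default package list are hypothetical; adjust the list to whatever the example actually imports.

```python
import json
import platform
from importlib import metadata


def environment_record(packages=("fastapi",)):
    """Collect interpreter, OS, and package versions for a lab note."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Paste the printed JSON next to the observed output of the exercise
    print(json.dumps(environment_record(), indent=2))
```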

EXERCISE

Implement a rate limiter that allows 10 requests per minute per API key. Verify that the 11th request within a minute receives a 429 response while the first 10 succeed.
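One way to check the limiter logic without waiting on a real clock is to make the time source injectable. The sketch below is a variant of the chapter's token bucket with a hypothetical `clock` parameter, plus a hypothetical `check_exercise` helper that drives it with a fake clock. It is one possible approach, not a reference solution.

```python
class TokenBucket:
    """Per-key token bucket with an injectable clock (hypothetical variant)."""

    def __init__(self, rate: int, per_seconds: float, clock):
        self.rate = rate
        self.per_seconds = per_seconds
        self.clock = clock
        self.buckets = {}  # key -> (tokens, time of last refill)

    def allow_request(self, key: str) -> bool:
        now = self.clock()
        # Unknown keys start with a full bucket
        tokens, last = self.buckets.get(key, (float(self.rate), now))
        # Refill in proportion to elapsed time, capped at capacity
        tokens = min(self.rate, tokens + (now - last) / self.per_seconds * self.rate)
        allowed = tokens >= 1
        if allowed:
            tokens -= 1
        self.buckets[key] = (tokens, now)
        return allowed


def check_exercise():
    fake_now = [0.0]
    limiter = TokenBucket(rate=10, per_seconds=60, clock=lambda: fake_now[0])
    # With the clock frozen, no refill happens between these 11 requests
    results = [limiter.allow_request("key-a") for _ in range(11)]
    fake_now[0] = 12.0  # 12 s at 10 tokens per 60 s refills about two tokens
    after_wait = limiter.allow_request("key-a")
    other_key = limiter.allow_request("key-b")  # separate key, separate bucket
    return results, after_wait, other_key
```

Here `results` should hold ten allowed requests followed by one rejection. The 429 status itself is checked at the HTTP layer, for example with FastAPI's `TestClient` sending eleven requests with the same `X-API-Key` header.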
