RUNLOCALAI · v38

© 2026 runlocalai.co · Independently operated

How to implement rate limiting for AI APIs

intermediate·15 min·By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

A FastAPI (or similar) application and a rate limiting library such as slowapi

What this does

Implementing rate limiting for AI APIs prevents abuse, ensures fair resource allocation, and protects backend model servers from overload. Rate limiting caps the number of requests per client (identified by API key or IP address) within a time window—for example, 60 requests per minute for free-tier users. When a client exceeds the limit, the API returns a 429 Too Many Requests response with a Retry-After header. This protects expensive GPU inference resources and ensures consistent latency for all users.
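To make the mechanism concrete, here is a minimal standard-library sketch of a fixed-window counter, assuming a 60-requests-per-minute cap per client key. The class name FixedWindowLimiter and its check method are illustrative names, not part of any library, and a production limiter would normally come from a maintained package rather than this sketch.

```python
import math
import time
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class FixedWindowLimiter:
    """Allow at most `limit` requests per client key in each `window` seconds."""

    limit: int = 60
    window: float = 60.0
    # Maps client key -> (window start time, requests counted in that window).
    _counters: dict = field(default_factory=dict, init=False, repr=False)

    def check(self, key: str, now: Optional[float] = None) -> Tuple[bool, int]:
        """Return (allowed, retry_after_seconds) for one request from `key`."""
        now = time.monotonic() if now is None else now
        start, count = self._counters.get(key, (now, 0))
        if now - start >= self.window:
            # The previous window has expired, so start a fresh one.
            start, count = now, 0
        if count >= self.limit:
            # Over the cap: report how long until the current window resets.
            retry_after = max(1, math.ceil(start + self.window - now))
            return False, retry_after
        self._counters[key] = (start, count + 1)
        return True, 0


# Hypothetical free-tier client sending 61 requests at the same instant.
limiter = FixedWindowLimiter(limit=60, window=60.0)
results = [limiter.check("free-tier-key", now=0.0)[0] for _ in range(61)]
print(results.count(True), results[-1])
```

The retry_after value returned on rejection is what an API would place in the Retry-After header of its 429 response.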

Steps

Install dependencies: pip install slowapi redis (Redis is only needed for distributed setups).

Configure the limiter in the FastAPI application: from slowapi import Limiter; from slowapi.util import get_remote_address; limiter = Limiter(key_func=get_remote_address). Attach it to the app with app.state.limiter = limiter.

Apply rate limits to endpoints with decorators, placing @limiter.limit below the route decorator. slowapi also requires the decorated handler to accept a request: Request argument. For the inference endpoint, apply a per-client limit: @app.post("/v1/completions"); @limiter.limit("60/minute"). For the health endpoint, use a higher limit: @limiter.limit("600/minute").

Customize the key function for API-key-based limiting: def get_api_key(request: Request): return request.headers.get("X-API-Key", request.client.host). Pass it to the limiter: limiter = Limiter(key_func=get_api_key).

For Redis-backed mode, configure the storage through the limiter's storage_uri argument: limiter = Limiter(key_func=get_api_key, storage_uri="redis://localhost:6379/0").

Add proper error responses by registering an exception handler for slowapi.errors.RateLimitExceeded with app.add_exception_handler. Return status 429 with a JSON body such as {"error": "rate_limit_exceeded", "message": "Too many requests. Try again in 30 seconds.", "retry_after": 30} and set the Retry-After header to the same number of seconds.

For tiered limits, create multiple limiter instances or pass @limiter.limit a callable that returns the limit string for the authenticated user's tier.

Add rate limit headers to all responses (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) so clients can self-regulate. In slowapi this is switched on with Limiter(headers_enabled=True).

Test limits with a load testing tool: hey -n 100 -c 10 <endpoint>.
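The key-function, tiering, and error-response steps can be prototyped without a web framework. The sketch below is framework-free Python under stated assumptions: TIER_LIMITS, API_KEY_TIERS, and every function name are hypothetical, and headers are modelled as a plain dict (real HTTP header lookups are case-insensitive). It is meant to show the shape of the logic, not slowapi's internals.

```python
import json
from typing import Dict, Mapping, Tuple

# Hypothetical tier table; limit strings use the "count/period" form from the steps.
TIER_LIMITS = {"free": "60/minute", "pro": "600/minute"}
# Hypothetical key-to-tier lookup; a real service would query its auth store.
API_KEY_TIERS = {"key-alice": "free", "key-bob": "pro"}
PERIOD_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}


def resolve_client_key(headers: Mapping[str, str], client_host: str) -> str:
    """Prefer the X-API-Key header; fall back to the client IP address."""
    return headers.get("X-API-Key") or client_host


def limit_for_key(key: str) -> str:
    """Return the limit string for a client; unknown keys get the free tier."""
    return TIER_LIMITS[API_KEY_TIERS.get(key, "free")]


def parse_limit(limit: str) -> Tuple[int, int]:
    """Split "60/minute" into (request count, window length in seconds)."""
    count, period = limit.split("/")
    return int(count), PERIOD_SECONDS[period]


def rate_limit_headers(limit: int, remaining: int, reset_epoch: int) -> Dict[str, str]:
    """Build the informational headers sent on every response."""
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),
    }


def too_many_requests(retry_after: int) -> Tuple[int, Dict[str, str], str]:
    """Build the (status, headers, body) triple for a rejected request."""
    body = {
        "error": "rate_limit_exceeded",
        "message": f"Too many requests. Try again in {retry_after} seconds.",
        "retry_after": retry_after,
    }
    headers = {"Retry-After": str(retry_after), "Content-Type": "application/json"}
    return 429, headers, json.dumps(body)


key = resolve_client_key({"X-API-Key": "key-bob"}, "203.0.113.7")
print(key, parse_limit(limit_for_key(key)))
status, headers, body = too_many_requests(30)
print(status, headers["Retry-After"], body)
```

Keeping the message text and the Retry-After header derived from the same retry_after value avoids the two drifting apart, which is easy to do when the message is a hard-coded string.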

  • Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.

  • Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.

  • Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.

  • Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
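One way to capture the starting state and run evidence from the checklist above is a small script that prints the interpreter path, package versions, and the location of the Ollama binary. This is a sketch using only the standard library; the package list is an assumption based on the dependencies named in this guide, and the function names are illustrative.

```python
import shutil
import sys
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a package, or a marker if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def starting_state(packages=("fastapi", "slowapi", "redis")) -> dict:
    """Collect the interpreter, package versions, and Ollama binary path."""
    state = {
        "python": sys.executable,
        "python_version": sys.version.split()[0],
    }
    for name in packages:
        state[name] = package_version(name)
    # shutil.which returns None when the binary is not on PATH.
    state["ollama_binary"] = shutil.which("ollama") or "not found"
    return state


for item, value in starting_state().items():
    print(f"{item}: {value}")
```

Saving this output next to the test command gives a record that can be compared when results differ between machines.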

Verification

Temporarily set a low test limit such as 10/minute, send 10 rapid requests, and verify that the 11th returns 429 with the correct JSON error body. Check that the Retry-After header is present and no longer than the window length. Verify that the rate limit headers appear on successful requests. Test with two different API keys and confirm that each gets an independent rate limit counter. Reset the counters (restart the server when using in-memory storage, or flush the limiter keys when using Redis) and verify that requests are accepted again. Test the health endpoint: it should not be rate-limited at the same threshold as the inference endpoint.
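The first few checks can be scripted. The sketch below assumes a hypothetical send callable that performs one request and returns (status, headers, body); in a real run it would wrap an HTTP client call against the endpoint. An in-process fake endpoint stands in for the server here so the checker itself can be tried without one.

```python
import json
from typing import Callable, List, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], str]  # (status, headers, body)
INFO_HEADERS = ("X-RateLimit-Limit", "X-RateLimit-Remaining", "X-RateLimit-Reset")


def verify_limit(send: Callable[[], Response], limit: int) -> List[str]:
    """Send limit + 1 requests; return the problems found (empty means pass)."""
    problems = []
    for i in range(1, limit + 1):
        status, headers, _ = send()
        if status != 200:
            problems.append(f"request {i}: expected 200, got {status}")
        for name in INFO_HEADERS:
            if name not in headers:
                problems.append(f"request {i}: missing {name}")
    # The request just past the limit must be rejected.
    status, headers, body = send()
    if status != 429:
        problems.append(f"request {limit + 1}: expected 429, got {status}")
    if not headers.get("Retry-After", "").isdigit():
        problems.append("429 response: Retry-After missing or not an integer")
    try:
        payload = json.loads(body)
    except ValueError:
        payload = None
    error = payload.get("error") if isinstance(payload, dict) else None
    if error != "rate_limit_exceeded":
        problems.append("429 response: unexpected JSON error body")
    return problems


def make_fake_endpoint(limit: int) -> Callable[[], Response]:
    """In-process stand-in for the API, used only to exercise the checker."""
    seen = 0

    def send() -> Response:
        nonlocal seen
        seen += 1
        if seen > limit:
            body = json.dumps({"error": "rate_limit_exceeded", "retry_after": 30})
            return 429, {"Retry-After": "30"}, body
        headers = {
            "X-RateLimit-Limit": str(limit),
            "X-RateLimit-Remaining": str(limit - seen),
            "X-RateLimit-Reset": "30",
        }
        return 200, headers, "{}"

    return send


print(verify_limit(make_fake_endpoint(10), limit=10))
```

Running the same checker once per API key is a simple way to confirm that counters are independent.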

Common failures

Rate limiter counting all requests as one client: Verify that key_func is extracting the API key or IP correctly; log the resolved key during development.

Redis connection failure: Implement a fallback to in-memory storage with a logged warning so the API remains functional during Redis outages.

Inconsistent limits in multi-instance deployments: Only Redis-backed storage provides consistent limits across instances; in-memory storage is per-process.

Rate limit headers not updating: Ensure the limiter middleware is placed before the route handler in the middleware stack.

False positives from proxy IPs: Behind a reverse proxy, request.client.host is the proxy's address, so every client shares one counter. Use the X-Forwarded-For header instead, but only when the request arrives from a proxy you trust.
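For the proxy case, a minimal sketch of trusted X-Forwarded-For handling is shown below, using only the standard library. TRUSTED_PROXIES is a hypothetical list that would need to match the actual proxy addresses, and client_ip is an illustrative helper rather than a framework API. The header is read right to left because the rightmost entries are the ones appended by your own proxies, while the leftmost can be set by the client.

```python
import ipaddress
from typing import Iterable, Optional

# Hypothetical proxy ranges; replace with the addresses of your own proxies.
TRUSTED_PROXIES = ("127.0.0.1/32", "10.0.0.0/8")


def _parse(addr: str):
    """Parse an IP address string, returning None when it is not valid."""
    try:
        return ipaddress.ip_address(addr.strip())
    except ValueError:
        return None


def client_ip(
    peer: str,
    forwarded_for: Optional[str],
    trusted: Iterable[str] = TRUSTED_PROXIES,
) -> str:
    """Return the address to rate-limit on for one request."""
    networks = [ipaddress.ip_network(net) for net in trusted]

    def is_trusted(ip) -> bool:
        return any(ip in net for net in networks)

    peer_ip = _parse(peer)
    if peer_ip is None or not forwarded_for or not is_trusted(peer_ip):
        # Direct or untrusted connection: the header could be spoofed, ignore it.
        return peer
    # Walk right to left and stop at the first hop that is not one of our proxies.
    for hop in reversed(forwarded_for.split(",")):
        hop_ip = _parse(hop)
        if hop_ip is None:
            # Malformed header: fall back to the connecting address.
            return peer
        if not is_trusted(hop_ip):
            return str(hop_ip)
    # Every hop was a trusted proxy, so keep the connecting address.
    return peer


print(client_ip("10.0.0.5", "198.51.100.7, 10.0.0.9"))
```

The resolved address can then be returned from the limiter's key function whenever no API key is present.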

  • Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
  • Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
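The Redis connection failure case above can be sketched as a counter that falls back to process memory when the shared store raises. FallbackCounter and the primary_incr callable are hypothetical names; in practice primary_incr might wrap a Redis INCR call. slowapi's Limiter also documents an in_memory_fallback_enabled option, which is worth checking against the installed version before writing a custom wrapper.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger("ratelimit")


class FallbackCounter:
    """Count hits in a shared store, falling back to process memory on failure."""

    def __init__(self, primary_incr: Callable[[str], int]):
        # primary_incr is a hypothetical callable that increments and returns
        # the counter for a key in shared storage.
        self._primary_incr = primary_incr
        self._local: Dict[str, int] = {}
        self.degraded = False

    def incr(self, key: str) -> int:
        """Increment the counter for `key` and return its new value."""
        try:
            count = self._primary_incr(key)
        except Exception as exc:  # sketch only; narrow to connection errors in practice
            if not self.degraded:
                # Log once per outage rather than once per request.
                logger.warning("shared store unavailable, counting in memory: %s", exc)
                self.degraded = True
            self._local[key] = self._local.get(key, 0) + 1
            return self._local[key]
        if self.degraded:
            # The shared store is back, so drop the temporary local counts.
            self._local.clear()
            self.degraded = False
        return count


def broken_store(key: str) -> int:
    """Stand-in for a shared store that is currently unreachable."""
    raise ConnectionError("redis down")


counter = FallbackCounter(broken_store)
print(counter.incr("key-alice"), counter.incr("key-alice"), counter.degraded)
```

While degraded, limits are enforced per process only, so the effective limit across several instances is higher than configured until the shared store returns.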

Related guides

  • setup-authentication-local-ai-endpoints
  • setup-auto-scaling-llm-inference
  • build-multi-tenant-ai-serving