How to implement rate limiting for AI APIs
Requires: FastAPI or a similar framework, and a rate limiting library (slowapi in this guide).
What this does
Rate limiting caps the number of requests each client (identified by API key or IP address) can make within a time window, for example 60 requests per minute for free-tier users. When a client exceeds the limit, the API returns a 429 Too Many Requests response with a Retry-After header telling the client how long to wait. For AI APIs this prevents abuse, keeps resource allocation fair, and protects expensive GPU inference servers from overload, which in turn keeps latency consistent for all users.
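To make the mechanism concrete, here is a minimal fixed-window counter in plain Python. It is a sketch of the general idea, not what slowapi does internally; the class name FixedWindowLimiter and the injectable clock parameter are hypothetical and exist only for illustration.

```python
import math
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class FixedWindowLimiter:
    """Allow at most `limit` requests per client in each window (hypothetical helper)."""

    limit: int = 60
    window_seconds: float = 60.0
    clock: Callable[[], float] = time.monotonic  # injectable so the sketch is testable
    # client key -> (window start time, requests counted in that window)
    _windows: Dict[str, Tuple[float, int]] = field(default_factory=dict)

    def check(self, client_key: str) -> Tuple[bool, int]:
        """Return (allowed, retry_after_seconds) for one incoming request."""
        now = self.clock()
        start, count = self._windows.get(client_key, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # the previous window has expired
        if count >= self.limit:
            # Over the limit: report how long until the current window ends.
            retry_after = math.ceil(start + self.window_seconds - now)
            return False, max(retry_after, 1)
        self._windows[client_key] = (start, count + 1)
        return True, 0


limiter = FixedWindowLimiter(limit=60, window_seconds=60)
allowed, retry_after = limiter.check("api-key-123")  # one counter per client key
```

When check returns allowed as False, the API would answer with 429 and put retry_after in the Retry-After header. This in-memory dictionary is per-process, which is why multi-instance deployments need shared storage such as Redis.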
Steps
Install dependencies. Run pip install slowapi redis (Redis is only needed for distributed setups).
Configure the rate limiter. In the FastAPI application: from slowapi import Limiter; from slowapi.util import get_remote_address; limiter = Limiter(key_func=get_remote_address).
Attach the limiter to the app. Set app.state.limiter = limiter so slowapi can find it at request time.
Apply rate limits to endpoints using decorators. The route decorator goes first and the limit decorator directly below it, and the endpoint function must accept a request: Request argument. For the inference endpoint, apply a per-client limit: @app.post("/v1/completions") followed by @limiter.limit("60/minute"). For the health endpoint, use a higher limit: @limiter.limit("600/minute").
Customize the key function for API-key-based limiting. Define def get_api_key(request: Request): return request.headers.get("X-API-Key", request.client.host), which falls back to the client IP when no key is sent, and pass it to the limiter: limiter = Limiter(key_func=get_api_key).
Configure Redis-backed storage for multi-instance deployments. Pass a storage URI to the limiter: limiter = Limiter(key_func=get_api_key, storage_uri="redis://localhost:6379/0").
Add proper error responses. Register an exception handler for slowapi.errors.RateLimitExceeded with app.add_exception_handler. It should return status 429 with a JSON body such as {"error": "rate_limit_exceeded", "message": "Too many requests. Try again in 30 seconds.", "retry_after": 30} and set the Retry-After header to the same value.
Add tiered limits. Create multiple limiter instances, or pass @limiter.limit a callable that returns the limit string for the authenticated user's tier.
Add rate limit headers to all responses. Expose X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset so clients can self-regulate; in slowapi this is enabled with Limiter(..., headers_enabled=True).
Test limits with a load testing tool. Run hey -n 100 -c 10 <endpoint>, adding -m POST for the completions endpoint because hey sends GET requests by default.
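The error body, tier lookup, and rate limit headers described above can be kept as small framework-independent helpers. The following is a sketch under assumptions: the function names and the "pro" tier are hypothetical, and only the 60/minute free tier comes from this guide.

```python
from typing import Dict, Optional, Tuple

# Hypothetical tier table; only the 60/minute free tier is taken from this guide.
TIER_LIMITS: Dict[str, str] = {"free": "60/minute", "pro": "600/minute"}


def limit_for_tier(tier: Optional[str]) -> str:
    """Return the limit string for a tier, defaulting to the free tier."""
    return TIER_LIMITS.get(tier or "free", TIER_LIMITS["free"])


def rate_limit_error(retry_after: int) -> Tuple[dict, Dict[str, str]]:
    """Build the JSON body and headers for a 429 response."""
    body = {
        "error": "rate_limit_exceeded",
        "message": f"Too many requests. Try again in {retry_after} seconds.",
        "retry_after": retry_after,
    }
    return body, {"Retry-After": str(retry_after)}


def rate_limit_headers(limit: int, remaining: int, reset_epoch: int) -> Dict[str, str]:
    """Build the informational headers sent on every response."""
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(remaining, 0)),  # never report a negative count
        "X-RateLimit-Reset": str(reset_epoch),  # Unix time when the window resets
    }
```

Inside a FastAPI exception handler, the pair returned by rate_limit_error(30) could be passed to JSONResponse(status_code=429, content=body, headers=headers). Keeping the payload in one function makes it easy to assert on the exact body during verification.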
Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.
Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.
Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.
Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
Verification
Send rapid requests up to the configured limit (60 within a minute for the inference endpoint above) and verify that the next request returns 429 with the correct JSON error body. Check that the Retry-After header is present and no longer than the window length. Verify rate limit headers appear on successful requests. Test with two different API keys: confirm each gets independent rate limit counters. Verify counters reset: restart the server when using in-memory storage, or flush Redis when using Redis-backed storage, since Redis counters survive a server restart. Test the health endpoint: it should not be rate-limited at the same threshold as the inference endpoint.
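The burst check can be scripted with the standard library. This is a sketch, not a full test suite: check_burst and http_sender are hypothetical helper names, and the sender is passed in as a callable so the checking logic can be exercised without a running server.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, List, Tuple

Reply = Tuple[int, Dict[str, str]]  # (status code, response headers)


def http_sender(url: str, api_key: str) -> Callable[[], Reply]:
    """Build a sender that POSTs an empty JSON body with the X-API-Key header."""

    def send() -> Reply:
        req = urllib.request.Request(
            url,
            data=b"{}",
            method="POST",
            headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        )
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.status, dict(resp.headers.items())
        except urllib.error.HTTPError as err:  # raised for non-2xx, including 429
            return err.code, dict(err.headers.items())

    return send


def check_burst(send: Callable[[], Reply], limit: int) -> List[str]:
    """Send limit + 1 requests and return a list of problems (empty means pass)."""
    for i in range(1, limit + 1):
        status, _ = send()
        if status == 429:
            return [f"request {i} was throttled before the limit of {limit}"]
    status, headers = send()
    if status != 429:
        return [f"request {limit + 1} returned {status}, expected 429"]
    if not any(name.lower() == "retry-after" for name in headers):
        return ["429 response has no Retry-After header"]
    return []
```

Assuming the API is served at http://localhost:8000 (an assumption, not something this guide fixes), check_burst(http_sender("http://localhost:8000/v1/completions", "key-a"), 60) should return an empty list. Running it again immediately with a second key, such as "key-b", checks that counters are independent per key.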
Common failures
- Rate limiter counting all requests as one client - Verify the key_func is correctly extracting the API key or IP; add logging to print the resolved key during development.
- Redis connection failure - Implement a fallback to in-memory storage with a logged warning so the API remains functional during Redis outages (slowapi's Limiter exposes in_memory_fallback_enabled for this).
- Inconsistent limits in multi-instance deployments - Only Redis-backed storage provides consistent limits across instances; in-memory storage is per-process.
- Rate limit headers missing or not updating - Create the limiter with headers_enabled=True. In slowapi, an endpoint that does not return a Response object also needs a response: Response parameter so the headers can be injected.
- False positives from proxy IPs - Behind a reverse proxy, request.client.host is the proxy's address, so every client shares one counter. Use the X-Forwarded-For header instead, but only when the request arrives from a proxy you trust, because clients can set this header themselves.
- Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
- Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
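For the proxy case, one way to resolve the real client address is to accept X-Forwarded-For only from trusted proxies and walk it from right to left. The sketch below uses only the standard library; client_ip and its parameters are hypothetical names, and the trusted network list must match your own deployment.

```python
import ipaddress
from typing import Iterable, Optional


def client_ip(peer_ip: str, forwarded_for: Optional[str],
              trusted_proxies: Iterable[str]) -> str:
    """Return the address to rate-limit on, honoring X-Forwarded-For only via trusted proxies."""
    networks = [ipaddress.ip_network(p) for p in trusted_proxies]

    def is_trusted(ip: str) -> bool:
        try:
            addr = ipaddress.ip_address(ip)
        except ValueError:
            return False  # not a valid IP, so never treat it as a proxy
        return any(addr in net for net in networks)

    # A direct (untrusted) peer could have forged the header, so ignore it.
    if not forwarded_for or not is_trusted(peer_ip):
        return peer_ip

    # Rightmost entries were appended by the proxies closest to this server.
    hops = [h.strip() for h in forwarded_for.split(",") if h.strip()]
    for hop in reversed(hops):
        if not is_trusted(hop):
            return hop  # first address that is not one of our own proxies
    return peer_ip


# Example: the request arrives from a proxy in 10.0.0.0/8.
key = client_ip("10.0.0.5", "198.51.100.7", ["10.0.0.0/8"])
```

In a key function, peer_ip would come from request.client.host and forwarded_for from request.headers.get("X-Forwarded-For"). Taking the leftmost entry instead would let any client pick its own rate limit key by sending a forged header.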
Related guides
- setup-authentication-local-ai-endpoints
- setup-auto-scaling-llm-inference
- build-multi-tenant-ai-serving