RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /Courses
  5. /AI-Powered SaaS Products
  6. /Ch. 14
AI-Powered SaaS Products

14. API Rate Limiting

Chapter 14 of 24 · 25 min
KEY INSIGHT

Rate limiting in a Nigerian SaaS must balance fair usage across different tiers while handling the bursty traffic patterns common in Lagos and Abuja internet connections, where connection quality varies significantly. Rate limiting serves two purposes: protecting system resources and enforcing tier boundaries. For a Nigerian AI SaaS, the implementation must handle variance in client behavior while providing clear feedback when limits are approached. ```python from typing import Optional from datetime import datetime, timedelta from collections import defaultdict import time import hashlib from redis import Redis class RateLimiter: """Token bucket rate limiter with Redis backend.""" def __init__(self, redis_client: Redis): self.redis = redis_client self.default_limits = { 'free': {'requests': 100, 'window': 60}, # 100/min 'starter': {'requests': 1000, 'window': 60}, # 1000/min 'professional': {'requests': 5000, 'window': 60}, # 5000/min 'enterprise': {'requests': 20000, 'window': 60} } def check_rate_limit( self, tenant_id: str, endpoint: str, tier: str ) -> dict: """Check and update rate limit, returning limit status.""" limits = self.default_limits.get(tier, self.default_limits['free']) max_requests = limits['requests'] window_seconds = limits['window'] key = f"ratelimit:{tenant_id}:{endpoint}:{int(time.time() / window_seconds)}" current = self.redis.incr(key) if current == 1: self.redis.expire(key, window_seconds) remaining = max(0, max_requests - current) reset_time = (int(time.time() / window_seconds) + 1) * window_seconds return { 'allowed': current <= max_requests, 'limit': max_requests, 'remaining': remaining, 'reset': reset_time, 'retry_after': max(0, window_seconds - (int(time.time()) % window_seconds)) if current > max_requests else None } ``` For AI endpoints specifically, rate limiting must account for token consumption, not just request counts. An AI SaaS making 50 small requests consumes different resources than 50 large requests. ```python class AIEndpointRateLimiter: """Rate limiter for AI endpoints with token awareness.""" def __init__(self, redis_client: Redis, openai_client): self.redis = redis_client self.openai = openai_client self.limits = { 'free': {'monthly_tokens': 100000}, 'starter': {'monthly_tokens': 1000000}, 'professional': {'monthly_tokens': 5000000}, 'enterprise': {'monthly_tokens': 50000000} } def check_and_consume( self, tenant_id: str, tier: str, prompt_tokens: int, completion_tokens: int ) -> tuple[bool, dict]: """Check token budget and consume from allocation.""" total_tokens = prompt_tokens + completion_tokens monthly_limit = self.limits.get(tier, {}).get('monthly_tokens', 0) month_start = datetime.utcnow().replace(day=1, hour=0, minute=0, second=0, microsecond=0) key = f"ai_tokens:{tenant_id}:{month_start.isoformat()}" current_usage = int(self.redis.get(key) or 0) new_usage = current_usage + total_tokens if new_usage > monthly_limit: return False, { 'current_usage': current_usage, 'monthly_limit': monthly_limit, 'requested': total_tokens, 'shortage': new_usage - monthly_limit } self.redis.incrby(key, total_tokens) self.redis.expire(key, 86400 * 35) # Keep for full billing period return True, { 'current_usage': new_usage, 'monthly_limit': monthly_limit, 'remaining': monthly_limit - new_usage } ``` **Handling Burst Traffic:** Nigerian internet connections often have inconsistent latency, leading to clients retrying failed requests multiple times. This creates artificial spikes in rate limit consumption. ```python def handle_burst_gracefully( limiter: RateLimiter, tenant_id: str, endpoint: str, tier: str, request_count: int ) -> dict: """Handle burst requests with jitter-aware limiting.""" result = limiter.check_rate_limit(tenant_id, endpoint, tier) if not result['allowed']: # Apply jitter to spread retry attempts jitter = random.uniform(1.1, 1.5) base_wait = result['retry_after'] or 60 suggested_wait = int(base_wait * jitter) return { 'allowed': False, 'retry_after': min(suggested_wait, 300), # Cap at 5 minutes 'message': f"Rate limit exceeded. Retry after {suggested_wait} seconds." } return result ``` **Common Failure Modes:** Redis rate limiting fails silently when Redis is unavailable, causing all checks to pass and overwhelming the system. Always implement a fallback that denies requests when the rate limiter cannot confirm allowance. ```python def check_rate_limit_with_fallback( limiter: RateLimiter, tenant_id: str, endpoint: str, tier: str ) -> dict: """Check rate limit with circuit breaker fallback.""" try: return limiter.check_rate_limit(tenant_id, endpoint, tier) except RedisConnectionError: logger.error(f"Redis unavailable for rate limit check, tenant {tenant_id}") # Fail closed: deny request when limiter unavailable return { 'allowed': False, 'error': 'rate_limiter_unavailable', 'retry_after': 5 } ```

EXERCISE

Implement a rate limiter that distinguishes between different AI model endpoints (GPT-3.5 vs GPT-4) with different limits per model. Create tier-based limits where enterprise tenants can use more GPT-4 tokens while starter tenants have higher GPT-3.5 allowances. Test the implementation with mock Redis to verify token tracking across billing periods.

← Chapter 13
Invoice Generation
Chapter 15 →
Quota Management