17. Rate Limiting and Retries

Chapter 17 of 36 · 15 min

Why Rate Limits Exist

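AI APIs cap how many requests (and often how many tokens) each client can send per minute, both to prevent abuse and to keep access fair. When you exceed a limit, the API responds with HTTP 429 (Too Many Requests), often with a Retry-After header that says how long to wait. Your code must handle this gracefully instead of crashing or retrying immediately.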

Detecting Rate Limits

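import requests

# url, headers, and payload are assumed to be defined for your API
response = requests.post(url, headers=headers, json=payload, timeout=30)

if response.status_code == 429:
    print("Rate limited")
    # Retry-After is optional, so fall back to 60 seconds when it is missing
    retry_after = response.headers.get("Retry-After", "60")
    print(f"Wait {retry_after} seconds")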
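
The Retry-After header is not always a plain number: the HTTP specification also allows an HTTP date. The sketch below shows one way to turn either form into a number of seconds using only the standard library. The helper name parse_retry_after, its now parameter, and the 60-second default are assumptions made for illustration, not part of any API.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, default=60.0, now=None):
    """Return seconds to wait for a Retry-After value (hypothetical helper).

    Accepts delay-seconds ("120") or an HTTP date
    ("Wed, 21 Oct 2015 07:28:00 GMT"); returns `default` otherwise.
    """
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        target = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Not a recognizable date either, so use the fallback
        return default
    if target.tzinfo is None:
        # Treat dates without an offset as UTC
        target = target.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means we can retry immediately
    return max(0.0, (target - now).total_seconds())
```

With a response object in hand, this could be called as parse_retry_after(response.headers.get("Retry-After")).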

Manual Retry Logic

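import time
import requests

def call_with_retry(url, headers, payload, max_retries=3, base_delay=1):
    """POST to url, retrying on 429 up to max_retries attempts."""
    for attempt in range(max_retries):
        response = requests.post(url, headers=headers, json=payload, timeout=30)

        if response.status_code == 429:
            if attempt == max_retries - 1:
                break  # out of attempts, so skip the final sleep
            # Retry-After may be seconds or an HTTP date; use it only when numeric
            retry_after = response.headers.get("Retry-After", "")
            delay = int(retry_after) if retry_after.isdigit() else base_delay * 2 ** attempt
            print(f"Rate limited. Waiting {delay}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(delay)
            continue

        response.raise_for_status()  # raise on any other 4xx/5xx error
        return response.json()

    raise RuntimeError(f"Still rate limited after {max_retries} attempts")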

Exponential Backoff

Exponential backoff doubles the wait after each failed attempt, so repeated failures put progressively less load on the API:

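import time
import requests

RETRYABLE_STATUS = {429, 500, 502, 503, 504}

def call_with_backoff(url, headers, payload, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = requests.post(url, headers=headers, json=payload, timeout=30)
            response.raise_for_status()
            return response.json()
        except (requests.exceptions.HTTPError,
                requests.exceptions.ConnectionError,
                requests.exceptions.Timeout) as e:
            # e.response is None for connection errors and timeouts
            status = e.response.status_code if e.response is not None else None
            # Client errors such as 400 or 401 will not succeed on retry
            if status is not None and status not in RETRYABLE_STATUS:
                raise
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt  # 1, 2, 4, 8 seconds with the default max_retries=5
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {wait}s")
            time.sleep(wait)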
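
When many clients back off on the same 1, 2, 4, ... schedule, they tend to retry at the same moments and collide again. A common refinement is to cap the delay and add random jitter. The sketch below computes one such delay; the function name backoff_delay and the base, cap, and rng parameters are illustrative assumptions, and this "full jitter" variant is only one of several strategies in use.

```python
import random


def backoff_delay(attempt, base=1.0, cap=30.0, rng=None):
    """Return a randomized wait in seconds for a zero-based attempt number.

    The upper bound doubles each attempt (base * 2**attempt) up to `cap`;
    the actual wait is drawn uniformly from [0, upper bound].
    """
    rng = rng or random  # allow a seeded random.Random for repeatable tests
    upper = min(cap, base * 2 ** attempt)
    return rng.uniform(0, upper)
```

In a retry loop such as call_with_backoff, time.sleep(backoff_delay(attempt)) could take the place of the fixed wait.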

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Write a function that simulates an API call (use a counter to make it fail twice then succeed). Implement exponential backoff with increasing delays. Time the total execution.
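
One possible starting point for the simulated call is a closure that counts its own invocations. The names make_flaky_call and SimulatedAPIError are hypothetical; the backoff loop and the timing are left to you.

```python
class SimulatedAPIError(Exception):
    """Stands in for a rate-limit failure from a real API (hypothetical)."""


def make_flaky_call(failures=2):
    """Return a function that raises `failures` times, then succeeds."""
    calls = 0

    def flaky_call():
        nonlocal calls
        calls += 1
        if calls <= failures:
            raise SimulatedAPIError(f"simulated failure {calls}")
        return {"status": "ok", "calls": calls}

    return flaky_call
```

Calling time.perf_counter() before and after your retry loop is one way to measure the total execution time.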