17. Rate Limiting and Retries
Why Rate Limits Exist
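AI APIs cap how many requests (and often how many tokens) you can send per minute, both to prevent abuse and to keep access fair across users. When you exceed a limit, the API responds with HTTP 429 (Too Many Requests). Your code should handle that response gracefully by waiting and retrying, rather than crashing or retrying immediately.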
Detecting Rate Limits
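import requests

# url, headers, and payload are assumed to be defined for your API.
response = requests.post(url, headers=headers, json=payload)
if response.status_code == 429:
    print("Rate limited")
    # Header values are strings; fall back to 60 seconds if the header is absent.
    retry_after = response.headers.get("Retry-After", 60)
    print(f"Wait {retry_after} seconds")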
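The HTTP specification allows Retry-After to carry either a number of seconds or an HTTP date, so passing the raw header to int() can fail. The sketch below shows one way to handle both forms. The helper name parse_retry_after and its default and now parameters are illustrative assumptions, not part of any API.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, default=60.0, now=None):
    """Return seconds to wait for a Retry-After header value (hypothetical helper)."""
    if value is None:
        return default
    value = value.strip()
    try:
        # Form 1: a delay in seconds, e.g. "30"
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        target = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Unparseable header: fall back to the default wait
        return default
    if target.tzinfo is None:
        # Treat a date without a timezone as UTC
        target = target.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (target - now).total_seconds())
```

With this helper, the detection code could call parse_retry_after(response.headers.get("Retry-After")) and pass the result straight to time.sleep().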
Manual Retry Logic
import time
import requests

def call_with_retry(url, headers, payload, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        response = requests.post(url, headers=headers, json=payload)
        if response.status_code == 429:
            # Prefer the server's Retry-After hint; otherwise double the delay each attempt.
            # float() fails if the header is missing or is an HTTP date, so fall back then too.
            try:
                delay = float(response.headers.get("Retry-After", ""))
            except ValueError:
                delay = base_delay * 2 ** attempt
            print(f"Rate limited. Waiting {delay}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(delay)
            continue
        # Raise on any other 4xx/5xx; every remaining response counts as success.
        response.raise_for_status()
        return response.json()
    raise RuntimeError(f"Failed after {max_retries} attempts")
Exponential Backoff
Wait longer between each retry:
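def call_with_backoff(url, headers, payload, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = requests.post(url, headers=headers, json=payload, timeout=30)
            response.raise_for_status()
            return response.json()
        except (requests.exceptions.HTTPError, requests.exceptions.Timeout) as e:
            # Client errors other than 429 (bad request, invalid key) will not succeed on retry.
            status = e.response.status_code if e.response is not None else None
            if status is not None and status < 500 and status != 429:
                raise
            if attempt == max_retries - 1:
                raise  # out of attempts: re-raise the last error
            wait = 2 ** attempt  # 1, 2, 4, 8 seconds with the default max_retries=5
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {wait}s")
            time.sleep(wait)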
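A common refinement, not shown in the function above, is to add random jitter so that many clients hitting the limit at the same moment do not all retry in lockstep. The following is a minimal sketch of the "full jitter" variant, where each wait is drawn uniformly between zero and the exponential ceiling. The function name backoff_delay and the base, cap, and rng parameters are assumptions made for illustration.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    # The ceiling doubles with each attempt but never exceeds cap
    ceiling = min(cap, base * 2 ** attempt)
    # Full jitter: pick anywhere between 0 and the ceiling
    return rng.uniform(0, ceiling)
```

To try it, time.sleep(wait) in a retry loop could become time.sleep(backoff_delay(attempt)). The cap keeps a long run of failures from producing waits of several minutes.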
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Write a function that simulates an API call (use a counter to make it fail twice then succeed). Implement exponential backoff with increasing delays. Time the total execution.
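One possible shape for this exercise is sketched below; treat it as a starting point rather than the expected answer. The names make_flaky_call and retry_with_backoff are hypothetical, the simulated failure is a plain ConnectionError, and the sleep parameter is there only so the delays can be replaced during testing.

```python
import time


def make_flaky_call(failures=2):
    """Return a function that raises `failures` times, then succeeds."""
    state = {"calls": 0}

    def flaky_call():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ConnectionError(f"simulated failure #{state['calls']}")
        return {"status": "ok", "calls": state["calls"]}

    return flaky_call


def retry_with_backoff(func, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func, doubling the wait after each ConnectionError."""
    for attempt in range(max_retries):
        try:
            return func()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    start = time.perf_counter()
    result = retry_with_backoff(make_flaky_call(failures=2), base_delay=0.1)
    elapsed = time.perf_counter() - start
    print(result, f"elapsed: {elapsed:.2f}s")
```

With base_delay=0.1 and two simulated failures, the waits are 0.1 and 0.2 seconds, so the measured total should come out at a little over 0.3 seconds.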