10. Fallback Chains
Fallback chains are what keep an AI gateway resilient. When a primary model fails, whether from latency spikes, quota exhaustion, or an outright service disruption, the gateway must move to an alternative endpoint without exposing the failure to end users.
A well-designed fallback chain follows a priority order: local models first (lowest latency, no per-request cost), then approved cloud providers in sequence, and finally a graceful degradation state. The chain is evaluated in-line with the request, so the caller waits for the result, within a configurable timeout window, typically 3-5 seconds for interactive applications.
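The priority order and timeout window can be sketched as follows. This is a minimal illustration, not the gateway's actual implementation: `ProviderEntry`, the tier constants, `run_with_budget`, the per-call cap, and the degraded message are all hypothetical names and values introduced for this example.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable

# Hypothetical tiers: a lower number is tried earlier.
TIER_LOCAL, TIER_CLOUD = 0, 1
DEGRADED_MESSAGE = "The assistant is temporarily unavailable."  # placeholder text


@dataclass
class ProviderEntry:
    name: str
    tier: int
    call: Callable[[str], Awaitable[str]]  # assumed async call signature


def order_by_priority(entries: list[ProviderEntry]) -> list[ProviderEntry]:
    # sorted() is stable, so providers in the same tier keep their configured order.
    return sorted(entries, key=lambda e: e.tier)


async def run_with_budget(entries: list[ProviderEntry], prompt: str,
                          budget_s: float = 4.0, per_call_s: float = 2.0) -> str:
    deadline = time.monotonic() + budget_s
    for entry in order_by_priority(entries):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # budget spent: stop trying and degrade
        try:
            # Cap each call so one slow provider cannot use the whole budget.
            return await asyncio.wait_for(entry.call(prompt),
                                          timeout=min(remaining, per_call_s))
        except Exception:
            continue  # any failure moves on to the next provider
    return DEGRADED_MESSAGE
```

The per-call cap is a design choice rather than something the chain requires: without it, a single hung provider could consume the entire window and leave no time for the providers behind it.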
The implementation tracks failure types, because each one calls for a different response. Timeout errors warrant immediate failover. Authentication failures suggest that credentials need rotating and should alert operators. Rate limit errors may indicate that the chain should pause briefly before retrying. Semantic mismatches, where a response fails validation, trigger distinct handling paths that may involve swapping models or restructuring the prompt.

    # AIProvider, ChainConfig, Response, and the exception classes
    # (ValidationError, QuotaExceeded, AuthError) are assumed to be defined
    # elsewhere in the gateway codebase.
    class FallbackChain:
        def __init__(self, providers: list[AIProvider], config: ChainConfig):
            self.providers = providers  # ordered by priority, local first
            self.timeout = config.timeout_ms  # passed to each call, in milliseconds
            self.max_retries = config.max_retries  # attempts per provider
            self.failover_strategy = config.strategy

        async def execute(self, prompt: str, context: dict) -> Response:
            last_error = None
            for provider in self.providers:
                for _attempt in range(self.max_retries):
                    try:
                        response = await provider.call(prompt, context, self.timeout)
                        if self._validate_response(response):
                            return response
                        # Invalid response: retry this provider until attempts run out.
                        last_error = ValidationError(f"Invalid response from {provider.name}")
                    except TimeoutError:
                        last_error = TimeoutError(f"{provider.name} exceeded timeout")
                        break  # Move to next provider
                    except QuotaExceeded:
                        last_error = QuotaExceeded(f"{provider.name} quota exhausted")
                        await self._notify_operators(provider)
                        break
                    except AuthError:
                        last_error = AuthError(f"{provider.name} auth failed")
                        await self._critical_alert(provider)
                        break
            # Every provider failed: return a degraded response instead of raising.
            return self._degraded_response(last_error)

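The failure types described earlier can also be written as an explicit policy table, which keeps the decision separate from the retry loop. The sketch below is one possible shape, not the chapter's implementation: `FailureKind`, `Action`, `POLICY`, `decide`, and the backoff parameters are hypothetical names and values chosen for illustration.

```python
from enum import Enum


class FailureKind(Enum):
    TIMEOUT = "timeout"
    AUTH = "auth"
    RATE_LIMIT = "rate_limit"
    VALIDATION = "validation"


class Action(Enum):
    FAILOVER = "failover"                      # move to the next provider now
    ALERT_AND_FAILOVER = "alert_and_failover"  # notify operators, then move on
    BACKOFF_AND_RETRY = "backoff_and_retry"    # pause briefly, retry same provider
    REPAIR_OR_SWAP = "repair_or_swap"          # restructure the prompt or swap models


# One entry per failure type named in the text.
POLICY = {
    FailureKind.TIMEOUT: Action.FAILOVER,
    FailureKind.AUTH: Action.ALERT_AND_FAILOVER,
    FailureKind.RATE_LIMIT: Action.BACKOFF_AND_RETRY,
    FailureKind.VALIDATION: Action.REPAIR_OR_SWAP,
}


def backoff_delay(attempt: int, base_s: float = 0.25, cap_s: float = 2.0) -> float:
    # Exponential backoff with a cap; attempt is zero-based.
    return min(cap_s, base_s * 2 ** attempt)


def decide(kind: FailureKind, attempt: int, max_retries: int) -> Action:
    action = POLICY[kind]
    # Once the retry allowance is used up, a rate-limited provider is skipped.
    if action is Action.BACKOFF_AND_RETRY and attempt + 1 >= max_retries:
        return Action.FAILOVER
    return action
```

With this table, a rate limit on the first attempt yields `BACKOFF_AND_RETRY` with a 0.25-second pause, while the same error on the final attempt falls through to `FAILOVER`.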
The chain should expose metrics for observability: attempted providers per request, final outcome, time-to-fallback, and failure classification. These metrics inform capacity planning and provider relationship management.
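A per-request record covering those four metrics might look like the sketch below. The `ChainMetrics` class and its method names are hypothetical, and the definition of time-to-fallback used here (time from request start to the first failure that caused a failover) is an assumption; a real deployment would forward these values to whatever metrics backend it already uses.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChainMetrics:
    started_at: float = field(default_factory=time.monotonic)
    attempted: list[str] = field(default_factory=list)      # one entry per attempt
    failures: dict[str, str] = field(default_factory=dict)  # provider -> failure class
    outcome: Optional[str] = None                           # primary, fallback, degraded
    time_to_fallback_s: Optional[float] = None

    def record_failure(self, provider: str, kind: str,
                       now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.attempted.append(provider)
        self.failures[provider] = kind
        if self.time_to_fallback_s is None:
            # Only the first failure sets time-to-fallback.
            self.time_to_fallback_s = now - self.started_at

    def record_success(self, provider: str) -> None:
        self.attempted.append(provider)
        # A success on the very first attempt means no fallback happened.
        self.outcome = "primary" if len(self.attempted) == 1 else "fallback"

    def record_degraded(self) -> None:
        self.outcome = "degraded"
```

The optional `now` argument exists only so that tests can supply fixed timestamps instead of reading the clock.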
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Implement a fallback chain that includes your local model, one cloud provider, and a degraded text-only fallback. Instrument it to track which provider handles each request and observe behavior under simulated failures.
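One way to simulate failures for this exercise is a scripted test double that replays a fixed sequence of outcomes. The sketch below is a starting point under stated assumptions: `ScriptedProvider` and `handled_by` are hypothetical helpers, and the `call(prompt, context, timeout)` signature simply mirrors the one used by `FallbackChain` in this chapter.

```python
import asyncio


class ScriptedProvider:
    """Test double that replays a script: an exception instance fails, anything else succeeds."""

    def __init__(self, name, script):
        self.name = name
        self._script = list(script)
        self.calls = 0

    async def call(self, prompt, context=None, timeout=None):
        self.calls += 1
        # Once the script is exhausted, every further call succeeds.
        step = self._script.pop(0) if self._script else "ok"
        if isinstance(step, Exception):
            raise step
        return f"{self.name}: {prompt}"


async def handled_by(providers, prompt):
    # Returns the name of the provider that answered, or "degraded" if none did.
    for provider in providers:
        try:
            await provider.call(prompt, {}, 1.0)
            return provider.name
        except Exception:
            continue
    return "degraded"


if __name__ == "__main__":
    local = ScriptedProvider("local", [TimeoutError("simulated timeout")])
    cloud = ScriptedProvider("cloud", [])
    print(asyncio.run(handled_by([local, cloud], "hello")))
    print(asyncio.run(handled_by([local, cloud], "hello")))
```

Swapping the script for other exception types lets you observe how each failure class moves requests between the local model, the cloud provider, and the degraded fallback.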