09. Model Routing

Chapter 9 of 18 · 20 min

KEY INSIGHT

Model routing extends the gateway concept by dynamically selecting a suitable model based on request characteristics, not just the model identifier the client supplied. This enables cost optimization, latency reduction, and load distribution across heterogeneous hardware.

### Routing Criteria

Model selection can weigh several factors: request complexity, latency requirements, cost constraints, and hardware availability. A simple request might route to a smaller, faster model, while a complex analysis request routes to a larger, more capable one.

```python
class RoutingStrategy:
    def select_model(self, request: CompletionRequest) -> str:
        raise NotImplementedError


class LatencyRouting(RoutingStrategy):
    def select_model(self, request: CompletionRequest) -> str:
        # Route to the model with the lowest average latency
        return min(
            MODEL_REGISTRY.items(),
            key=lambda item: item[1].avg_latency,
        )[0]


class CostRouting(RoutingStrategy):
    def select_model(self, request: CompletionRequest) -> str:
        # Route to the cheapest model that meets the capability requirement.
        # Assumes MODEL_REGISTRY is ordered from cheapest to most expensive;
        # otherwise the first match is not necessarily the cheapest.
        # _estimate_required_score is not shown here and must be supplied.
        required_score = self._estimate_required_score(request)
        for model_name, config in MODEL_REGISTRY.items():
            if config.capability_score >= required_score:
                return model_name
        return "default-model"
```

### Request Classification

```python
def classify_request(request: CompletionRequest) -> str:
    # Estimate complexity from message count and content length.
    # len(m.content) counts characters, a rough proxy for token count.
    total_chars = sum(len(m.content) for m in request.messages)
    if len(request.messages) <= 2 and total_chars < 200:
        return "simple"
    elif len(request.messages) <= 5 and total_chars < 1000:
        return "medium"
    return "complex"
```

This heuristic estimates request complexity without making an inference call. It measures content length in characters, which only approximates the token count. More sophisticated approaches use a lightweight classifier or maintain historical performance data.

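One way to maintain historical performance data is to keep a running average of observed latency per model. The sketch below is a minimal illustration and not part of the gateway code: `LatencyTracker`, its `alpha` smoothing factor, and the model names are hypothetical. It stores an exponential moving average (EMA) per model, which could feed a field such as `avg_latency` in a latency-based strategy.

```python
from typing import Dict, Optional


class LatencyTracker:
    """Hypothetical helper: tracks an EMA of observed latency per model."""

    def __init__(self, alpha: float = 0.2) -> None:
        # alpha is the weight given to the newest sample (0 < alpha <= 1)
        if not 0 < alpha <= 1:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self._avg: Dict[str, float] = {}

    def record(self, model: str, latency_s: float) -> None:
        # The first sample seeds the average; later samples are blended in
        prev = self._avg.get(model)
        if prev is None:
            self._avg[model] = latency_s
        else:
            self._avg[model] = self.alpha * latency_s + (1 - self.alpha) * prev

    def average(self, model: str) -> Optional[float]:
        return self._avg.get(model)

    def fastest(self) -> Optional[str]:
        # Model with the lowest average latency, or None if nothing is recorded
        if not self._avg:
            return None
        return min(self._avg, key=lambda name: self._avg[name])


tracker = LatencyTracker(alpha=0.5)
tracker.record("small-model", 0.2)
tracker.record("small-model", 0.4)  # blended: 0.5 * 0.4 + 0.5 * 0.2
tracker.record("large-model", 1.0)
print(tracker.fastest())
```

A larger `alpha` reacts faster to recent slowdowns but is noisier; a smaller one smooths out spikes at the cost of reacting late.
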
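### Dynamic Routing Implementation

```python
class DynamicRouter:
    def __init__(self):
        # Map each complexity class to a target model
        self.strategies = {
            "simple": "llama3.2-tiny",
            "medium": "llama3.2",
            "complex": "llama3.2-large",
        }

    async def route(self, request: CompletionRequest) -> dict:
        classification = classify_request(request)
        target_model = self.strategies.get(classification, "llama3.2")

        # Override if the client explicitly requests a model
        if request.model not in ("auto", "dynamic"):
            target_model = request.model

        return await self.forward_to_model(target_model, request)
```

The router classifies incoming requests and selects the target model. A request whose `model` field is anything other than `"auto"` or `"dynamic"` bypasses the automatic selection and goes to the named model, preserving user intent. `forward_to_model` is not shown here; it is assumed to send the request to the chosen backend and return its response.

### Fallback Chains

When the primary model is unavailable or times out, route to a fallback model.

```python
import asyncio

from fastapi import HTTPException  # assumes the gateway is built on FastAPI


# Defined as a method of DynamicRouter so that self.forward_to_model resolves
async def route_with_fallback(self, request: CompletionRequest) -> dict:
    # Try models in order of preference, most capable first
    models = ["llama3.2-large", "llama3.2", "llama3.2-tiny"]
    for model in models:
        try:
            return await self.forward_to_model(model, request, timeout=10)
        except asyncio.TimeoutError:  # alias of TimeoutError on Python 3.11+
            # Timed out: try the next model in the chain
            continue
    raise HTTPException(status_code=503, detail="All models unavailable")
```

Fallback chains keep the service available when specific models are overloaded or offline. As written, the chain catches only timeouts; connection failures and other errors raised by `forward_to_model` need their own `except` clauses before they trigger a fallback.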
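
To watch a fallback chain behave end to end, the sketch below swaps the real backends for mock coroutines. Everything in it is an assumption made for illustration: `mock_forward`, the `DELAYS` table, and the short timeout are hypothetical stand-ins, and `asyncio.wait_for` enforces the per-model deadline in place of whatever timeout mechanism `forward_to_model` uses.

```python
import asyncio
from typing import Dict, List

# Hypothetical simulated response times (seconds) for each mock backend
DELAYS: Dict[str, float] = {
    "llama3.2-large": 0.5,  # slower than the timeout used below
    "llama3.2": 0.01,
    "llama3.2-tiny": 0.01,
}


async def mock_forward(model: str, prompt: str) -> dict:
    # Stand-in for a real backend call
    await asyncio.sleep(DELAYS[model])
    return {"model": model, "output": f"echo: {prompt}"}


async def route_with_fallback(prompt: str, models: List[str], timeout: float) -> dict:
    for model in models:
        try:
            # wait_for cancels the call and raises a timeout error
            # once `timeout` seconds have elapsed
            return await asyncio.wait_for(mock_forward(model, prompt), timeout)
        except asyncio.TimeoutError:
            continue  # try the next model in the chain
    raise RuntimeError("All models unavailable")


chain = ["llama3.2-large", "llama3.2", "llama3.2-tiny"]
result = asyncio.run(route_with_fallback("hello", chain, timeout=0.1))
print(result["model"])
```

With these delays the first model exceeds the deadline and the second one serves the request. A real gateway would return an HTTP 503 rather than raise `RuntimeError` when the chain is exhausted.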

EXERCISE

Implement a dynamic router that classifies requests by token count and routes to different mock endpoints. Test the router with requests of varying sizes and verify that each reaches the appropriate target.