RUNLOCALAIv38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
Local AI APIs and Integration

09. Model Routing

Chapter 9 of 18 · 20 min
KEY INSIGHT

Model routing extends the gateway concept by dynamically selecting the optimal model based on request characteristics, not just the model identifier. This enables cost optimization, latency reduction, and load distribution across heterogeneous hardware.

### Routing Criteria

Model selection can consider multiple factors: request complexity, latency requirements, cost constraints, and hardware availability. A simple request might route to a smaller, faster model, while a complex analysis request routes to a larger, more capable model.

```python
# CompletionRequest and MODEL_REGISTRY are assumed to be defined elsewhere
# (for example, in the gateway code this router plugs into).

class RoutingStrategy:
    """Base class: pick a model name for a request."""

    def select_model(self, request: CompletionRequest) -> str:
        raise NotImplementedError


class LatencyRouting(RoutingStrategy):
    def select_model(self, request: CompletionRequest) -> str:
        # Route to the model with the lowest recorded average latency
        return min(
            MODEL_REGISTRY.items(),
            key=lambda item: item[1].avg_latency,
        )[0]


class CostRouting(RoutingStrategy):
    def select_model(self, request: CompletionRequest) -> str:
        # Route to the cheapest model that meets requirements.
        # This only holds if MODEL_REGISTRY is ordered from cheapest to most
        # expensive; _estimate_required_score is left for the reader to supply.
        required_score = self._estimate_required_score(request)
        for model_name, config in MODEL_REGISTRY.items():
            if config.capability_score >= required_score:
                return model_name
        return "default-model"
```

### Request Classification

```python
def classify_request(request: CompletionRequest) -> str:
    # Estimate complexity from message count and content length.
    # Character count stands in for token count; no tokenizer is run.
    total_chars = sum(len(m.content) for m in request.messages)

    if len(request.messages) <= 2 and total_chars < 200:
        return "simple"
    elif len(request.messages) <= 5 and total_chars < 1000:
        return "medium"
    return "complex"
```

This heuristic estimates request complexity without making an inference call. It measures characters rather than true tokens, so the thresholds are rough and should be tuned for the tokenizer in use. More sophisticated approaches use a lightweight classifier or maintain historical performance data.
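One way to maintain historical performance data is an exponential moving average (EMA) of observed latency per model. The sketch below is a minimal illustration, not part of the chapter's gateway code: `LatencyTracker`, its `alpha` default, and the `record` and `fastest` methods are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class LatencyTracker:
    """Keeps an exponential moving average (EMA) of latency per model.

    Hypothetical helper for illustration only.
    """

    alpha: float = 0.2  # weight given to the newest sample (assumed default)
    averages: dict[str, float] = field(default_factory=dict)

    def record(self, model: str, latency_s: float) -> None:
        """Fold one observed latency (in seconds) into the model's average."""
        previous = self.averages.get(model)
        if previous is None:
            # First sample for this model: use it as the starting average
            self.averages[model] = latency_s
        else:
            self.averages[model] = (
                self.alpha * latency_s + (1 - self.alpha) * previous
            )

    def fastest(self, candidates: list[str]) -> str:
        """Return the candidate with the lowest average latency.

        Models with no samples yet are treated as infinitely slow.
        """
        return min(candidates, key=lambda m: self.averages.get(m, float("inf")))
```

A router could call `record` after each completed request and use `fastest` in place of a static `avg_latency` field. A larger `alpha` reacts faster to load spikes but is noisier; the right value depends on traffic and is not something the chapter specifies.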
### Dynamic Routing Implementation

```python
class DynamicRouter:
    def __init__(self):
        # Maps a request classification to a target model name
        self.strategies = {
            "simple": "llama3.2-tiny",
            "medium": "llama3.2",
            "complex": "llama3.2-large",
        }

    async def route(self, request: CompletionRequest) -> dict:
        classification = classify_request(request)
        target_model = self.strategies.get(classification, "llama3.2")

        # Override if the client explicitly requests a model
        if request.model not in ("auto", "dynamic"):
            target_model = request.model

        # forward_to_model is assumed to be implemented elsewhere on this class
        return await self.forward_to_model(target_model, request)
```

The router classifies incoming requests and selects the target model. An explicit model name in the request overrides the automatic selection, preserving user intent; only the placeholder names `auto` and `dynamic` trigger classification.

### Fallback Chains

When the primary model is unavailable or times out, route to a fallback model.

```python
from fastapi import HTTPException  # assumes the gateway is a FastAPI app


# Add as a method on DynamicRouter
async def route_with_fallback(self, request: CompletionRequest) -> dict:
    # Ordered from most to least capable
    models = ["llama3.2-large", "llama3.2", "llama3.2-tiny"]

    for model in models:
        try:
            return await self.forward_to_model(model, request, timeout=10)
        except (TimeoutError, ConnectionError):
            # Model timed out or is unreachable: try the next one
            continue

    raise HTTPException(status_code=503, detail="All models unavailable")
```

Fallback chains improve availability when specific models are overloaded or offline. The request fails with a 503 only after every model in the chain has been tried.
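The fallback pattern can be tried out without a running model server. The following is a self-contained sketch under stated assumptions: `call_backend` and `first_responsive` are hypothetical names, the backend is simulated with `asyncio.sleep`, and the timeout is enforced with `asyncio.wait_for` rather than a `timeout` argument on the forwarding call.

```python
import asyncio


async def call_backend(model: str, delay_s: float) -> dict:
    """Stand-in for a real inference call; sleeps instead of generating."""
    await asyncio.sleep(delay_s)
    return {"model": model, "text": "ok"}


async def first_responsive(delays: dict[str, float], timeout_s: float) -> dict:
    """Try models in dict order; return the first reply that beats the timeout.

    `delays` maps each model name to its simulated response time in seconds.
    """
    for model, delay_s in delays.items():
        try:
            return await asyncio.wait_for(
                call_backend(model, delay_s), timeout=timeout_s
            )
        except asyncio.TimeoutError:
            # This model was too slow: move on to the next one in the chain
            continue
    raise RuntimeError("All models unavailable")


if __name__ == "__main__":
    # The large model is simulated as slow, so the chain falls through to llama3.2
    simulated = {"llama3.2-large": 0.5, "llama3.2": 0.0, "llama3.2-tiny": 0.0}
    reply = asyncio.run(first_responsive(simulated, timeout_s=0.05))
    print(reply["model"])
```

Trying models one after another means the worst-case latency is the sum of all timeouts in the chain, so a short per-model timeout may be preferable when the chain is long.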

EXERCISE

Implement a dynamic router that classifies requests by token count and routes to different mock endpoints. Test the router with requests of varying sizes and verify that each reaches the appropriate target.
