RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Hybrid Local-Cloud AI Architecture

04. Model Router Architecture

Chapter 4 of 18 · 15 min
KEY INSIGHT

The model router's job is policy enforcement and backend coordination, not inference. Keeping this central authority separate from inference execution lets routing rules and backends change independently as requirements evolve.

The model router serves as the central orchestration component in a hybrid AI architecture. This service receives inference requests, applies routing policies, delegates to the selected backend, and returns the response. Architectural decisions for this component determine the system's scalability, reliability, and extensibility.

A service-oriented architecture positions the router as an independent microservice. It forwards requests to backend inference services over internal network calls. This separation lets routing logic and inference capacity scale independently. The router also handles protocol translation, adapting between the protocol clients speak and the API each backend exposes.

Synchronous request handling suits interactive use cases where the caller expects the response on the same connection. The router waits for the backend to finish before replying. This simplicity comes at the cost of holding connections open for the duration of inference. Timeout configuration becomes critical because backend latency passes directly through to the client.
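The timeout behavior described above can be sketched in a few lines of bash. This is a minimal illustration, not a router implementation: `forward_sync` and its status strings are hypothetical names, and it assumes GNU coreutils `timeout` is installed. The backend call is passed in as a command so the deadline logic can be tried without a live backend.

```bash
#!/bin/bash
# Sketch: synchronous forwarding with a hard deadline.
# forward_sync and the status strings are hypothetical, for illustration only.
# Assumes GNU coreutils `timeout`, which exits 124 when it kills the command.

# forward_sync <deadline_seconds> <command...>
# Blocks until the backend command finishes or the deadline passes.
forward_sync() {
  local deadline="$1"; shift
  local body rc
  body=$(timeout "$deadline" "$@")
  rc=$?
  if (( rc == 124 )); then
    echo "status=504 error=backend_timeout"
    return 1
  elif (( rc != 0 )); then
    echo "status=502 error=backend_failed"
    return 1
  fi
  echo "status=200 body=${body}"
}

# Stand-ins for a real backend call (for example a curl request).
forward_sync 2 echo pong          # finishes well inside the deadline
forward_sync 1 sleep 3 || true    # exceeds the deadline and is cut off
```

The caller is held for at most the deadline, which is the trade-off the synchronous model makes: the connection stays open, but the wait is bounded.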

Asynchronous patterns decouple routing from response delivery. Clients submit a request and receive an identifier; results arrive later through polling or webhook callbacks. This architecture accommodates long-running inference tasks without holding a connection open for each one. Partial failures become recoverable because the router owns the retry lifecycle.
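A submit-and-poll flow with router-owned retries might look like the sketch below. Everything here is an assumption made for illustration: the job directory layout, the function names (`submit_job`, `poll_job`, `run_job`, `set_status`), and the `MAX_ATTEMPTS` value. A real router would more likely use a queue or database than a temporary directory.

```bash
#!/bin/bash
# Sketch: asynchronous submit/poll where the router owns the retry loop.
# All names and the on-disk layout are hypothetical.

JOB_DIR=$(mktemp -d)
trap 'rm -rf "$JOB_DIR"' EXIT
MAX_ATTEMPTS=3

# set_status <job_id> <status>: write via rename so pollers never read a partial file.
set_status() {
  local dir="$JOB_DIR/$1"
  echo "$2" > "$dir/status.tmp" && mv -f "$dir/status.tmp" "$dir/status"
}

# run_job <job_id> <command...>: retry the backend call, then record the outcome.
run_job() {
  local id="$1"; shift
  local attempt
  for (( attempt = 1; attempt <= MAX_ATTEMPTS; attempt++ )); do
    if "$@" > "$JOB_DIR/$id/result" 2>/dev/null; then
      set_status "$id" done
      return 0
    fi
  done
  set_status "$id" failed
  return 1
}

# submit_job <command...>: start the work in the background and return an ID at once.
submit_job() {
  local dir id
  dir=$(mktemp -d "$JOB_DIR/job-XXXXXX") || return 1
  id="${dir##*/}"
  set_status "$id" pending
  # Detach stdout/stderr so callers using $(submit_job ...) are not blocked.
  run_job "$id" "$@" > /dev/null 2>&1 &
  echo "$id"
}

# poll_job <job_id>: print the status, plus the result once it is ready.
poll_job() {
  local id="$1" status
  status=$(cat "$JOB_DIR/$id/status" 2>/dev/null) || { echo "unknown"; return 1; }
  echo "$status"
  [[ "$status" == "done" ]] && cat "$JOB_DIR/$id/result"
  return 0
}

id=$(submit_job echo "hello from backend")
echo "submitted $id"
sleep 1
poll_job "$id"
```

The client only ever sees the identifier and the status; the retries inside `run_job` are invisible to it, which is what makes a transient backend failure recoverable.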

Health checking keeps the router aware of which backends are available. Periodic probes measure response latency and correctness. Aggregating several probes, for example with a confidence interval over recent latencies, filters measurement noise so that one slow probe does not eject a backend. Excluding unhealthy backends keeps requests away from failing services. Recovery verification confirms that a backend is healthy again before it rejoins the pool.

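#!/bin/bash
# Health check script for backend model services.
# Arguments: [URL] [MAX_LATENCY_MS]
# The default URL assumes an Ollama-style backend, where GET /api/tags returns 200.
# Point it at any endpoint that answers a plain GET with a 2xx status.
BACKEND_URL="${1:-http://localhost:11434/api/tags}"
EXPECTED_LATENCY_MS="${2:-500}"

# Capture "<http_code> <seconds>". --max-time keeps a hung backend from stalling the probe.
if ! response=$(curl -s -o /dev/null --max-time 10 \
    -w '%{http_code} %{time_total}' "${BACKEND_URL}"); then
  echo "STATUS=unreachable"
  exit 2
fi

read -r http_code latency <<< "$response"
# Convert seconds to whole milliseconds.
latency_ms=$(awk -v s="$latency" 'BEGIN { printf "%d", s * 1000 }')

if [[ "$http_code" != 2* ]]; then
  echo "STATUS=unhealthy,HTTP_CODE=${http_code},LATENCY_MS=${latency_ms}"
  exit 2
elif (( latency_ms < EXPECTED_LATENCY_MS )); then
  echo "STATUS=healthy,LATENCY_MS=${latency_ms}"
  exit 0
else
  echo "STATUS=degraded,LATENCY_MS=${latency_ms}"
  exit 1
fi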
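
Exclusion and recovery verification can be layered on top of individual probe results. The sketch below uses consecutive-streak thresholds, a simpler stand-in for the statistical filtering described above rather than a confidence-interval calculation. The thresholds, the backend name `gpu-local`, and the function names are all hypothetical.

```bash
#!/bin/bash
# Sketch: eject a backend after repeated failed probes and readmit it only
# after repeated successes. Thresholds and names are hypothetical.

FAIL_THRESHOLD=3      # consecutive failures before ejection
RECOVER_THRESHOLD=2   # consecutive successes before rejoining the pool

declare -A STATE FAILS OKS   # per-backend state and streak counters

# record_probe <backend> <ok|fail>: update streaks and flip state at thresholds.
record_probe() {
  local b="$1" result="$2"
  [[ -n "${STATE[$b]:-}" ]] || STATE[$b]=healthy
  if [[ "$result" == ok ]]; then
    FAILS[$b]=0
    OKS[$b]=$(( ${OKS[$b]:-0} + 1 ))
    if [[ "${STATE[$b]}" == unhealthy ]] && (( ${OKS[$b]} >= RECOVER_THRESHOLD )); then
      STATE[$b]=healthy
    fi
  else
    OKS[$b]=0
    FAILS[$b]=$(( ${FAILS[$b]:-0} + 1 ))
    if (( ${FAILS[$b]} >= FAIL_THRESHOLD )); then
      STATE[$b]=unhealthy
    fi
  fi
}

# healthy_backends: list the backends the router may still send traffic to.
healthy_backends() {
  local b
  for b in "${!STATE[@]}"; do
    [[ "${STATE[$b]}" == healthy ]] && echo "$b"
  done | sort
}

for result in fail fail fail ok ok; do
  record_probe gpu-local "$result"
  echo "probe=$result state=${STATE[gpu-local]}"
done
```

In this run the backend stays healthy through two failures, is ejected on the third, and rejoins only after the second consecutive success. Each call to `record_probe` could be fed by the exit code of a probe script like the one above.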

Distributing load across homogeneous backends requires choosing an allocation strategy. Round-robin distributes requests evenly but ignores differences in request size. Least-connections sends each request to the backend with the fewest in-flight requests, which tracks outstanding work more closely. Token bucket rate limiting caps the request rate sent to each backend so it stays within capacity. Weighted allocation extends these schemes to heterogeneous backends with differing capability.
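Two of these strategies can be compared side by side in a short sketch: plain round-robin, and least-connections extended with weights. The backend names `gpu-a` and `gpu-b`, their weights, and the function names are assumptions for illustration. The selectors set a global `PICKED` rather than printing, so that counters survive between calls.

```bash
#!/bin/bash
# Sketch: round-robin and weighted least-connections selection.
# Backend names, weights, and function names are hypothetical.

BACKENDS=(gpu-a gpu-b)                      # fixed order keeps tie-breaking deterministic
declare -A INFLIGHT=( [gpu-a]=0 [gpu-b]=0 ) # outstanding requests per backend
declare -A WEIGHT=( [gpu-a]=3 [gpu-b]=1 )   # relative capacity per backend
RR_INDEX=0
PICKED=""

# Round-robin: cycle through backends regardless of current load.
pick_round_robin() {
  PICKED="${BACKENDS[RR_INDEX]}"
  RR_INDEX=$(( (RR_INDEX + 1) % ${#BACKENDS[@]} ))
}

# Weighted least-connections: choose the lowest inflight/weight ratio.
# Ratios are compared by cross-multiplying so the math stays in integers.
pick_weighted_least_conn() {
  local b best=""
  for b in "${BACKENDS[@]}"; do
    if [[ -z "$best" ]] ||
       (( ${INFLIGHT[$b]} * ${WEIGHT[$best]} < ${INFLIGHT[$best]} * ${WEIGHT[$b]} )); then
      best="$b"
    fi
  done
  PICKED="$best"
}

begin_request() { INFLIGHT[$1]=$(( ${INFLIGHT[$1]} + 1 )); }
end_request()   { INFLIGHT[$1]=$(( ${INFLIGHT[$1]} - 1 )); }

for i in 1 2 3 4; do
  pick_weighted_least_conn
  begin_request "$PICKED"
  echo "request $i -> $PICKED (in flight: gpu-a=${INFLIGHT[gpu-a]} gpu-b=${INFLIGHT[gpu-b]})"
done
```

With no requests completing, the weighted selector sends three of the first four requests to `gpu-a`, matching its 3:1 weight, whereas round-robin would split them evenly.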

Graceful degradation preserves service continuity when backends fail or slow down. Queue-based buffering absorbs traffic spikes while backends recover. Degraded mode shifts traffic toward reduced-capability backends that remain available. When no backend can serve a request, informative error responses tell clients when and how to retry.
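The three mechanisms can be combined into one decision path: try tiers in preference order, buffer when nothing is available, and reject with a retry hint once the buffer is full. The sketch below assumes hypothetical tier names, a tiny queue limit, and a made-up response format. It only models the decision and does not perform inference.

```bash
#!/bin/bash
# Sketch: fallback tiers, bounded buffering, and an informative rejection.
# Tier names, limits, and the response format are hypothetical.

TIERS=(primary-70b fallback-8b)                        # preference order
declare -A AVAILABLE=( [primary-70b]=0 [fallback-8b]=1 )
QUEUE=()
MAX_QUEUE=2
RETRY_AFTER_S=5

# handle_request <request_id>: print which path the request took.
handle_request() {
  local req="$1" tier
  for tier in "${TIERS[@]}"; do
    if (( ${AVAILABLE[$tier]} )); then
      if [[ "$tier" == "${TIERS[0]}" ]]; then
        echo "served req=$req backend=$tier"
      else
        echo "served req=$req backend=$tier degraded=true"
      fi
      return 0
    fi
  done
  # Nothing is available: buffer while there is room.
  if (( ${#QUEUE[@]} < MAX_QUEUE )); then
    QUEUE+=("$req")
    echo "queued req=$req position=${#QUEUE[@]}"
    return 0
  fi
  # Buffer is full: tell the client when to come back.
  echo "rejected req=$req status=503 retry_after_s=$RETRY_AFTER_S"
  return 1
}

handle_request r1                 # primary is down, fallback serves it
AVAILABLE[fallback-8b]=0          # now every tier is down
handle_request r2
handle_request r3
handle_request r4 || true         # queue is full, so the client gets a retry hint
```

Marking the degraded response explicitly lets clients decide whether a reduced-capability answer is acceptable or whether they would rather retry later.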

EXERCISE

Design a model router deployment topology for three geographically distributed inference clusters. Address how routing state synchronizes across regions and how client traffic reaches the nearest router instance.

← Chapter 3: Rule-Based Routing
Chapter 5: Cost-Aware Selection →