Model Router Architecture — Hybrid Local-Cloud AI Architecture (Chapter 4)

The model router serves as the central orchestration component in hybrid AI architecture. This service receives inference requests, applies routing policies, delegates to selected backends, and returns responses. Architectural decisions for this component determine system scalability, reliability, and extensibility.

Service-oriented architecture positions the router as an independent microservice. Request forwarding happens via internal network calls to backend inference services. This separation enables independent scaling of routing logic and inference capacity. Protocol translation adapts between client protocols and backend capabilities.

Synchronous request handling suits interactive use cases where responses return immediately. The router waits for backend completion before replying to callers. This simplicity comes at the cost of holding connections open during inference. Timeout configuration becomes critical because backend latency directly impacts client experience.

Asynchronous patterns decouple routing from response delivery. Clients submit requests and receive identifiers. Polling or webhook callbacks deliver results later. This architecture accommodates long-running inference tasks without connection management complexity. Partial failures become recoverable because the router owns the retry lifecycle.

Health checking maintains backend availability awareness. Periodic probes measure response latency and correctness. Statistical confidence intervals filter measurement noise. Unhealthy backend exclusion prevents request routing toward failing services. Recovery verification confirms restoration before rejoining the healthy pool.

#!/bin/bash
# Health check script for backend model services
BACKEND_URL="${1:-http://localhost:11434/api/embeddings}"
EXPECTED_LATENCY_MS="${2:-500}"

response=$(curl -s -w "\n%{time_total}" "${BACKEND_URL}" \
  -o /dev/null 2>&1)
  
latency=$(echo "$response" | tail -1)
latency_ms=$(echo "$latency * 1000" | bc)

if (( $(echo "$latency_ms < $EXPECTED_LATENCY_MS" | bc -l) )); then
  echo "STATUS=healthy,LATENCY_MS=${latency_ms}"
  exit 0
else
  echo "STATUS=degraded,LATENCY_MS=${latency_ms}"
  exit 1
fi

Load distribution across homogeneous backends requires careful allocation. Round-robin distributes evenly but ignores varying request sizes. Least-connections tracks outstanding work better. Token bucket rate limiting respects backend capacity limits directly. Weighted allocation accommodates heterogeneous backends with varying capability.

Graceful degradation preserves service continuity during backend degradation. Queue-based buffering absorbs traffic spikes while backends recover. Degraded mode pivots toward reduced-capability backends that remain available. Informative error responses guide clients toward retry strategies.