19. High Availability

Chapter 19 of 24 · 20 min

KEY INSIGHT

High availability is not achieved by simply running multiple replicas; each component (networking, storage, and compute) must have redundant paths with automatic failover.

### Multi-AZ Model Server Deployment

```yaml
# inference-ha-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-server
  # Add the new pod before removing an old one, so capacity never drops.
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      # Spread replicas evenly across availability zones.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: inference-server
      # Within a zone, prefer placing replicas on different nodes.
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: inference-server
                topologyKey: kubernetes.io/hostname
      containers:
        - name: inference
          image: registry.internal/inference-server:v2.1.0
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
          # Gates traffic: the pod receives requests only once /health passes.
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          # Restarts the container if /health keeps failing.
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
          env:
            - name: MODEL_NAME
              value: "production-model"
            - name: GPU_DEVICE_IDS
              value: "0"
```

### Redis HA for Request Caching

```yaml
# redis-ha.yaml
# Adjust apiVersion to match the CRD group and version of the Redis
# operator installed in your cluster.
apiVersion: redis.redis.redis.com/v1
kind: RedisCluster
metadata:
  name: inference-cache
  namespace: production
spec:
  clusterSize: 3
  persistence:
    enabled: false
  kubernetesConfig:
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 1000m
        memory: 2Gi
  tls:
    enabled: true
    secretName: redis-tls-cert
```

### Database Connection Pooling for Model Metadata

```python
# database_pool.py
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# Note: pool_mode=transaction is a PgBouncer server-side setting, not a
# PostgreSQL parameter, so it is configured in PgBouncer rather than passed
# through connect_args.

# Primary engine for writes. pool_pre_ping tests each connection on checkout,
# so connections broken by a failover are discarded and replaced.
engine = create_engine(
    "postgresql://user:pass@pg-primary:5432/inference",
    poolclass=QueuePool,
    pool_size=20,
    max_overflow=10,
    pool_pre_ping=True,
)

# Replica engine for reads, created once so its pool is reused across calls.
read_engine = create_engine(
    "postgresql://user:pass@pg-replica:5432/inference",
    poolclass=QueuePool,
    pool_size=10,
    pool_pre_ping=True,
)


def get_read_node():
    """Route read queries to the replica."""
    return read_engine
```
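The read/write split above routes reads to a replica but does not, by itself, recover when that replica is unreachable. One possible way to add a fallback path is sketched below using only the standard library. The helper `run_with_fallback` and the two query functions are hypothetical names introduced for illustration; with SQLAlchemy, the retriable exception would more likely be `sqlalchemy.exc.OperationalError`.

```python
from typing import Callable, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")


def run_with_fallback(
    attempts: Sequence[Callable[[], T]],
    retriable: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> T:
    """Call each attempt in order and return the first successful result."""
    last_error = None
    for attempt in attempts:
        try:
            return attempt()
        except retriable as exc:
            # Remember the failure and move on to the next endpoint.
            last_error = exc
    raise RuntimeError("all endpoints failed") from last_error


# Hypothetical stand-ins for queries against the replica and the primary.
def query_replica() -> str:
    raise ConnectionError("pg-replica unreachable")


def query_primary() -> str:
    return "rows from primary"


result = run_with_fallback([query_replica, query_primary])
print(result)
```

Falling back to the primary keeps reads available at the cost of extra load on the primary, so this pattern suits short replica outages rather than sustained ones.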

Production inference serving requires resilience against component failures without service degradation. High availability architecture eliminates single points of failure across the entire serving stack: load balancers, model servers, and storage backends.
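To see why a single non-redundant tier dominates the availability of the whole stack, the following sketch combines tiers in series and replicas in parallel. It assumes replica failures are independent, which real deployments only approximate, and the per-instance availability figures are hypothetical values chosen for illustration.

```python
def parallel_availability(instance_availability: float, replicas: int) -> float:
    """Availability of a tier that is up if at least one replica is up."""
    return 1.0 - (1.0 - instance_availability) ** replicas


def serial_availability(*tiers: float) -> float:
    """Availability of a request path that needs every tier to be up."""
    result = 1.0
    for tier in tiers:
        result *= tier
    return result


# Hypothetical per-instance availabilities, for illustration only.
load_balancers = parallel_availability(0.999, 2)
model_servers = parallel_availability(0.99, 3)
storage_single = 0.995
storage_redundant = parallel_availability(0.995, 2)

without_ha_storage = serial_availability(load_balancers, model_servers, storage_single)
with_ha_storage = serial_availability(load_balancers, model_servers, storage_redundant)

print(f"single storage backend:    {without_ha_storage:.6f}")
print(f"redundant storage backend: {with_ha_storage:.6f}")
```

Under these assumptions, the stack with a single storage backend can never exceed that backend's own availability, no matter how many model server replicas are added.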

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Configure a multi-replica inference deployment in Kubernetes with pod anti-affinity rules that spread replicas across nodes. Set up readiness probes and verify that no traffic is routed to a pod until its readiness probe passes after the startup delay. Simulate node failure by draining a node (cordoning alone only blocks new scheduling and leaves running pods in place) and confirm that the inference pods are rescheduled onto the remaining nodes.
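When checking the result of the exercise, it can help to quantify how evenly the pods are spread. The sketch below computes skew the way `maxSkew` is defined for topology spread constraints: the largest pod count in any domain minus the smallest. The function name `domain_skew` and the pod and node names are hypothetical; in practice the placements would come from the output of `kubectl get pods -o wide`.

```python
from collections import Counter
from typing import Mapping, Sequence


def domain_skew(placements: Mapping[str, str], domains: Sequence[str]) -> int:
    """Return max pod count minus min pod count across the given domains.

    placements maps pod name -> domain (node or zone) it is running in.
    domains lists every schedulable domain, including empty ones.
    """
    counts = Counter(placements.values())
    per_domain = [counts.get(domain, 0) for domain in domains]
    return max(per_domain) - min(per_domain)


# Hypothetical placement before the simulated failure: one pod per node.
before = {"pod-a": "node-1", "pod-b": "node-2", "pod-c": "node-3"}
print(domain_skew(before, ["node-1", "node-2", "node-3"]))

# After node-3 is drained, pod-c is rescheduled onto node-1.
after = {"pod-a": "node-1", "pod-b": "node-2", "pod-c": "node-1"}
print(domain_skew(after, ["node-1", "node-2"]))
```

A skew that stays at or below the configured `maxSkew` after the drain suggests the scheduler redistributed the pods as intended.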