19. High Availability
Production inference serving must tolerate component failures without degrading service. A high-availability architecture eliminates single points of failure across the entire serving stack: load balancers, model servers, and storage backends.
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
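One way to keep such a record is a small helper that captures the environment next to the observed result. The sketch below is only an illustration: the function name `record_run`, its field names, and the sample arguments are hypothetical and not part of any chapter tooling.

```python
import json
import platform
import sys
from importlib import metadata


def record_run(package, data_path, observed_output, notes=""):
    """Collect the details needed to reproduce a local run (hypothetical helper)."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is missing from this environment; record that fact
        # instead of failing, so the note stays honest.
        version = "not installed"
    return {
        "package": package,
        "package_version": version,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "data_path": str(data_path),
        "observed_output": observed_output,
        "notes": notes,  # e.g. model size, CPU/GPU backend, memory limits
    }


if __name__ == "__main__":
    # Sample values are placeholders; substitute the package and data you used.
    record = record_run("numpy", "data/sample.json", "ok", notes="CPU only")
    print(json.dumps(record, indent=2))
```

Saving the printed JSON beside the exercise keeps the constraint notes attached to the result they qualify.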
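Configure a multi-replica inference deployment in Kubernetes with pod anti-affinity rules that spread replicas across nodes. Set up readiness probes and verify that a pod receives no traffic until its readiness probe succeeds, for example while the model is still loading at startup. Test node failure by draining a node (cordoning alone only blocks new scheduling and leaves running pods in place) and confirming that the evicted inference pods are rescheduled onto the remaining nodes.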
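As a starting point for this exercise, the deployment can be sketched as a manifest generated from Python and printed as JSON, which `kubectl apply -f -` accepts on standard input. Everything specific here is an assumption for illustration: the deployment name `inference-server`, the image `example.com/inference:latest`, port 8080, the `/healthz` probe path, and the probe timings should all be replaced with values that match your model server.

```python
import json


def build_inference_deployment(
    name="inference-server",               # hypothetical deployment name
    image="example.com/inference:latest",  # hypothetical container image
    replicas=3,
    port=8080,                             # assumed serving port
):
    """Build an apps/v1 Deployment with pod anti-affinity and a readiness probe."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "affinity": {
                        "podAntiAffinity": {
                            # Hard rule: never place two replicas with the
                            # same label on the same node.
                            "requiredDuringSchedulingIgnoredDuringExecution": [
                                {
                                    "labelSelector": {"matchLabels": labels},
                                    "topologyKey": "kubernetes.io/hostname",
                                }
                            ]
                        }
                    },
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                            # The pod is left out of Service endpoints until
                            # this probe succeeds.
                            "readinessProbe": {
                                "httpGet": {"path": "/healthz", "port": port},
                                "initialDelaySeconds": 10,
                                "periodSeconds": 5,
                                "failureThreshold": 3,
                            },
                        }
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_inference_deployment(), indent=2))
```

Because the anti-affinity rule here is a hard requirement, a replica evicted by `kubectl drain` can only be rescheduled if a node without a replica is available, so the cluster needs at least one more schedulable node than there are replicas. Using `preferredDuringSchedulingIgnoredDuringExecution` instead trades strict spreading for the ability to double up when nodes are scarce.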