RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Production Local AI Deployment

19. High Availability

Chapter 19 of 24 · 20 min
KEY INSIGHT

High availability is not achieved by simply running multiple replicas; each component (networking, storage, and compute) must have redundant paths with automatic failover.

### Multi-AZ Model Server Deployment

```yaml
# inference-ha-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-server
  # Start a new pod before removing an old one, so capacity never drops.
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      # Spread replicas evenly across availability zones.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: inference-server
      # Within a zone, prefer placing replicas on different nodes.
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: inference-server
                topologyKey: kubernetes.io/hostname
      containers:
        - name: inference
          image: registry.internal/inference-server:v2.1.0
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
          # Readiness gates traffic; liveness restarts a hung container.
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
          env:
            - name: MODEL_NAME
              value: "production-model"
            - name: GPU_DEVICE_IDS
              value: "0"
```

### Redis HA for Request Caching

```yaml
# redis-ha.yaml
# Assumption: the OT-Container-Kit redis-operator is installed. Verify the
# API group/version and the field names below against the CRD in your cluster.
apiVersion: redis.redis.opstreelabs.in/v1beta2
kind: RedisCluster
metadata:
  name: inference-cache
  namespace: production
spec:
  clusterSize: 3
  # The cache is disposable, so persistence is turned off.
  persistence:
    enabled: false
  kubernetesConfig:
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 1000m
        memory: 2Gi
  tls:
    enabled: true
    secretName: redis-tls-cert
```

### Database Connection Pooling for Model Metadata

```python
# database_pool.py
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# Primary engine for writes. pool_pre_ping tests each connection on checkout,
# so connections broken by a failover are discarded and replaced.
engine = create_engine(
    "postgresql://user:pass@pg-primary:5432/inference",
    poolclass=QueuePool,
    pool_size=20,
    max_overflow=10,
    pool_pre_ping=True,
)

# Replica engine for reads, created once so its pool is reused across calls.
# pool_mode=transaction is a PgBouncer setting (pgbouncer.ini), not a
# PostgreSQL server parameter, so it is not passed via connect_args here.
read_engine = create_engine(
    "postgresql://user:pass@pg-replica:5432/inference",
    poolclass=QueuePool,
    pool_size=10,
    pool_pre_ping=True,
)


def get_read_node():
    """Route read queries to the replica."""
    return read_engine
```
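The probe intervals in the Deployment manifest determine how long a failed pod can keep receiving traffic. The sketch below is a rough estimate only, assuming the Kubernetes defaults of `failureThreshold: 3` and `timeoutSeconds: 1` (the manifest does not set either) and ignoring the time it takes for endpoint changes to propagate. The function name `detection_window` is illustrative, not part of any library.

```python
def detection_window(period_seconds: int,
                     failure_threshold: int = 3,
                     timeout_seconds: int = 1) -> int:
    """Approximate worst-case seconds before a probe marks a pod as failed.

    Worst case: the pod breaks right after a successful probe, so it takes
    `failure_threshold` further probe periods, plus the timeout of the last
    probe, before the kubelet acts on the failure.
    """
    return failure_threshold * period_seconds + timeout_seconds


# Values taken from the manifest above; thresholds are assumed defaults.
readiness_seconds = detection_window(period_seconds=10)
liveness_seconds = detection_window(period_seconds=30)
print(f"removed from rotation after ~{readiness_seconds}s")
print(f"container restarted after ~{liveness_seconds}s")
```

If those windows are too long for your latency budget, shortening `periodSeconds` or lowering `failureThreshold` narrows them, at the cost of more probe traffic and a higher chance of reacting to a transient blip.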

Production inference serving requires resilience against component failures without service degradation. High availability architecture eliminates single points of failure across the entire serving stack: load balancers, model servers, and storage backends.
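A quick calculation shows why every layer needs redundancy. The sketch below uses the standard reliability formulas for redundant (parallel) and chained (serial) components. It assumes failures are independent, which real systems only approximate, and the availability figures are illustrative rather than measured.

```python
from math import prod
from typing import Iterable


def parallel_availability(single: float, replicas: int) -> float:
    """Availability of N redundant replicas: the service is down only
    when every replica is down at the same time."""
    return 1 - (1 - single) ** replicas


def serial_availability(layers: Iterable[float]) -> float:
    """Availability of a chain of layers: every layer must be up."""
    return prod(layers)


# Illustrative numbers, not measurements.
model_servers = parallel_availability(0.99, replicas=3)
redundant_stack = serial_availability([0.9999, model_servers, 0.9999])
single_storage_stack = serial_availability([0.9999, model_servers, 0.99])

print(f"model servers (3 replicas): {model_servers:.6f}")
print(f"stack with redundant storage: {redundant_stack:.6f}")
print(f"stack with a single storage node: {single_storage_stack:.6f}")
```

Because the layers multiply, the least available layer caps the whole stack: tripling model server replicas does little if the storage backend remains a single node.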

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
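One way to capture those details is a small script that writes them out alongside your results. This is a minimal sketch using only the Python standard library; the function name `environment_record` and the package list are illustrative choices, so substitute whatever packages your exercise actually uses.

```python
import json
import platform
import sys
from importlib import metadata


def environment_record(packages):
    """Collect runtime details worth recording next to an exercise result."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


# Example package names; replace with the ones your exercise depends on.
record = environment_record(["sqlalchemy", "redis"])
print(json.dumps(record, indent=2))
```

Add the data path, backend (CPU or GPU), and observed output to the same record by hand, since those depend on the exercise.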

EXERCISE

Configure a multi-replica inference deployment in Kubernetes with pod anti-affinity rules that spread replicas across nodes. Set up readiness probes and verify that a pod receives no traffic until its readiness probe first succeeds. Simulate node failure by cordoning and then draining a node (cordoning alone only blocks new scheduling; it does not evict running pods), and confirm that the inference pods are rescheduled onto the remaining nodes.
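To check the spread before and after the node test, you can count pods per node from the output of `kubectl get pods -n production -l app=inference-server -o json`. The sketch below parses that JSON shape. The helper names and the sample data are hypothetical, and the skew is computed only over nodes that currently host at least one matching pod.

```python
import json
from collections import Counter


def pods_per_node(pod_list_json: str, app_label: str) -> Counter:
    """Count scheduled pods carrying the given `app` label on each node."""
    counts = Counter()
    for pod in json.loads(pod_list_json).get("items", []):
        labels = pod.get("metadata", {}).get("labels", {})
        node = pod.get("spec", {}).get("nodeName")  # absent while Pending
        if labels.get("app") == app_label and node:
            counts[node] += 1
    return counts


def max_skew(counts: Counter) -> int:
    """Difference between the busiest and least busy occupied node."""
    return max(counts.values()) - min(counts.values()) if counts else 0


# Hypothetical sample standing in for real kubectl output.
sample = json.dumps({"items": [
    {"metadata": {"labels": {"app": "inference-server"}},
     "spec": {"nodeName": "node-a"}},
    {"metadata": {"labels": {"app": "inference-server"}},
     "spec": {"nodeName": "node-b"}},
    {"metadata": {"labels": {"app": "inference-server"}},
     "spec": {"nodeName": "node-c"}},
]})

counts = pods_per_node(sample, "inference-server")
print(dict(counts), "skew:", max_skew(counts))
```

Run it once before draining a node and once after; the drained node should disappear from the counts while the total number of pods stays at the replica count.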
