Load Balancing — Local AI Clusters (Chapter 13)

How requests are distributed across inference endpoints strongly affects latency, throughput, and resource utilization for AI serving workloads.

Ingress Controller Deployment

The NGINX Ingress Controller receives HTTP/HTTPS traffic at the cluster edge and distributes it across backend pods. Install it with Helm:

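# Register the chart repository before installing from it
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

# serviceMonitor.enabled requires the Prometheus Operator CRDs in the cluster
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress \
  --create-namespace \
  --set controller.publishService.enabled=true \
  --set controller.metrics.enabled=true \
  --set controller.metrics.serviceMonitor.enabled=true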

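Configure an Ingress resource to route inference requests. The annotations raise the request body limit and the proxy read timeout, which large uploads and long-running generation requests need: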

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "10G"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
spec:
  ingressClassName: nginx  # IngressClass created by the ingress-nginx chart by default
  rules:
  - host: inference.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: inference-service
            port:
              number: 8080

Service Mesh for Advanced Routing

Istio adds a sidecar-based service mesh that provides fine-grained traffic routing and per-request telemetry for model serving. Install it and enable sidecar injection:

# Deploy the Istio control plane with the CNI plugin enabled.
# istioctl installs directly; the deprecated in-cluster operator is not required.
istioctl install --set components.cni.enabled=true -y

# Enable sidecar injection for inference namespace
kubectl label namespace inference istio-injection=enabled

Istio's VirtualService supports weighted traffic splitting for canary deployments:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: inference-vs
spec:
  hosts:
  - inference.local
  http:
  - route:
    - destination:
        host: inference-v2
        subset: stable
      weight: 90
    - destination:
        host: inference-v3
        subset: canary
      weight: 10

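Shifting traffic weights enables gradual rollout of new model versions. Istio does not roll back on its own: automatic rollback on error-rate thresholds requires a progressive delivery controller such as Flagger or Argo Rollouts, which adjusts the weights based on observed metrics.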
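The VirtualService above refers to the subsets stable and canary, and Istio resolves a subset only if a DestinationRule for the same host declares it. A minimal sketch of those rules follows. The resource names and the version pod labels are assumptions for illustration, not values given in this chapter, so adjust them to the labels your pods actually carry.

```yaml
# Hypothetical DestinationRules declaring the subsets used by inference-vs.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: inference-v2-dr        # assumed name
spec:
  host: inference-v2
  subsets:
  - name: stable
    labels:
      version: v2              # assumed pod label
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: inference-v3-dr        # assumed name
spec:
  host: inference-v3
  subsets:
  - name: canary
    labels:
      version: v3              # assumed pod label
```

Because the two versions already sit behind separate Services, the subset fields could also be dropped from the VirtualService. Subsets are most useful when one Service fronts pods of several versions.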

Health Checks and Failover

Configure readiness probes to exclude unhealthy replicas from load balancing:

apiVersion: v1
kind: Service
metadata:
  name: inference-service
spec:
  selector:
    app: llama-inference
  ports:
  - port: 8080
  sessionAffinity: None
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-inference
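spec:
  replicas: 3
  selector:
    matchLabels:
      app: llama-inference
  template:
    metadata:
      labels:
        app: llama-inference  # must match the Service selector
    spec: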
      containers:
      - name: inference
        image: ghcr.io/gventroulingenAI/llama.cpp:latest
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3
        resources:
          limits:
            nvidia.com/gpu: 1

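Without a readiness probe, Kubernetes adds a pod to the Service endpoints as soon as its containers start, so requests reach pods that are still loading model weights and time out. With the settings above, a replica that starts failing /health is removed from the endpoints after three consecutive failures, roughly 30 seconds.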
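Large models can take far longer than a fixed initial delay to load. One possible approach is a startup probe, which holds back the readiness probe until the server first reports healthy. The fragment below is a sketch written as a strategic merge patch for the llama-inference Deployment. The file name and the 600-second budget are assumptions, and it presumes /health succeeds only once the model is loaded.

```yaml
# startup-patch.yaml (hypothetical file name)
# Apply with: kubectl patch deployment llama-inference --patch-file startup-patch.yaml
spec:
  template:
    spec:
      containers:
      - name: inference
        startupProbe:
          httpGet:
            path: /health
            port: 8080
          periodSeconds: 10
          failureThreshold: 60   # assumed budget: up to 60 x 10s = 600s for loading
```

If the startup probe exhausts its failure threshold, the kubelet restarts the container, so the budget should comfortably exceed the slowest expected model load.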

Connection Draining

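For long-running inference requests, graceful connection draining prevents request failures during updates. The DestinationRule below covers the client side: it caps connections and pending requests per host and uses outlier detection to eject replicas that return repeated 5xx errors. How long a terminating pod may finish in-flight requests is governed separately by its termination grace period.

# Istio DestinationRule: connection pool limits and outlier detection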
cat <<EOF | kubectl apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: inference-dr
spec:
  host: inference-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
EOF
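On the pod side, draining depends on how long Kubernetes and the Istio sidecar wait before shutting down. The following is a minimal sketch as a strategic merge patch for the llama-inference Deployment. The file name, the 600-second window, and the 15-second preStop pause are assumptions, chosen on the premise that no single request runs longer than ten minutes.

```yaml
# drain-patch.yaml (hypothetical file name)
# Apply with: kubectl patch deployment llama-inference --patch-file drain-patch.yaml
spec:
  template:
    metadata:
      annotations:
        # Let the sidecar keep existing connections open while requests finish.
        proxy.istio.io/config: |
          terminationDrainDuration: 600s
    spec:
      # Assumed upper bound on request duration; the pod is killed after this.
      terminationGracePeriodSeconds: 600
      containers:
      - name: inference
        lifecycle:
          preStop:
            exec:
              # Pause so endpoint removal propagates before SIGTERM reaches the server.
              # Assumes the image ships a sleep binary.
              command: ["sleep", "15"]
```

As far as the Istio documentation describes it, the proxy waits for the full drain duration before exiting, so a long value also lengthens every rollout. Pick the smallest window that covers your longest request.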