13. Load Balancing
Load balancing across inference endpoints determines latency, throughput, and resource utilization for AI serving workloads.
Ingress Controller Deployment
NGINX Ingress Controller handles HTTP/HTTPS traffic routing and load distribution:
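# Add the chart repository first; the install command references it
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
# serviceMonitor.enabled requires the Prometheus Operator CRDs to be installed
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress \
  --create-namespace \
  --set controller.publishService.enabled=true \
  --set controller.metrics.enabled=true \
  --set controller.metrics.serviceMonitor.enabled=true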
Configure an Ingress resource to route inference requests:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "10G"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
spec:
  # Assumes the chart's default IngressClass name (nginx); without a class the
  # controller may ignore this Ingress
  ingressClassName: nginx
  rules:
  - host: inference.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: inference-service
            port:
              number: 8080
Service Mesh for Advanced Routing
Istio adds a sidecar-based service mesh on top of the ingress layer, providing weighted routing, per-request telemetry, and outlier detection for model-serving traffic:
# Deploy the Istio control plane. istioctl install does not need the in-cluster
# operator, so a separate "istioctl operator init" step is not required.
istioctl install --set values.cni.enabled=true -y
# Enable sidecar injection for the inference namespace
kubectl label namespace inference istio-injection=enabled
Istio's VirtualService supports weighted traffic splitting for canary deployments:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: inference-vs
spec:
  hosts:
  - inference.local
  http:
  - route:
    - destination:
        host: inference-v2
        subset: stable
      weight: 90
    - destination:
        host: inference-v3
        subset: canary
      weight: 10
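The stable and canary subsets must be defined in a DestinationRule for each destination host; otherwise the routes cannot resolve. Shifting the weights step by step enables gradual rollout of new model versions. Istio does not roll back on its own: automatic rollback on error-rate thresholds requires a progressive-delivery controller such as Flagger or Argo Rollouts driving these weights.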
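The weight shift itself can be scripted. The following is a minimal sketch, not a definitive rollout tool: the function name render_canary_vs is hypothetical, and the host and subset names are copied from the VirtualService example above. It prints a manifest for a given canary percentage and derives the stable weight so the two always sum to 100.

```shell
# Print a VirtualService manifest for a given canary weight (0-100).
# render_canary_vs is a hypothetical helper; hosts and subsets mirror the
# example VirtualService in this section.
render_canary_vs() {
  # Accept only plain decimal integers without leading zeros.
  case "$1" in
    ''|*[!0-9]*|0?*)
      echo "canary weight must be an integer in 0..100" >&2
      return 1
      ;;
  esac
  if [ "$1" -gt 100 ]; then
    echo "canary weight must be an integer in 0..100" >&2
    return 1
  fi
  canary_weight="$1"
  stable_weight=$((100 - canary_weight))
  cat <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: inference-vs
spec:
  hosts:
  - inference.local
  http:
  - route:
    - destination:
        host: inference-v2
        subset: stable
      weight: ${stable_weight}
    - destination:
        host: inference-v3
        subset: canary
      weight: ${canary_weight}
EOF
}
```

To move a quarter of traffic to the canary, pipe the output into kubectl: render_canary_vs 25 | kubectl apply -f -. Rolling back is the same command with a weight of 0.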
Health Checks and Failover
Configure readiness probes to exclude unhealthy replicas from load balancing:
apiVersion: v1
kind: Service
metadata:
  name: inference-service
spec:
  selector:
    app: llama-inference
  ports:
  - port: 8080
  sessionAffinity: None
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-inference
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: llama-inference
  template:
    metadata:
      labels:
        app: llama-inference
    spec:
      containers:
      - name: inference
        image: ghcr.io/gventroulingenAI/llama.cpp:latest
        # The pod is added to the Service endpoints only after /health succeeds
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3
        resources:
          limits:
            nvidia.com/gpu: 1
Without proper health checks, requests route to pods still loading model weights, producing timeouts.
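To see which replicas are currently eligible for traffic, the READY column of kubectl get pods can be summarized. The sketch below assumes the default table output of kubectl get pods (a header row, then one pod per line with READY as the second column); count_ready_pods is a hypothetical helper name.

```shell
# Read `kubectl get pods` table output on stdin and print the number of pods
# whose READY column shows every container ready (for example 1/1).
# The first line is treated as the header and skipped.
count_ready_pods() {
  awk 'NR > 1 {
         split($2, r, "/")
         if (r[1] != "" && r[1] == r[2]) n++
       }
       END { print n + 0 }'
}
```

For example, kubectl get pods -l app=llama-inference | count_ready_pods prints how many replicas have passed their readiness probe. A pod that is still loading model weights shows 0/1 and is not counted.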
Connection Draining
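For long-running inference requests, graceful connection draining prevents request failures during updates. Draining itself depends on the pod's terminationGracePeriodSeconds and the sidecar's drain duration; the DestinationRule below complements it by bounding the connection pool and ejecting replicas that return repeated 5xx errors: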
# Istio DestinationRule with connection pool limits and outlier detection
cat <<EOF | kubectl apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: inference-dr
spec:
  host: inference-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    # Eject a replica for 60s after 5 consecutive 5xx responses
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
EOF
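On the Kubernetes side, draining is usually configured on the pod itself. The sketch below is one possible approach under stated assumptions: render_drain_patch is a hypothetical helper, the container name inference is taken from the Deployment example, and the container image is assumed to ship a sleep binary. It prints a strategic merge patch that sets a termination grace period and a preStop delay, so endpoint removal can propagate before the server receives SIGTERM.

```shell
# Print a strategic merge patch that gives in-flight requests time to finish.
#   $1: terminationGracePeriodSeconds (total time before SIGKILL)
#   $2: preStop sleep in seconds (delay before SIGTERM reaches the container)
# render_drain_patch is a hypothetical helper; it assumes the image has `sleep`.
render_drain_patch() {
  for value in "$1" "$2"; do
    case "$value" in
      ''|*[!0-9]*)
        echo "usage: render_drain_patch <grace_seconds> <prestop_seconds>" >&2
        return 1
        ;;
    esac
  done
  cat <<EOF
spec:
  template:
    spec:
      terminationGracePeriodSeconds: $1
      containers:
      - name: inference
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "$2"]
EOF
}
```

A possible use is render_drain_patch 3700 15 > drain-patch.yaml followed by kubectl patch deployment llama-inference --patch-file drain-patch.yaml. The 3700-second grace period is an illustrative value chosen to exceed the 3600-second proxy read timeout set on the Ingress; the preStop sleep counts against that grace period.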
Deploy two inference replicas behind an NGINX Ingress, configure readiness probes, then deliberately crash one pod and observe the ingress behavior using curl -v on each request.
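For the observation step, it can help to tally response codes over many requests rather than reading each curl -v transcript. The following sketch assumes the ingress address is exported as INGRESS_IP (a placeholder, not defined elsewhere in this section) and that /health is reachable through the Ingress rule for inference.local; probe_once and tally_codes are hypothetical helper names.

```shell
# Send one request through the ingress and print only the HTTP status code.
# INGRESS_IP is a placeholder for the ingress controller's address (assumption).
# curl prints 000 when no HTTP response was received at all.
probe_once() {
  curl -s -o /dev/null -w '%{http_code}\n' --max-time 5 \
    -H 'Host: inference.local' "http://${INGRESS_IP}/health"
}

# Read one status code per line on stdin and print "<code>: <count>" per code.
tally_codes() {
  sort | uniq -c | awk '{ print $2 ": " $1 }'
}
```

Running i=0; while [ "$i" -lt 20 ]; do probe_once; i=$((i + 1)); sleep 1; done | tally_codes while one pod is being crashed shows whether any requests failed during the window before the readiness probe removed the pod from rotation.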