OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Local AI Clusters

13. Load Balancing

Chapter 13 of 18 · 20 min
KEY INSIGHT

Load balancing for inference workloads requires awareness of request duration, model loading times, and GPU memory constraints. Health check configuration directly impacts failure rates, and connection draining becomes essential when pods require graceful shutdown before termination.

Load balancing across inference endpoints determines latency, throughput, and resource utilization for AI serving workloads.

Ingress Controller Deployment

NGINX Ingress Controller handles HTTP/HTTPS traffic routing and load distribution:

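# Add the chart repository first (skip if it is already configured)
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

# serviceMonitor.enabled requires the Prometheus Operator CRDs to be installed
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress \
  --create-namespace \
  --set controller.publishService.enabled=true \
  --set controller.metrics.enabled=true \
  --set controller.metrics.serviceMonitor.enabled=true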

Configure an Ingress resource to route inference requests:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "10G"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
spec:
  ingressClassName: nginx  # bind this Ingress to the NGINX controller
  rules:
  - host: inference.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: inference-service
            port:
              number: 8080

Service Mesh for Advanced Routing

Istio adds mesh-level traffic management to model serving: weighted routing between versions, per-destination connection limits, and request-level telemetry from its sidecar proxies:

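# Deploy the Istio control plane with the CNI plugin enabled.
# istioctl install does not need the in-cluster operator, and
# "istioctl operator init" is deprecated in recent Istio releases.
istioctl install --set components.cni.enabled=true -y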

# Enable sidecar injection for inference namespace
kubectl label namespace inference istio-injection=enabled

Istio's VirtualService supports weighted traffic splitting for canary deployments:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: inference-vs
spec:
  hosts:
  - inference.local
  http:
  - route:
    - destination:
        host: inference-v2
        subset: stable   # requires a DestinationRule for this host that defines the subset
      weight: 90
    - destination:
        host: inference-v3
        subset: canary   # likewise for inference-v3
      weight: 10

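Shifting the weights enables gradual rollout of new model versions. Istio does not change the weights on its own: automatic promotion or rollback on an error-rate threshold requires a progressive-delivery controller such as Flagger or Argo Rollouts, or a script that updates the VirtualService.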
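A weighted split only decides where each request goes; deciding when to shift or revert the weights is a separate step. The following is a minimal Python sketch of both pieces, not an Istio API. The function names `pick_destination` and `next_canary_weight`, the 5% error threshold, and the 10-point step are all assumptions made for illustration.

```python
import random

# Weights copied from the VirtualService above; the dict layout is illustrative.
WEIGHTS = {"inference-v2": 90, "inference-v3": 10}


def pick_destination(weights, rng):
    """Pick one destination with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def next_canary_weight(current, errors, requests, max_error_rate=0.05, step=10):
    """Return the canary weight to apply next (hypothetical policy).

    0 means roll back because the error rate exceeded the threshold.
    Otherwise the weight grows by `step`, capped at 100. With no traffic
    observed yet, the current weight is held.
    """
    if requests == 0:
        return current
    if errors / requests > max_error_rate:
        return 0
    return min(100, current + step)
```

In a real cluster the returned weight would still have to be written back to the VirtualService, which is the part a controller such as Flagger automates.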

Health Checks and Failover

Configure readiness probes to exclude unhealthy replicas from load balancing:

apiVersion: v1
kind: Service
metadata:
  name: inference-service
spec:
  selector:
    app: llama-inference
  ports:
  - port: 8080
  sessionAffinity: None
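---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llama-inference   # required in apps/v1; must match the template labels
  template:
    metadata:
      labels:
        app: llama-inference # matched by the Service selector above
    spec:
      containers:
      - name: inference
        # Substitute the llama.cpp server image you actually deploy
        image: ghcr.io/gventroulingenAI/llama.cpp:latest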
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3
        resources:
          limits:
            nvidia.com/gpu: 1

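Without a readiness probe, Kubernetes marks a pod Ready as soon as its container starts, so the Service routes requests to pods that are still loading model weights and those requests time out or fail.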
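The probe fields above behave like a small state machine: a pod starts out not ready, becomes ready after enough consecutive successes, and is removed from the Service endpoints after `failureThreshold` consecutive failures. The sketch below replays a sequence of probe results to make that visible. It assumes the Kubernetes default `successThreshold` of 1, and the function names are hypothetical helpers, not part of any Kubernetes client.

```python
def readiness_timeline(results, failure_threshold=3, success_threshold=1):
    """Replay probe results (True = probe passed) and return the Ready
    state after each probe."""
    ready = False  # a container is not ready until a probe succeeds
    fails = 0
    succs = 0
    timeline = []
    for ok in results:
        if ok:
            succs += 1
            fails = 0
            if succs >= success_threshold:
                ready = True
        else:
            fails += 1
            succs = 0
            if fails >= failure_threshold:
                ready = False
        timeline.append(ready)
    return timeline


def worst_case_detection_seconds(period_seconds=10, failure_threshold=3):
    """Approximate time a failed replica keeps receiving traffic
    (ignores the per-probe timeout)."""
    return period_seconds * failure_threshold
```

With the values in the Deployment above (`periodSeconds: 10`, `failureThreshold: 3`), a replica that stops answering `/health` can stay in rotation for roughly 30 seconds, which is worth weighing against typical request duration.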

Connection Draining

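For long-running inference requests, graceful connection draining prevents request failures during updates. Draining itself depends on the pod's termination grace period (terminationGracePeriodSeconds, 30 seconds by default), which must exceed the longest expected request. The DestinationRule below complements it by capping connections per replica and ejecting replicas that return repeated 5xx errors:

# Istio DestinationRule: connection limits and outlier detection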
cat <<EOF | kubectl apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: inference-dr
spec:
  host: inference-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
EOF
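On the application side, draining means refusing new work once shutdown begins and letting in-flight requests finish before the process exits. The following is a minimal sketch of that bookkeeping, assuming the server process receives SIGTERM when Kubernetes terminates the pod. `InFlightTracker` and `install_sigterm_handler` are hypothetical names, not part of llama.cpp or Istio.

```python
import signal
import threading


class InFlightTracker:
    """Counts active requests and blocks shutdown until they finish."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active = 0
        self._draining = False

    def try_begin(self):
        """Register a new request; returns False once draining has started."""
        with self._cond:
            if self._draining:
                return False
            self._active += 1
            return True

    def end(self):
        """Mark one request as finished and wake any waiter."""
        with self._cond:
            self._active -= 1
            self._cond.notify_all()

    def start_draining(self):
        """Stop admitting new requests."""
        with self._cond:
            self._draining = True

    def wait_idle(self, timeout):
        """Wait until no requests remain; True if idle, False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._active == 0, timeout=timeout)


def install_sigterm_handler(tracker):
    """Begin draining on SIGTERM (call from the main thread)."""
    signal.signal(signal.SIGTERM, lambda signum, frame: tracker.start_draining())
```

The timeout passed to `wait_idle` should stay below the pod's `terminationGracePeriodSeconds`, since Kubernetes sends SIGKILL once that period expires.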

EXERCISE

Deploy two inference replicas behind an NGINX Ingress, configure readiness probes, then deliberately crash one pod and observe the ingress behavior using curl -v on each request.
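`curl -v` shows one request at a time; tallying many requests makes the failover window easier to see. The sketch below is one possible helper for the exercise, assuming the ingress is reachable at a URL you supply and routes on the `inference.local` host header. `fetch_status` and `failure_rate` are hypothetical names.

```python
import urllib.error
import urllib.request


def fetch_status(url, host_header, timeout=5.0):
    """Return the HTTP status code as a string, or "error" if no
    response arrived (refused, reset, or timed out)."""
    req = urllib.request.Request(url, headers={"Host": host_header})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return str(resp.status)
    except urllib.error.HTTPError as exc:
        return str(exc.code)  # 4xx/5xx still count as a response
    except (urllib.error.URLError, OSError):
        return "error"


def failure_rate(outcomes):
    """Fraction of outcomes that are not a 2xx status."""
    if not outcomes:
        return 0.0
    failed = sum(1 for outcome in outcomes if not outcome.startswith("2"))
    return failed / len(outcomes)
```

Calling `fetch_status` in a loop against the ingress address before and after deleting a pod, then passing the collected outcomes to `failure_rate`, gives a rough measure of how many requests were lost during failover.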
