13. Load Balancing

Chapter 13 of 24 · 20 min

KEY INSIGHT

Load balancers for inference serving must route based on predicted compute cost, not request volume, because identical request counts can produce dramatically different computational loads.

### Traffic Distribution Patterns

An `nginx` load balancer can handle inference routing with an `upstream` block. Open-source nginx tracks backend health passively: a server that keeps failing requests is temporarily taken out of rotation.

```nginx
upstream inference_cluster {
    least_conn;                           # prefer the backend with the fewest active connections
    server model-server-1:8000 weight=3;
    server model-server-2:8000 weight=3;
    server model-server-3:8000 weight=2;
    keepalive 32;                         # idle upstream connections kept open per worker
}

server {
    listen 443 ssl;
    # TLS needs a certificate and key; these paths are placeholders.
    ssl_certificate     /etc/nginx/tls/server.crt;
    ssl_certificate_key /etc/nginx/tls/server.key;

    location /predict {
        proxy_pass http://inference_cluster;
        proxy_http_version 1.1;           # required for upstream keepalive
        proxy_set_header Connection "";
        proxy_set_header X-Request-Length $request_length;
        proxy_connect_timeout 5s;         # connection setup only; nginx notes this usually cannot exceed 75s
        proxy_read_timeout 300s;          # allow long-running inference
    }
}
```

The `least_conn` directive routes each new request to the backend with the fewest active connections, taking server weights into account. For variable-latency inference workloads this distributes load better than round-robin, because a server stuck on slow requests accumulates connections and receives fewer new ones.

### gRPC Load Balancing Considerations

gRPC's HTTP/2 multiplexing complicates load balancing because a single long-lived TCP connection carries many streams, so a connection-level balancer pins all of a client's requests to one backend. One solution is client-side load balancing with gRPC's built-in `round_robin` policy, where the channel connects to every resolved backend address and spreads RPCs across them:

```python
import json

import grpc

# round_robin needs the target name to resolve to multiple backend addresses.
service_config = json.dumps({"loadBalancingConfig": [{"round_robin": {}}]})

channel = grpc.insecure_channel(
    "dns:///inference-cluster.consul:8001",
    options=[
        ("grpc.lb_policy_name", "round_robin"),
        ("grpc.service_config", service_config),
    ],
)
```

### Health Check Configuration

Effective health checks prevent routing requests to failing or overloaded model servers. The configuration below uses a generic schema; key names vary by load balancer.

```yaml
health_check:
  enabled: true
  interval: 5s
  timeout: 3s
  healthy_threshold: 2
  unhealthy_threshold: 3
  # Inference-specific checks
  check:
    path: /health
    expected_status: 200
    expected_response: "OK"
  # Abort if GPU memory or queue depth exceeds its threshold
  abort_on:
    gpu_memory_percent: 95
    queue_depth: 100
```
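The `healthy_threshold` and `unhealthy_threshold` settings in the health check configuration describe a small state machine: a backend changes state only after enough consecutive probe results contradict its current state. The sketch below is one possible reading of those settings, not the behavior of any particular load balancer; `HealthTracker` and `probe_passes` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks one backend's state from a stream of probe results."""

    healthy_threshold: int = 2    # consecutive passes needed to re-admit
    unhealthy_threshold: int = 3  # consecutive failures needed to eject
    healthy: bool = True
    streak: int = 0               # consecutive results contradicting current state

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result and return the backend's state afterwards."""
        if probe_ok == self.healthy:
            self.streak = 0  # result agrees with current state, reset the streak
            return self.healthy
        self.streak += 1
        needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self.streak >= needed:
            self.healthy = not self.healthy
            self.streak = 0
        return self.healthy


def probe_passes(status: int, body: str, gpu_memory_percent: float, queue_depth: int) -> bool:
    """Assumed probe rule: HTTP 200, body "OK", and both limits not exceeded."""
    return (
        status == 200
        and body.strip() == "OK"
        and gpu_memory_percent <= 95
        and queue_depth <= 100
    )
```

With a 5-second interval and an unhealthy threshold of 3, a server that dies right after a successful probe can keep receiving traffic for roughly 10 to 15 seconds, so the interval and threshold should be chosen together against the failover time you need.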

Inference workloads present unique load balancing challenges that differ from traditional web traffic patterns. Model inference latency varies dramatically based on input size, model complexity, and available GPU memory, making simple round-robin approaches ineffective. Intelligent load distribution requires metrics beyond request counts.
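One way to act on metrics beyond request counts is to track the predicted cost of in-flight work per backend and send each request to the backend that would be least loaded relative to its capacity. The following is a minimal sketch under stated assumptions: `predict_cost` is a hypothetical linear cost model (a real one would be fitted to measured latencies), and the `capacity` values simply mirror relative server weights.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    capacity: float           # relative throughput, e.g. 3.0 vs 2.0
    outstanding: float = 0.0  # predicted cost of requests currently in flight


def predict_cost(input_size: int, base: float = 10.0, per_unit: float = 1.0) -> float:
    """Hypothetical cost model: fixed overhead plus a cost per input unit."""
    return base + per_unit * input_size


class CostAwareRouter:
    def __init__(self, backends):
        self.backends = list(backends)

    def acquire(self, cost: float) -> Backend:
        """Pick the backend with the lowest load/capacity ratio after adding this request."""
        chosen = min(self.backends, key=lambda b: (b.outstanding + cost) / b.capacity)
        chosen.outstanding += cost
        return chosen

    def release(self, backend: Backend, cost: float) -> None:
        """Call when the request finishes so its cost no longer counts."""
        backend.outstanding = max(0.0, backend.outstanding - cost)


router = CostAwareRouter([
    Backend("model-server-1", 3.0),
    Backend("model-server-2", 3.0),
    Backend("model-server-3", 2.0),
])
big = predict_cost(1000)   # one large request
small = predict_cost(10)   # two small requests
picks = [router.acquire(c).name for c in (big, small, small)]
```

Here the large request occupies `model-server-1`, so both small requests go elsewhere even though every server holds at most one request. A connection-count policy would see the three servers as equally loaded after the first two requests.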

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Configure nginx as a load balancer for a three-node inference cluster. Implement custom health checks that verify GPU availability and request queue depth. Test failure scenarios by stopping individual servers and confirming that traffic is redirected within five seconds. Measure the latency distribution on each node (mean and tail percentiles) to verify that load is spread evenly.
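For the measurement step, a per-node summary makes uneven distribution easy to spot. This is a minimal sketch that assumes you have already collected `(node_name, latency_ms)` pairs, for example from access logs; `summarize_latency` is a hypothetical helper, and the 95th percentile uses the simple nearest-rank method.

```python
import math
import statistics
from collections import defaultdict


def summarize_latency(samples):
    """Group (node_name, latency_ms) pairs by node and summarize each group."""
    by_node = defaultdict(list)
    for node, latency_ms in samples:
        by_node[node].append(latency_ms)

    summary = {}
    for node, values in by_node.items():
        values.sort()
        # Nearest-rank p95: the smallest value with at least 95% of samples at or below it.
        p95_index = max(0, math.ceil(0.95 * len(values)) - 1)
        summary[node] = {
            "count": len(values),
            "mean": statistics.fmean(values),
            "p50": statistics.median(values),
            "p95": values[p95_index],
        }
    return summary
```

Compare `count` across nodes against the configured weights, and compare `p95` rather than `mean` alone, since a node that is overloaded only occasionally shows up in the tail first.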