How to set up auto-scaling for LLM inference

Advanced · 30 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
Prerequisites

A Kubernetes cluster or a cloud platform that supports autoscaling, with the inference service already deployed on it.

What this does

Auto-scaling lets an LLM inference service adjust its capacity to demand: it scales up during traffic spikes and scales down during idle periods to control costs. The configuration in this guide uses request queue depth, GPU utilization, and response latency as scaling metrics. The aim is to add replicas before requests pile up during a surge, without paying for idle GPU instances the rest of the time.

Steps

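  • Define the scaling metric strategy. The most reliable approach combines two metrics: request queue depth (the number of pending inference requests) and GPU memory utilization.

  • Expose the queue metric. In Kubernetes, have the inference service expose a custom metric, inference_queue_depth, through a Prometheus exporter sidecar, and keep the exporter settings in a ConfigMap.

  • Install the Prometheus adapter so the HPA can read the metric: helm repo add prometheus-community https://prometheus-community.github.io/helm-charts, then helm install prometheus-adapter prometheus-community/prometheus-adapter. Add -n monitoring if the adapter should live in the monitoring namespace, which the troubleshooting command under Common failures assumes. The adapter also needs a rule that maps the Prometheus series to the custom metric name.

  • Create an HPA manifest referencing the custom metric: metrics: [{type: Pods, pods: {metric: {name: inference_queue_depth}, target: {type: AverageValue, averageValue: "5"}}}].

  • Set the replica bounds. Use a minimum of 1 replica (or 0 on serverless platforms) and a maximum based on the GPU node count.

  • Add GPU memory-based scaling. Expose the ratio nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes as a second metric with a target of 0.85. The HPA takes a target value rather than a threshold expression, so it scales up once the observed ratio exceeds that target. When several metrics are configured, it uses the largest replica count that any of them proposes.

  • Implement scale-down stabilization with behavior.scaleDown.stabilizationWindowSeconds: 300 to prevent thrashing.

  • For serverless platforms, configure concurrency limits per instance (typically 1-4 concurrent requests per GPU) and set the idle timeout to 300 seconds before scaling to zero.

  • Test the configuration by running a load generation script.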
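The HPA settings described above can be sketched as one manifest. The snippet below is a minimal illustration and not the author's exact configuration: it builds an autoscaling/v2 HorizontalPodAutoscaler as a Python dict and prints it as JSON, which kubectl accepts. The Deployment name llm-inference, the function name build_hpa, and the default maximum of 4 replicas are assumptions made for the example.

```python
import json


def build_hpa(deployment, min_replicas=1, max_replicas=4,
              queue_target="5", scale_down_window=300):
    """Return an autoscaling/v2 HorizontalPodAutoscaler manifest as a dict.

    `deployment` is the name of the Deployment to scale (hypothetical here).
    """
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment}-hpa"},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [
                {
                    # Per-pod custom metric served through the Prometheus adapter.
                    "type": "Pods",
                    "pods": {
                        "metric": {"name": "inference_queue_depth"},
                        "target": {
                            "type": "AverageValue",
                            "averageValue": queue_target,
                        },
                    },
                }
            ],
            "behavior": {
                # Wait this long before acting on a lower recommendation.
                "scaleDown": {"stabilizationWindowSeconds": scale_down_window}
            },
        },
    }


if __name__ == "__main__":
    # "llm-inference" is an assumed Deployment name; replace it with your own.
    print(json.dumps(build_hpa("llm-inference"), indent=2))
```

If the script is saved as build_hpa.py (a hypothetical file name), its output can be piped straight into the cluster with python build_hpa.py | kubectl apply -f -.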

  • Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.

  • Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.

  • Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.

  • Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
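One way to capture run evidence is a small wrapper that runs a command and appends what happened to a log file. This is a sketch only: the function name run_and_record, the file name run-evidence.jsonl, and the choice of recorded fields are assumptions, not part of the guide.

```python
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone


def run_and_record(command, log_path="run-evidence.jsonl", model=None):
    """Run `command`, append a record of it to a JSON Lines file, return the record."""
    result = subprocess.run(command, capture_output=True, text=True)
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "command": command,                      # exact argv that was executed
        "python": platform.python_version(),     # runtime version of this script
        "platform": platform.platform(),
        "model": model,                          # model name, if applicable
        "returncode": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


if __name__ == "__main__":
    # Example: record the interpreter version; swap in the command under test.
    rec = run_and_record([sys.executable, "--version"])
    print(rec["returncode"], rec["stdout"].strip())
```

Each call adds one line to the log, so repeated runs build up a history that can be compared later.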

Verification

  • Verify the baseline. Send a single inference request and confirm that 1 pod is running.

  • Generate load with locust or hey. Hold 100 concurrent requests for 60 seconds and watch kubectl get hpa -w; replicas should increase within 60 seconds.

  • Verify scale-down. After the load stops, wait 5 minutes and verify that replicas scale down to the minimum.

  • Check for errors. Confirm that no requests received 5xx errors during the scale-up transition.

  • Inspect scale events. Run kubectl describe hpa and verify that scale events are logged with timestamps and metric values.

  • Test scale-to-zero (if configured). After the idle timeout, confirm that zero pods are running, then send a request and verify that a cold start completes within 30 seconds.
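To judge whether the replica count seen during the load test is plausible, it helps to work through the HPA's documented scaling rule: desired replicas = ceil(current replicas × current metric value / target value), with no change when the ratio is within a tolerance (0.1 by default). The sketch below implements only that rule plus the min/max clamp. It ignores scale-up rate policies, pods that are not yet ready, and stabilization windows, and the function name and sample numbers are illustrative assumptions.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Estimate the replica count the HPA would ask for from one metric.

    current_value is the per-pod average of the metric (for example,
    inference_queue_depth), and target_value is the HPA's averageValue.
    """
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: the HPA leaves the replica count alone.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # The HPA never goes below minReplicas or above maxReplicas.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # One pod with 20 queued requests against a target of 5 per pod.
    print(desired_replicas(1, 20, 5, max_replicas=4))
    # Four pods averaging 1 queued request each against the same target.
    print(desired_replicas(4, 1, 5, max_replicas=4))
```

With a target of 5 and a single pod holding 20 queued requests, the rule asks for 4 replicas; once the queue drains to an average of 1 per pod across 4 pods, it asks for 1.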

Common failures

  • HPA cannot read custom metrics - Check the Prometheus adapter logs with kubectl logs -n monitoring deployment/prometheus-adapter and verify that the metric name matches exactly.
  • Scale-up too slow, causing request timeouts - Reduce the metrics polling interval (the HPA sync period defaults to 15s, and the Prometheus scrape interval adds further delay) or pre-warm with a minimum of 2 replicas.
  • Scale-down too aggressive - Increase stabilizationWindowSeconds to 600 and add policies with a slower scale-down rate.
  • GPU lock contention with multiple replicas - Use GPU time-slicing or MIG (Multi-Instance GPU) to share GPUs across pods.
  • Cold start delays on scale-from-zero - Bake the model weights into the container image or use a model cache daemonset to pre-load weights on GPU nodes.

  • Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
  • Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
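The effect of stabilizationWindowSeconds on scale-down behavior can be illustrated with a small simulation. During scale-down stabilization the HPA considers the recommendations made within the window and applies the highest one, so a brief dip in load does not remove replicas immediately. The class below is a simplified sketch of that idea and not the controller's actual code; the class name ScaleDownStabilizer and the timestamps are hypothetical.

```python
from collections import deque


class ScaleDownStabilizer:
    """Apply the highest replica recommendation seen inside a time window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp_seconds, recommended_replicas)

    def recommend(self, now, desired):
        """Record a new recommendation and return the stabilized replica count."""
        self._history.append((now, desired))
        # Forget recommendations that have aged out of the window.
        while self._history and self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        # Scale-ups pass through at once; scale-downs wait until every
        # higher recommendation has left the window.
        return max(replicas for _, replicas in self._history)


if __name__ == "__main__":
    stabilizer = ScaleDownStabilizer(window_seconds=300)
    for t, desired in [(0, 4), (60, 1), (299, 1), (301, 1)]:
        print(t, desired, stabilizer.recommend(t, desired))
```

In this run the recommendation of 4 replicas at t=0 keeps the count at 4 until it leaves the 300-second window, after which the count drops to 1. A longer window, such as the 600 seconds suggested above, delays that drop further.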

Related guides

  • deploy-ai-kubernetes-gpu-nodes
  • build-multi-tenant-ai-serving
  • monitor-agent-token-usage-cost