How to set up auto-scaling for LLM inference
Prerequisites: a Kubernetes cluster or cloud platform with the inference service already deployed.
What this does
Auto-scaling lets an LLM inference service adjust capacity to demand: it scales up during traffic spikes and scales down during idle periods to control costs. Request queue depth, GPU utilization, and response latency can all serve as scaling metrics; the steps below use queue depth and GPU memory utilization. Scaling before replicas saturate limits the impact of cold-start delays during usage surges, while scaling down prevents waste from idle GPU instances.
Steps
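Define the scaling metric strategy. The most reliable approach combines two metrics: request queue depth (the number of pending inference requests) and GPU memory utilization.
Expose the queue-depth metric. In Kubernetes, have the inference service export a custom metric, inference_queue_depth, via a Prometheus exporter sidecar. Install the Prometheus adapter so the HPA can read it: helm install prometheus-adapter prometheus-community/prometheus-adapter (this assumes the prometheus-community chart repository has already been added with helm repo add). The adapter's rules, which the chart typically stores in a ConfigMap, must map the Prometheus series to the custom metrics API.
Create the HPA. Write an HPA manifest referencing the custom metric: metrics: [{type: Pods, pods: {metric: {name: inference_queue_depth}, target: {type: AverageValue, averageValue: "5"}}}]. Set minimum replicas to 1 and maximum replicas based on the GPU node count. A minimum of 0 applies to serverless-style setups; the built-in HPA only accepts minReplicas: 0 when the alpha HPAScaleToZero feature gate is enabled, so scale-to-zero usually relies on the platform or an add-on.
Add GPU memory-based scaling. Add a second metric so that nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes > 0.85 triggers scale-up. The HPA cannot evaluate an expression itself, so compute the ratio in a Prometheus recording rule or adapter rule, expose it as a single metric, and set its target to 0.85.
Stabilize scale-down. Set behavior.scaleDown.stabilizationWindowSeconds: 300 to prevent thrashing.
Configure serverless platforms. Set concurrency limits per instance (typically 1-4 concurrent requests per GPU) and set the idle timeout to 300 seconds before scaling to zero.
Test the configuration. Run a load generation script against the service and watch the replica count.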
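The HPA settings described above can be assembled into one manifest. The following is a minimal sketch, not a definitive configuration: it builds an autoscaling/v2 HorizontalPodAutoscaler as a Python dict and prints it as JSON, which kubectl accepts in place of YAML. The names llm-inference-hpa, llm-inference, and gpu_memory_utilization_ratio are hypothetical; the last one assumes the used/total ratio has already been exposed as a single per-pod metric by an adapter or recording rule.

```python
import json


def build_hpa(name, deployment, min_replicas, max_replicas):
    """Return an autoscaling/v2 HorizontalPodAutoscaler manifest as a dict."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [
                {
                    # Queue depth, averaged across pods; target taken from the guide.
                    "type": "Pods",
                    "pods": {
                        "metric": {"name": "inference_queue_depth"},
                        "target": {"type": "AverageValue", "averageValue": "5"},
                    },
                },
                {
                    # Hypothetical precomputed ratio metric; "850m" is the
                    # Kubernetes quantity notation for 0.85.
                    "type": "Pods",
                    "pods": {
                        "metric": {"name": "gpu_memory_utilization_ratio"},
                        "target": {"type": "AverageValue", "averageValue": "850m"},
                    },
                },
            ],
            # Wait 300 seconds before acting on a lower recommendation.
            "behavior": {"scaleDown": {"stabilizationWindowSeconds": 300}},
        },
    }


if __name__ == "__main__":
    # Hypothetical names; max replicas should match the available GPU nodes.
    print(json.dumps(build_hpa("llm-inference-hpa", "llm-inference", 1, 4), indent=2))
```

If saved as build_hpa.py (a hypothetical file name), the output can be applied with python build_hpa.py | kubectl apply -f -. When several metrics are listed, the HPA computes a replica count for each and uses the largest.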
Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.
Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.
Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.
Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
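For the load generation step, a small script is enough when locust or hey is not available. The sketch below uses only the Python standard library; it keeps a fixed number of workers sending POST requests until a deadline and tallies the status codes. The JSON payload and the command-line handling are assumptions and would need to match the actual inference API.

```python
import http.client
import sys
import time
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def send_request(url, payload, timeout):
    """POST one request; return the HTTP status code, or 0 if no response arrived."""
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}, method="POST"
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses are raised as HTTPError
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return 0  # connection refused, reset, or timed out


def run_load(url, payload, concurrency, duration_s, timeout=60.0):
    """Keep `concurrency` workers busy for `duration_s` seconds; return status counts."""
    deadline = time.monotonic() + duration_s

    def worker(_):
        local = Counter()
        while time.monotonic() < deadline:
            local[send_request(url, payload, timeout)] += 1
        return local

    total = Counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for local in pool.map(worker, range(concurrency)):
            total.update(local)
    return total


def count_5xx(counts):
    """Number of responses with a 5xx status code."""
    return sum(n for status, n in counts.items() if 500 <= status <= 599)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical payload; replace with a real request body for the service.
    results = run_load(sys.argv[1], b'{"prompt": "ping"}', concurrency=100, duration_s=60)
    print(dict(results), "5xx:", count_5xx(results))
```

Run it with the service URL as the only argument while watching the replica count in another terminal. A status of 0 in the tally means the request never received an HTTP response.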
Verification
Verify the baseline: send a single inference request and confirm that 1 pod is running. Generate load with locust or hey: sustain 100 concurrent requests for 60 seconds and watch kubectl get hpa -w; replicas should increase within 60 seconds. After the load stops, wait at least 5 minutes (the 300-second stabilization window) and verify that replicas scale down to the minimum. Check that no requests received 5xx errors during the scale-up transition. Run kubectl describe hpa and verify that scale events are logged with timestamps and metric values. Test scale-to-zero (if configured): after the idle timeout, confirm that zero pods are running, then send a request and verify that the cold start completes within 30 seconds.
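To judge whether the replica counts seen during verification are plausible, it helps to know what the HPA should ask for. The sketch below applies the documented HPA rule, desired = ceil(current replicas x current metric / target), with the default 10% tolerance. It is a simplification: it ignores stabilization windows, scaling policies, and pods that are not yet ready, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas, current_avg, target_avg,
                     min_replicas, max_replicas, tolerance=0.1):
    """Approximate the HPA recommendation for a single averaged metric."""
    ratio = current_avg / target_avg
    if abs(ratio - 1.0) <= tolerance:
        # Within tolerance of the target: the HPA leaves the replica count alone.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # The recommendation is always clamped to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # One pod with 20 queued requests against a target of 5 per pod.
    print(desired_replicas(1, 20, 5, min_replicas=1, max_replicas=4))  # 4
    # Four pods averaging 1 queued request each after the load stops.
    print(desired_replicas(4, 1, 5, min_replicas=1, max_replicas=4))   # 1
```

If the observed replica count stays well below this estimate during the load test, the metric value reported by kubectl describe hpa is the first thing to compare against the value in Prometheus.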
Common failures
- HPA cannot read custom metrics - Check the Prometheus adapter logs with kubectl logs -n monitoring deployment/prometheus-adapter and verify that the metric name matches exactly.
- Scale-up too slow, causing request timeouts - Reduce the metrics polling interval (the HPA sync period defaults to 15s; the Prometheus scrape interval and the adapter's relist interval add further delay) or pre-warm with a minimum of 2 replicas.
- Scale-down too aggressive - Increase stabilizationWindowSeconds to 600 and add policies with a slower scale-down rate.
- GPU lock contention with multiple replicas - Use GPU time-slicing or MIG (Multi-Instance GPU) to share GPUs across pods.
- Cold start delays on scale-from-zero - Bake model weights into the container image or use a model cache daemonset to pre-load weights on GPU nodes.
- Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
- Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
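For the "Scale-down too aggressive" case, the fix is a longer stabilization window plus a rate-limiting policy. The sketch below builds such a behavior block as a merge patch. The 600-second window comes from the guide; the rate of one pod every 120 seconds is an assumed example value, not a recommendation.

```python
import json


def scale_down_patch(window_seconds=600, pods_per_period=1, period_seconds=120):
    """Return a merge patch that slows HPA scale-down."""
    return {
        "spec": {
            "behavior": {
                "scaleDown": {
                    # Use the highest recommendation seen during this window.
                    "stabilizationWindowSeconds": window_seconds,
                    # Remove at most `pods_per_period` pods per `period_seconds`.
                    "policies": [
                        {
                            "type": "Pods",
                            "value": pods_per_period,
                            "periodSeconds": period_seconds,
                        }
                    ],
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(scale_down_patch()))
```

The printed JSON can be passed to kubectl patch hpa with --type merge and -p, using the name of the existing HPA. A Percent policy type is also available when the replica count is large enough for percentages to be meaningful.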
Related guides
- deploy-ai-kubernetes-gpu-nodes
- build-multi-tenant-ai-serving
- monitor-agent-token-usage-cost