How to implement predictive autoscaling for AI workloads using historical patterns
Prerequisites: historical metrics data in Prometheus, KEDA or a custom scaler
What this does
This guide implements predictive autoscaling for AI inference and training workloads by analyzing historical request patterns and pre-warming compute capacity before demand peaks. A reactive HorizontalPodAutoscaler (HPA) waits for a metric to cross a threshold before adding capacity. Predictive scaling instead uses a Prometheus recording rule that forecasts the request rate with a linear regression over the past 4 weeks of hourly traffic. A custom scaler or a KEDA cron trigger then schedules capacity increases for known peak periods.
Steps
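Verify historical request data exists in Prometheus:

    curl -sG "http://prometheus:9090/api/v1/query" \
      --data-urlencode 'query=ai_requests_total[4w]' \
      | jq '.data.result[0].values | length'

Expected output: a number greater than 0, confirming that samples exist in the 4-week window. A non-zero count shows that data is present; it does not by itself show that the samples span the full 4 weeks.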
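
To check how much of the 4-week window the samples actually cover, the same API response can be inspected for its first and last timestamps. The following is a minimal sketch, assuming the response has the matrix shape that the query above returns; the `coverage` helper and the sample payload are hypothetical and only illustrate the check.

```python
import json
from datetime import timedelta


def coverage(response_text):
    """Return (sample_count, covered_span) for the first series in a
    Prometheus query API response that holds a range vector (matrix)."""
    payload = json.loads(response_text)
    result = payload["data"]["result"]
    if not result or not result[0]["values"]:
        return 0, timedelta(0)
    values = result[0]["values"]  # each entry is [unix_timestamp, "value"]
    first_ts = float(values[0][0])
    last_ts = float(values[-1][0])
    return len(values), timedelta(seconds=last_ts - first_ts)


# Hypothetical response with three samples spread over exactly 28 days.
start = 1_700_000_000
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [{
            "metric": {"__name__": "ai_requests_total"},
            "values": [
                [start, "100"],
                [start + 14 * 86400, "5000"],
                [start + 28 * 86400, "9000"],
            ],
        }],
    },
})

count, span = coverage(sample)
print(count, span >= timedelta(weeks=4))
```

In practice the response text would come from the curl command above rather than from an inline payload.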
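Create a Prometheus recording rule that computes the forecast. In predictive-rules.yml:

    groups:
      - name: predictive_scaling
        # Evaluate more often than Prometheus's default 5-minute lookback
        # window so the recorded series does not go stale between runs.
        interval: 1m
        rules:
          - record: forecast:ai_request_rate_1h
            expr: |
              predict_linear(
                rate(ai_requests_total[1h])[4w:1h],
                3600
              )

The predict_linear function projects the request rate one hour (3600 seconds) into the future based on the 4-week trend.

Load the recording rule into Prometheus and verify:

    promtool check rules predictive-rules.yml
    curl -X POST http://prometheus:9090/-/reload
    curl -s "http://prometheus:9090/api/v1/query?query=forecast:ai_request_rate_1h" | jq '.data.result[0].value[1]'

Expected output: a numeric forecast value. The reload endpoint only responds when Prometheus runs with --web.enable-lifecycle, and predictive-rules.yml must be listed under rule_files in the Prometheus configuration.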
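
To make the forecast less of a black box, the regression idea behind predict_linear can be sketched in plain Python. This is an illustrative least-squares extrapolation, not Prometheus's own code; the function name, argument names, and sample values are hypothetical.

```python
def predict_linear_sketch(timestamps, values, eval_time, horizon_s):
    """Fit a least-squares line through (timestamp, value) samples and
    extrapolate it horizon_s seconds past eval_time."""
    n = len(timestamps)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (timestamp, value) pairs")
    # Measure time relative to the evaluation time to keep numbers small.
    xs = [t - eval_time for t in timestamps]
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    variance = sum((x - mean_x) ** 2 for x in xs)
    if variance == 0:
        raise ValueError("all samples share the same timestamp")
    slope = covariance / variance
    intercept = mean_y - slope * mean_x  # fitted value at eval_time
    return intercept + slope * horizon_s


# Hypothetical hourly request rates growing by 10 req/s per hour.
ts = [0, 3600, 7200]
rates = [10.0, 20.0, 30.0]
forecast = predict_linear_sketch(ts, rates, eval_time=7200, horizon_s=3600)
print(round(forecast, 6))  # 40.0
```

Because the fit is a single straight line over the whole window, it captures the overall trend rather than the daily or weekly shape of the traffic, which is why the guide pairs it with a cron trigger for known peak hours.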
Expose the forecast metric to Kubernetes custom metrics via the Prometheus adapter. This is only needed when a custom scaler or a plain HPA consumes the metric; KEDA's Prometheus trigger queries Prometheus directly. Add to adapter-config.yml:

    rules:
      custom:
        - seriesQuery: 'forecast:ai_request_rate_1h'
          metricsQuery: avg(forecast:ai_request_rate_1h)

Create a KEDA ScaledObject with both a cron trigger (for known peak hours) and a Prometheus trigger (for the forecast). Save it as predictive-scaler.yaml:

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: ai-predictive-scaler
    spec:
      scaleTargetRef:
        name: ai-inference
      minReplicaCount: 1
      maxReplicaCount: 10
      triggers:
        - type: prometheus
          metadata:
            serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
            metricName: forecast_ai_request_rate_1h
            query: forecast:ai_request_rate_1h
            threshold: "50"
        - type: cron
          metadata:
            timezone: America/New_York
            start: 30 8 * * 1-5
            end: 30 17 * * 1-5
            desiredReplicas: "5"

Apply the KEDA ScaledObject:
    kubectl apply -f predictive-scaler.yaml
    kubectl get scaledobject ai-predictive-scaler

Expected output: the READY column shows True, confirming the scaler is active.

Add a cooldown period to prevent rapid scale-down after peaks. Configure it under spec in the ScaledObject, then re-apply the file:

    advanced:
      horizontalPodAutoscalerConfig:
        behavior:
          scaleDown:
            stabilizationWindowSeconds: 600
            policies:
              - type: Percent
                value: 50
                periodSeconds: 60

Monitor the predictive scaler during a known peak. At the scheduled cron time (8:30 AM America/New_York on weekdays), observe:

    kubectl get hpa keda-hpa-ai-predictive-scaler -w

Expected: the HPA's replica count rises to at least 5 at the start of the cron window, before the traffic spike arrives.
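
The way the two triggers and the scale-down policy interact can be approximated with a small model. The sketch below is a simplification under stated assumptions, not KEDA's or the HPA's actual algorithm: it assumes the forecast trigger asks for ceil(metric / threshold) replicas, that the highest request among triggers wins, and that each 60-second period may remove at most 50% of the current replicas. It ignores the HPA tolerance and the 600-second stabilization window. Both function names are hypothetical.

```python
import math


def desired_replicas(forecast_rate, threshold=50.0, cron_active=False,
                     cron_replicas=5, min_replicas=1, max_replicas=10):
    """Approximate replica count requested by the forecast and cron triggers."""
    from_forecast = math.ceil(forecast_rate / threshold)
    from_cron = cron_replicas if cron_active else 0
    wanted = max(from_forecast, from_cron)  # assumption: largest request wins
    return max(min_replicas, min(max_replicas, wanted))


def scale_down_steps(current, target, percent=50):
    """Replica counts per policy period when at most `percent` of the
    current replicas may be removed in each period."""
    steps = [current]
    while current > target:
        floor = math.ceil(current * (1 - percent / 100))
        next_count = max(target, floor)
        if next_count == current:  # no further reduction is possible
            break
        current = next_count
        steps.append(current)
    return steps


print(desired_replicas(120))                    # 3
print(desired_replicas(120, cron_active=True))  # 5
print(desired_replicas(900))                    # 10
print(scale_down_steps(5, 1))                   # [5, 3, 2, 1]
```

Under these assumptions, a forecast of 120 requests per second asks for 3 replicas, the cron window lifts that to 5, and after the peak the count steps down over several periods instead of dropping straight to the minimum.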
Verification
    kubectl get scaledobject ai-predictive-scaler -o json | jq '.status.externalMetricNames'
Expected output: the list of external metric names being used by KEDA (e.g., ["prometheus-forecast_ai_request_rate_1h", "cron-...-..."]).
Common failures
- predict_linear returns no data or NaN: the function needs at least 2 data points in the range vector. If the AI service was deployed less than 2 hours ago, the 4-week subquery contains fewer than two hourly points. Check with count_over_time(ai_requests_total[4w]).
- KEDA cron timezone mismatch: the cron expression is evaluated in the timezone named in the trigger (an IANA name such as America/New_York), not in the cluster's timezone. Verify that the timezone field is correct, or use UTC and adjust the start and end times accordingly.
- HPA and KEDA conflict: ensure no separate HPA targets the same Deployment. KEDA manages its own internal HPA. If one already exists, delete it with kubectl delete hpa <name> before applying the ScaledObject.
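
When rewriting the cron window in UTC, note that the offset changes with daylight saving time. The following sketch uses Python's standard zoneinfo module (Python 3.9+, with system timezone data available) to show where the 8:30 AM America/New_York start lands in UTC on two example dates; the helper name and the dates are chosen only for illustration.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_to_utc(year, month, day, hour, minute, tz_name):
    """Convert a local wall-clock time to an HH:MM string in UTC."""
    local = datetime(year, month, day, hour, minute, tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc).strftime("%H:%M")


# Two Mondays: one in standard time, one in daylight saving time.
winter = local_to_utc(2024, 1, 15, 8, 30, "America/New_York")
summer = local_to_utc(2024, 7, 15, 8, 30, "America/New_York")
print(winter, summer)  # 13:30 12:30
```

A fixed UTC schedule therefore drifts by one hour relative to local business hours twice a year, which is the main argument for keeping the IANA timezone in the trigger.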