RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

© 2026 runlocalai.co · Independently operated
HOW-TO · OPS

How to implement predictive autoscaling for AI workloads using historical patterns

advanced·35 min·By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Historical metrics data, KEDA or custom scaler

What this does

This guide implements predictive autoscaling for AI inference and training workloads: it analyzes historical request patterns and pre-warms compute capacity before demand peaks. Reactive HPA waits for a metric to cross a threshold before it adds replicas. Predictive scaling instead uses a Prometheus recording rule that forecasts the request rate with a linear regression over the past 4 weeks of hourly traffic. A straight-line fit captures the overall trend but not daily or weekly cycles, so a custom scaler or KEDA cron trigger also schedules capacity increases for known peak periods.

Steps

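  1. Verify historical request data exists in Prometheus:

    # -G with --data-urlencode encodes the [4w] brackets, which curl would otherwise parse as a URL glob range
    curl -sG "http://prometheus:9090/api/v1/query" --data-urlencode 'query=ai_requests_total[4w]' | jq '.data.result[0].values | length'
    

    Expected output: a number greater than 0. This confirms that samples exist within the last 4 weeks; it does not by itself confirm that the full 4-week window is covered.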
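
The same check can be scripted. The following is a minimal Python sketch rather than part of the original procedure: it builds a percent-encoded instant-query URL and counts the samples in a response body. The helper names build_query_url and count_samples are hypothetical; the base URL and metric name are the ones used in step 1.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Return an instant-query URL with the PromQL expression percent-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def count_samples(response_body: str) -> int:
    """Count the samples in the first series of a range-vector (matrix) response."""
    payload = json.loads(response_body)
    results = payload.get("data", {}).get("result", [])
    if not results:
        return 0
    return len(results[0].get("values", []))


# Base URL and metric name taken from step 1; adjust for your cluster.
url = build_query_url("http://prometheus:9090", "ai_requests_total[4w]")
print(url)
```

Fetching the URL (for example with urllib.request.urlopen) and passing the body to count_samples gives the same number as the jq pipeline.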

  2. Create a Prometheus recording rule that computes the forecast. In predictive-rules.yml:

    groups:
      - name: predictive_scaling
        interval: 1h
        rules:
          - record: forecast:ai_request_rate_1h
            expr: |
              predict_linear(
                rate(ai_requests_total[1h])[4w:1h],
                3600
              )
    

    The predict_linear function fits a least-squares line to the 4 weeks of hourly rate samples and projects it 3600 seconds (one hour) past the evaluation time.
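
To make the forecast in step 2 concrete, here is a small Python sketch of a least-squares line fit and extrapolation. It illustrates the idea behind predict_linear and is not Prometheus's implementation; the function name linear_forecast and the sample values are made up for the example.

```python
from typing import Sequence, Tuple


def linear_forecast(samples: Sequence[Tuple[float, float]],
                    now: float, horizon_s: float) -> float:
    """Fit value = slope * t + intercept by least squares, then evaluate at now + horizon_s.

    samples is a sequence of (unix_seconds, value) pairs.
    """
    if len(samples) < 2:
        raise ValueError("need at least 2 samples to fit a line")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("all samples share the same timestamp")
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return slope * (now + horizon_s) + intercept


# Hypothetical hourly request rates (requests/s) rising by 2 per hour.
hourly = [(0.0, 10.0), (3600.0, 12.0), (7200.0, 14.0), (10800.0, 16.0)]
forecast = linear_forecast(hourly, now=10800.0, horizon_s=3600.0)
print(round(forecast, 3))
```

With these samples the fitted line continues the upward trend, so the projection one hour ahead is about 18 requests/s. A traffic pattern that repeats daily or weekly is not captured by a single line, which is the gap the cron trigger covers.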

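  3. Load the recording rule into Prometheus and verify. The file must be listed under rule_files in the Prometheus configuration, and the reload endpoint only responds when Prometheus runs with --web.enable-lifecycle: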

    promtool check rules predictive-rules.yml
    curl -X POST http://prometheus:9090/-/reload
    curl -s "http://prometheus:9090/api/v1/query?query=forecast:ai_request_rate_1h" | jq '.data.result[0].value[1]'
    

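    Expected output: a numeric forecast value. The rule group evaluates once per hour, so the query returns null until the first evaluation has run, and again once the latest sample is older than Prometheus's default 5-minute lookback window.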

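  4. Optionally expose the forecast metric through the Kubernetes custom metrics API via the Prometheus adapter. The KEDA Prometheus trigger in step 5 queries Prometheus directly and does not depend on this step; the adapter is only needed if other consumers, such as a plain HPA, read the forecast. Add to adapter-config.yml: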

    rules:
      custom:
        - seriesQuery: 'forecast:ai_request_rate_1h'
          metricsQuery: avg(forecast:ai_request_rate_1h)
    
  5. Create a KEDA ScaledObject with both a cron trigger (for known peak hours) and a Prometheus trigger (for the forecast). Save it as predictive-scaler.yaml:

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: ai-predictive-scaler
    spec:
      scaleTargetRef:
        name: ai-inference
      minReplicaCount: 1
      maxReplicaCount: 10
      triggers:
        - type: prometheus
          metadata:
            serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
            metricName: forecast_ai_request_rate_1h
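            # The query must return a single value; last_over_time bridges the gap between hourly rule evaluations
            query: sum(last_over_time(forecast:ai_request_rate_1h[2h]))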
            threshold: "50"
        - type: cron
          metadata:
            timezone: America/New_York
            start: 30 8 * * 1-5
            end: 30 17 * * 1-5
            desiredReplicas: "5"
    
  6. Apply the KEDA ScaledObject:

    kubectl apply -f predictive-scaler.yaml
    kubectl get scaledobject ai-predictive-scaler
    

    Expected output: READY True confirming the scaler is active.
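
Once the ScaledObject is active, the two triggers from step 5 combine through the HPA that KEDA manages: each trigger proposes a replica count and the largest proposal wins, bounded by minReplicaCount and maxReplicaCount. The Python sketch below is a simplified model of that arithmetic, not KEDA's code. It assumes the Prometheus trigger behaves as an average-value target (replicas = ceil(forecast / threshold)), treats the cron trigger as contributing nothing outside its window, and ignores HPA tolerance. The function names are hypothetical.

```python
import math


def prometheus_trigger_replicas(forecast: float, threshold: float) -> int:
    """Replicas needed so that forecast / replicas does not exceed threshold."""
    if forecast <= 0:
        return 0
    return math.ceil(forecast / threshold)


def effective_replicas(forecast: float, threshold: float,
                       cron_active: bool, cron_replicas: int,
                       min_replicas: int, max_replicas: int) -> int:
    """Take the larger of the two trigger proposals, then clamp to the min/max bounds."""
    cron_proposal = cron_replicas if cron_active else 0  # simplification outside the window
    wanted = max(prometheus_trigger_replicas(forecast, threshold), cron_proposal)
    return max(min_replicas, min(max_replicas, wanted))


# Values from the ScaledObject in step 5: threshold 50, cron desiredReplicas 5, bounds 1..10.
off_peak = effective_replicas(120.0, 50.0, cron_active=False, cron_replicas=5,
                              min_replicas=1, max_replicas=10)
in_window = effective_replicas(120.0, 50.0, cron_active=True, cron_replicas=5,
                               min_replicas=1, max_replicas=10)
print(off_peak, in_window)
```

Under this model, a forecast of 120 requests/s asks for 3 replicas on its own, and the cron window raises that to 5. The cron trigger sets a floor during peak hours; the forecast can still push the count higher, up to maxReplicaCount.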

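  7. Add a scale-down stabilization window to prevent rapid scale-down after peaks. Add the following under spec in the ScaledObject and re-apply it. (KEDA's own cooldownPeriod setting is separate and only governs scaling to zero.)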

    advanced:
      horizontalPodAutoscalerConfig:
        behavior:
          scaleDown:
            stabilizationWindowSeconds: 600
            policies:
              - type: Percent
                value: 50
                periodSeconds: 60
    
  8. Monitor the predictive scaler during a known peak. At the cron start time (8:30 AM America/New_York on weekdays), watch the HPA that KEDA created:

    kubectl get hpa keda-hpa-ai-predictive-scaler -w
    

    Expected: the HPA target replicas increase to 5 before the traffic spike arrives.
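
The scale-down behavior configured in step 7 can be reasoned about with a small simulation. The Python sketch below is a simplified model under stated assumptions, not the HPA controller's algorithm: it assumes demand drops at t=0 and stays low, that the first reduction happens once the stabilization window has elapsed, and that each period removes at most the configured percentage, rounded down. The rounding rule and the function name scale_down_schedule are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def scale_down_schedule(start_replicas: int, target_replicas: int,
                        percent: int = 50, period_s: int = 60,
                        stabilization_s: int = 600) -> List[Tuple[int, int]]:
    """Return (seconds_since_demand_dropped, replicas) steps for a scale-down."""
    steps = [(0, start_replicas)]
    elapsed = stabilization_s  # no reduction until the stabilization window has passed
    current = start_replicas
    while current > target_replicas:
        # Assumed rounding: the lowest count allowed this period is rounded down.
        lowest_allowed = math.floor(current * (1 - percent / 100))
        next_count = max(target_replicas, lowest_allowed)
        if next_count == current:
            break  # policy permits no further reduction (e.g. percent=0)
        current = next_count
        steps.append((elapsed, current))
        elapsed += period_s
    return steps


# Defaults mirror step 7: 600 s stabilization, at most 50% removed per 60 s.
schedule = scale_down_schedule(start_replicas=8, target_replicas=1)
for seconds, replicas in schedule:
    print(f"t+{seconds}s: {replicas} replicas")
```

The point of the model is the shape of the curve: capacity is held for the full stabilization window after a peak, then released in halving steps rather than all at once.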

Verification

kubectl get scaledobject ai-predictive-scaler -o json | jq '.status.externalMetricNames'

Expected output: the list of external metric names KEDA registered for the two triggers (e.g., ["prometheus-forecast_ai_request_rate_1h", "cron-...-..."]). The exact names vary by KEDA version.

Common failures

  • predict_linear returns NaN or no data: the function needs at least 2 data points in the range vector. If the AI service was deployed less than 2 hours ago, the 4-week subquery at 1-hour resolution holds fewer than 2 points. Check with count_over_time(ai_requests_total[4w]).
  • KEDA cron timezone mismatch: the cron expression is evaluated in the timezone set on the trigger, not the cluster's local time. Confirm that the timezone is a valid IANA name (such as America/New_York) and matches where the peak occurs, or use UTC and adjust the start/end times accordingly.
  • HPA and KEDA conflict: ensure no separate HPA targets the same Deployment, because KEDA creates and manages its own HPA. If one already exists, delete it with kubectl delete hpa <name> before applying the ScaledObject.
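
For the timezone mismatch above, it can help to check which instants actually fall inside the cron window. The Python sketch below is an illustration only and does not use KEDA: it tests whether a timezone-aware datetime lands on a weekday between 08:30 and 17:30 in a given timezone. The function name in_peak_window is hypothetical, and the fixed UTC-4 offset stands in for America/New_York during daylight saving time.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_peak_window(moment: datetime, tz: tzinfo,
                   start: time = time(8, 30), end: time = time(17, 30)) -> bool:
    """True if moment falls Mon-Fri in [start, end) when expressed in tz.

    moment must be timezone-aware.
    """
    local = moment.astimezone(tz)
    return local.weekday() < 5 and start <= local.time() < end


# Assumption for illustration: a fixed UTC-4 offset (New York during DST).
new_york_dst = timezone(timedelta(hours=-4))
monday_21_utc = datetime(2025, 6, 2, 21, 0, tzinfo=timezone.utc)

print(in_peak_window(monday_21_utc, new_york_dst))   # evaluated as 17:00 local time
print(in_peak_window(monday_21_utc, timezone.utc))   # evaluated as 21:00 UTC
```

The same instant is inside the window in one timezone and outside it in the other, which is the mismatch described in the list. On Python 3.9+, passing zoneinfo.ZoneInfo("America/New_York") as tz gives daylight-saving-aware results.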

Related guides

  • Scale AI services based on request queue depth with KEDA
  • Horizontal pod autoscaling for AI inference services
  • Implement GPU-based autoscaling with custom metrics