HOW-TO · OPS

How to set up Prometheus alerting rules for AI service degradation

intermediate20 minBy Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Prometheus configured, AI service metrics to monitor

What this does

This guide configures Prometheus alerting rules that detect AI service degradation across four dimensions: increased error rate, elevated latency, token cost anomalies, and model availability drops. Each alert is tuned for AI workloads — for example, latency alerts use percentile thresholds rather than averages, and error alerts distinguish between client errors (4xx) and server errors (5xx). All alerts route through Alertmanager to the appropriate notification channel.

Steps

  1. Create an alert rules file at /etc/prometheus/rules/ai-degradation.yml:

    groups:
      - name: ai-service-degradation
    
  2. Add a rule for elevated error rate. This fires when more than 5% of requests fail over 5 minutes:

        - alert: AIHighErrorRate
          expr: |
            sum(rate(ai_requests_total{status="error"}[5m]))
            / sum(rate(ai_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "AI service error rate above 5%"
            description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes."
    
  3. Add latency degradation alerts using histogram quantiles:

        - alert: AIHighP99Latency
          expr: |
            histogram_quantile(0.99,
              rate(ai_request_duration_seconds_bucket[5m])
            ) > 30
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "AI p99 latency exceeds 30 seconds"
    
  4. Add token cost anomaly detection:

        - alert: AITokenCostSpike
          expr: |
            rate(ai_inference_cost_cents_total[1h])
            / rate(ai_inference_cost_cents_total[1h] offset 24h) > 3
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "AI token cost 3x higher than same hour yesterday"
    
  5. Add model unavailability detection:

        - alert: AIModelDown
          expr: up{job="ai-inference"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "AI model endpoint is down"
            description: "Instance {{ $labels.instance }} has been down for 2 minutes."
    
  6. Validate the rules file:

    promtool check rules ai-degradation.yml
    

    Expected output: SUCCESS: 4 rules found.

  7. Load the rules into Prometheus by referencing them in prometheus.yml:

    rule_files:
      - "/etc/prometheus/rules/ai-degradation.yml"
    
  8. Reload Prometheus and verify alerts appear in the Alerts page at http://localhost:9090/alerts. Each rule should show as inactive (green).

Verification

curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[] | select(.name=="ai-service-degradation") | .rules | length'

Expected output: 4.

Common failures

  • Alert fires continuously after deploy — the for duration is too short. Increase to 10-15 minutes to avoid flapping from transient errors during rollout.
  • Latency alert never fires despite slow responses — the histogram buckets may not extend high enough. Add buckets at 60, 120, and 300 seconds.
  • Token cost comparison returns NaN — the 24h offset requires at least 25 hours of metric history. Disable the alert or use a shorter offset during initial deployment.
  • Alertmanager does not receive alerts — check the Alertmanager tab in Prometheus UI (Status > Runtime & Build Information) for alertmanagers discovered. Verify the URL in prometheus.yml under alerting > alertmanagers.

Related guides