What this does

This guide categorizes AI agent errors into a taxonomy (model timeout, tool execution failure, rate limit, invalid output format, and retry exhaustion) and exposes each category as a Prometheus counter. Operators can then alert on specific error classes independently, avoiding alert fatigue from generic error-rate thresholds. The classification runs inside the agent's exception handler, inspecting exception types and API status codes.

Steps

Define an error taxonomy as an enum and create labelled counters in errors.py:

from prometheus_client import Counter
from enum import Enum

class ErrorClass(Enum):
    MODEL_TIMEOUT = "model_timeout"
    TOOL_FAILURE = "tool_failure"
    RATE_LIMIT = "rate_limit"
    INVALID_OUTPUT = "invalid_output"
    RETRY_EXHAUSTION = "retry_exhaustion"
    UNKNOWN = "unknown"

agent_errors = Counter(
    "ai_agent_errors_total",
    "AI agent errors by classification",
    ["error_class", "agent_id"]
)

In the agent's main exception handler, classify and count each error:

try:
    result = agent.run(task)
except asyncio.TimeoutError:
    agent_errors.labels(error_class=ErrorClass.MODEL_TIMEOUT.value, agent_id=agent.name).inc()
except ToolExecutionError as e:
    agent_errors.labels(error_class=ErrorClass.TOOL_FAILURE.value, agent_id=agent.name).inc()
except RateLimitError:
    agent_errors.labels(error_class=ErrorClass.RATE_LIMIT.value, agent_id=agent.name).inc()
except OutputValidationError:
    agent_errors.labels(error_class=ErrorClass.INVALID_OUTPUT.value, agent_id=agent.name).inc()
except MaxRetriesExceeded:
    agent_errors.labels(error_class=ErrorClass.RETRY_EXHAUSTION.value, agent_id=agent.name).inc()
except Exception:
    agent_errors.labels(error_class=ErrorClass.UNKNOWN.value, agent_id=agent.name).inc()
    raise

Expose the metrics endpoint. Confirm counters appear:
```
curl -s http://localhost:8000/metrics | grep ai_agent_errors_total
```
Expected output: at least six lines, one per error class, with initial value 0.
In Grafana, create a stacked bar chart panel showing errors per classification:
```
sum by (error_class) (rate(ai_agent_errors_total[5m]))
```

Configure an alert for model timeout spikes specifically:

- alert: ModelTimeoutSpike
  expr: rate(ai_agent_errors_total{error_class="model_timeout"}[5m]) > 0.1
  for: 3m
  annotations:
    summary: "Model timeout rate exceeds 0.1/sec"

Create a separate lower-severity alert for tool failures to avoid conflating infrastructure issues with agent logic bugs:

- alert: ToolFailureRate
  expr: rate(ai_agent_errors_total{error_class="tool_failure"}[10m]) > 0.05
  severity: warning

Periodically review the unknown error counter. A rising unknown count indicates new failure modes that need explicit classification.

Verification

curl -s http://localhost:8000/metrics | grep 'ai_agent_errors_total{' | wc -l

Expected output: 6 (one per error class defined in the taxonomy).

Common failures

Counters stuck at zero — confirm the exception handler wraps the entire agent run loop. Errors raised outside the try/except block (e.g., in thread pools) are not counted. Use sys.excepthook for unhandled exceptions.
Duplicate error classes in output — restarted processes with the same metric name clash with Prometheus's default registry. Use a dedicated CollectorRegistry instance.
Grafana panel shows zero rate — the rate() function requires at least two data points within the time window. Trigger several errors and wait for the next scrape interval (default 15 seconds).

How to monitor AI agent error rates by error classification

What this does

Steps

Verification

Common failures

Related guides