How to monitor AI agent error rates by error classification
Error tracking instrumented in agent, Prometheus
What this does
This guide categorizes AI agent errors into a taxonomy (model timeout, tool execution failure, rate limit, invalid output format, and retry exhaustion) and exposes each category as a Prometheus counter. Operators can then alert on specific error classes independently, avoiding alert fatigue from generic error-rate thresholds. The classification runs inside the agent's exception handler, inspecting exception types and API status codes.
Steps
Define an error taxonomy as an enum and create labelled counters in errors.py:

    from enum import Enum

    from prometheus_client import Counter


    class ErrorClass(Enum):
        MODEL_TIMEOUT = "model_timeout"
        TOOL_FAILURE = "tool_failure"
        RATE_LIMIT = "rate_limit"
        INVALID_OUTPUT = "invalid_output"
        RETRY_EXHAUSTION = "retry_exhaustion"
        UNKNOWN = "unknown"


    agent_errors = Counter(
        "ai_agent_errors_total",
        "AI agent errors by classification",
        ["error_class", "agent_id"],
    )

In the agent's main exception handler, classify and count each error:

    import asyncio

    from errors import ErrorClass, agent_errors

    # ToolExecutionError, RateLimitError, OutputValidationError and
    # MaxRetriesExceeded are assumed to come from your agent framework
    # or model client; import them from wherever they are defined.
    try:
        result = agent.run(task)
    except asyncio.TimeoutError:
        agent_errors.labels(error_class=ErrorClass.MODEL_TIMEOUT.value, agent_id=agent.name).inc()
    except ToolExecutionError:
        agent_errors.labels(error_class=ErrorClass.TOOL_FAILURE.value, agent_id=agent.name).inc()
    except RateLimitError:
        agent_errors.labels(error_class=ErrorClass.RATE_LIMIT.value, agent_id=agent.name).inc()
    except OutputValidationError:
        agent_errors.labels(error_class=ErrorClass.INVALID_OUTPUT.value, agent_id=agent.name).inc()
    except MaxRetriesExceeded:
        agent_errors.labels(error_class=ErrorClass.RETRY_EXHAUSTION.value, agent_id=agent.name).inc()
    except Exception:
        # Count anything unclassified, then re-raise so it is not swallowed.
        agent_errors.labels(error_class=ErrorClass.UNKNOWN.value, agent_id=agent.name).inc()
        raise

Expose the metrics endpoint. Confirm counters appear:
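
    curl -s http://localhost:8000/metrics | grep ai_agent_errors_total

Expected output: the # HELP and # TYPE lines plus one sample line per error class (six per agent_id), each with initial value 0. The Python client exports a labelled series only after .labels() has been called for that label combination, so the zero-valued lines appear only if every error class is initialized at startup; otherwise a class shows up after its first error.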
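
The step above does not show how the endpoint is exposed or how the counters come to exist at 0. The following is a minimal sketch of one way to do both, assuming the prometheus_client package. The helper names init_error_series and serve_metrics, the agent id "demo-agent", and the use of a dedicated registry are illustrative assumptions, not part of the original guide.

```python
from enum import Enum

from prometheus_client import (
    CollectorRegistry,
    Counter,
    generate_latest,
    start_http_server,
)


class ErrorClass(Enum):
    MODEL_TIMEOUT = "model_timeout"
    TOOL_FAILURE = "tool_failure"
    RATE_LIMIT = "rate_limit"
    INVALID_OUTPUT = "invalid_output"
    RETRY_EXHAUSTION = "retry_exhaustion"
    UNKNOWN = "unknown"


# Assumption: a dedicated registry keeps this sketch isolated from any
# metrics already registered in the default registry.
registry = CollectorRegistry()

agent_errors = Counter(
    "ai_agent_errors_total",
    "AI agent errors by classification",
    ["error_class", "agent_id"],
    registry=registry,
)


def init_error_series(agent_id: str) -> None:
    """Touch every label combination so each series is exported at 0."""
    for error_class in ErrorClass:
        agent_errors.labels(error_class=error_class.value, agent_id=agent_id)


def serve_metrics(agent_id: str, port: int = 8000) -> None:
    """Initialize the series, then serve /metrics from a background thread."""
    init_error_series(agent_id)
    start_http_server(port, registry=registry)
```

Calling serve_metrics("demo-agent") once at agent startup, before the run loop, should make the curl check return six sample lines. start_http_server runs in a daemon thread, so the process must stay alive for the endpoint to remain reachable.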
In Grafana, create a stacked bar chart panel showing errors per classification:

    sum by (error_class) (rate(ai_agent_errors_total[5m]))

Configure an alert for model timeout spikes specifically:

    - alert: ModelTimeoutSpike
      expr: rate(ai_agent_errors_total{error_class="model_timeout"}[5m]) > 0.1
      for: 3m
      annotations:
        summary: "Model timeout rate exceeds 0.1/sec"

Create a separate lower-severity alert for tool failures to avoid conflating infrastructure issues with agent logic bugs:

    - alert: ToolFailureRate
      expr: rate(ai_agent_errors_total{error_class="tool_failure"}[10m]) > 0.05
      labels:
        severity: warning

Periodically review the unknown error counter. A rising unknown count indicates new failure modes that need explicit classification.
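
The classification described in this guide inspects both exception types and API status codes, and the unknown bucket is only reviewable if something records what landed in it. The sketch below shows one possible shape for that, using only the standard library. The exception classes are hypothetical stand-ins for whatever your agent framework raises, and the status_code attribute and the 408/429/504 mapping are assumptions to adapt to your model client.

```python
import asyncio
import logging
from enum import Enum

logger = logging.getLogger("agent.errors")


class ErrorClass(Enum):
    MODEL_TIMEOUT = "model_timeout"
    TOOL_FAILURE = "tool_failure"
    RATE_LIMIT = "rate_limit"
    INVALID_OUTPUT = "invalid_output"
    RETRY_EXHAUSTION = "retry_exhaustion"
    UNKNOWN = "unknown"


# Hypothetical stand-ins for the framework's own exception types.
class ToolExecutionError(Exception): ...
class RateLimitError(Exception): ...
class OutputValidationError(Exception): ...
class MaxRetriesExceeded(Exception): ...


# Checked in order; the first matching exception type wins.
_TYPE_RULES = (
    ((asyncio.TimeoutError, TimeoutError), ErrorClass.MODEL_TIMEOUT),
    (ToolExecutionError, ErrorClass.TOOL_FAILURE),
    (RateLimitError, ErrorClass.RATE_LIMIT),
    (OutputValidationError, ErrorClass.INVALID_OUTPUT),
    (MaxRetriesExceeded, ErrorClass.RETRY_EXHAUSTION),
)

# Assumed mapping from HTTP status codes to error classes.
_STATUS_RULES = {
    408: ErrorClass.MODEL_TIMEOUT,
    429: ErrorClass.RATE_LIMIT,
    504: ErrorClass.MODEL_TIMEOUT,
}


def classify(exc: BaseException) -> ErrorClass:
    """Map an exception to an ErrorClass by type, then by status code."""
    for exc_types, error_class in _TYPE_RULES:
        if isinstance(exc, exc_types):
            return error_class
    status = getattr(exc, "status_code", None)  # assumed attribute name
    if status in _STATUS_RULES:
        return _STATUS_RULES[status]
    # Log the concrete type so the unknown bucket can be reviewed later.
    logger.warning("unclassified agent error: %s", type(exc).__name__)
    return ErrorClass.UNKNOWN
```

With a helper like this, the handler could shrink to a single except Exception clause that increments agent_errors with classify(exc).value, and the warning log gives the periodic review of unknown errors something concrete to work from.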
Verification

    curl -s http://localhost:8000/metrics | grep 'ai_agent_errors_total{' | wc -l

Expected output: 6 per agent_id (one per error class defined in the taxonomy), once every error class has been initialized or has fired at least once.
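
A line count alone does not say which error class is missing. As a rough sketch, the same check can be done in Python by parsing the exposition text for error_class labels. The function names and the regular expressions here are illustrative assumptions, and the parsing is deliberately simple rather than a full parser for the exposition format.

```python
import re
import urllib.request

EXPECTED_CLASSES = frozenset({
    "model_timeout",
    "tool_failure",
    "rate_limit",
    "invalid_output",
    "retry_exhaustion",
    "unknown",
})

# Matches sample lines such as:
#   ai_agent_errors_total{agent_id="a",error_class="unknown"} 0.0
_SERIES = re.compile(r"^ai_agent_errors_total\{([^}]*)\}\s")
_ERROR_CLASS = re.compile(r'error_class="([^"]*)"')


def missing_error_classes(metrics_text: str) -> set:
    """Return the expected error classes that have no exported series."""
    seen = set()
    for line in metrics_text.splitlines():
        series = _SERIES.match(line)
        if series is None:
            continue  # comment line or a different metric
        label = _ERROR_CLASS.search(series.group(1))
        if label is not None:
            seen.add(label.group(1))
    return set(EXPECTED_CLASSES) - seen


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> str:
    """Download the exposition text from the agent's metrics endpoint."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")
```

Running missing_error_classes(fetch_metrics()) against a live agent should return an empty set when all six classes are exported.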
Common failures
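- Counters stuck at zero: confirm the exception handler wraps the entire agent run loop. Errors raised outside the try/except block (e.g., in worker threads) are not counted. sys.excepthook only sees uncaught exceptions in the main thread; use threading.excepthook (Python 3.8+) for other threads. In a thread pool the executor stores the exception on the future instead of raising it, so check each future's result or exception.
- Duplicate error classes in output: registering ai_agent_errors_total more than once in the same process (for example when errors.py is reloaded or imported under two module names) clashes in the client library's default registry and raises a ValueError about duplicated timeseries. Use a dedicated CollectorRegistry instance.
- Grafana panel shows zero rate: the rate() function requires at least two data points within the time window, so a series that has only just appeared returns nothing. Trigger several errors and wait for at least two scrapes (Prometheus's default scrape_interval is 1 minute; many configurations set it to 15 seconds).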