RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

© 2026 runlocalai.co · Independently operated
HOW-TO · OPS

How to monitor AI agent error rates by error classification

intermediate · 20 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Error tracking instrumented in agent, Prometheus

What this does

This guide categorizes AI agent errors into a taxonomy (model timeout, tool execution failure, rate limit, invalid output format, and retry exhaustion) and exposes each category as a Prometheus counter. Operators can then alert on specific error classes independently, avoiding alert fatigue from generic error-rate thresholds. The classification runs inside the agent's exception handler, inspecting exception types and API status codes.
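
As a rough illustration of classifying by both exception type and API status code, the sketch below maps an exception to one of the taxonomy values. The `ApiError` class, the `status_code` attribute, and the choice to treat 408/504 as timeouts and `ValueError` as invalid output are assumptions made for this example, not part of any specific agent framework.

```python
import asyncio
from enum import Enum
from typing import Optional


class ErrorClass(Enum):
    MODEL_TIMEOUT = "model_timeout"
    TOOL_FAILURE = "tool_failure"
    RATE_LIMIT = "rate_limit"
    INVALID_OUTPUT = "invalid_output"
    RETRY_EXHAUSTION = "retry_exhaustion"
    UNKNOWN = "unknown"


class ApiError(Exception):
    """Hypothetical stand-in for an HTTP client error carrying a status code."""

    def __init__(self, status_code: int, message: str = "") -> None:
        super().__init__(message or f"HTTP {status_code}")
        self.status_code = status_code


def classify(exc: BaseException) -> ErrorClass:
    """Map an exception to an ErrorClass: type first, then status code."""
    # asyncio.TimeoutError is the built-in TimeoutError on Python 3.11+;
    # checking both keeps the sketch correct on older interpreters.
    if isinstance(exc, (TimeoutError, asyncio.TimeoutError)):
        return ErrorClass.MODEL_TIMEOUT
    status: Optional[int] = getattr(exc, "status_code", None)
    if status == 429:  # Too Many Requests
        return ErrorClass.RATE_LIMIT
    if status in (408, 504):  # assumption: treat HTTP timeouts as model timeouts
        return ErrorClass.MODEL_TIMEOUT
    if isinstance(exc, ValueError):  # assumption: parse failures (e.g. bad JSON)
        return ErrorClass.INVALID_OUTPUT
    return ErrorClass.UNKNOWN
```

A single `except Exception as exc:` handler can then call `classify(exc)` and increment the counter with the returned value, which keeps the mapping in one testable function.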

Steps

  1. Define an error taxonomy as an enum and create labelled counters in errors.py:

    from prometheus_client import Counter
    from enum import Enum
    
    class ErrorClass(Enum):
        MODEL_TIMEOUT = "model_timeout"
        TOOL_FAILURE = "tool_failure"
        RATE_LIMIT = "rate_limit"
        INVALID_OUTPUT = "invalid_output"
        RETRY_EXHAUSTION = "retry_exhaustion"
        UNKNOWN = "unknown"
    
    agent_errors = Counter(
        "ai_agent_errors_total",
        "AI agent errors by classification",
        ["error_class", "agent_id"]
    )
    
  2. In the agent's main exception handler, classify and count each error:

    import asyncio

    from errors import ErrorClass, agent_errors
    # ToolExecutionError, RateLimitError, OutputValidationError and MaxRetriesExceeded
    # come from your agent framework or your own code; import them from there.

    try:
        result = agent.run(task)
    except asyncio.TimeoutError:  # same class as the built-in TimeoutError on Python 3.11+
        agent_errors.labels(error_class=ErrorClass.MODEL_TIMEOUT.value, agent_id=agent.name).inc()
    except ToolExecutionError:
        agent_errors.labels(error_class=ErrorClass.TOOL_FAILURE.value, agent_id=agent.name).inc()
    except RateLimitError:
        agent_errors.labels(error_class=ErrorClass.RATE_LIMIT.value, agent_id=agent.name).inc()
    except OutputValidationError:
        agent_errors.labels(error_class=ErrorClass.INVALID_OUTPUT.value, agent_id=agent.name).inc()
    except MaxRetriesExceeded:
        agent_errors.labels(error_class=ErrorClass.RETRY_EXHAUSTION.value, agent_id=agent.name).inc()
    except Exception:
        # Unclassified: count it, then re-raise so the failure is not hidden.
        agent_errors.labels(error_class=ErrorClass.UNKNOWN.value, agent_id=agent.name).inc()
        raise
    
  3. Expose the metrics endpoint. Confirm counters appear:

    curl -s http://localhost:8000/metrics | grep ai_agent_errors_total
    

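    Expected output: the # HELP and # TYPE lines plus one sample line per error class, each with value 0.0. prometheus_client exports a labelled series only after .labels() has been called for that label combination, so initialise all six classes at startup; otherwise only classes that have already fired are listed.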

  4. In Grafana, create a stacked bar chart panel showing errors per classification:

    sum by (error_class) (rate(ai_agent_errors_total[5m]))
    
  5. Configure an alert for model timeout spikes specifically:

    - alert: ModelTimeoutSpike
      expr: rate(ai_agent_errors_total{error_class="model_timeout"}[5m]) > 0.1
      for: 3m
      annotations:
        summary: "Model timeout rate exceeds 0.1/sec"
    
  6. Create a separate lower-severity alert for tool failures to avoid conflating infrastructure issues with agent logic bugs:

    - alert: ToolFailureRate
      expr: rate(ai_agent_errors_total{error_class="tool_failure"}[10m]) > 0.05
      labels:
        severity: warning
    
  7. Periodically review the unknown error counter. A rising unknown count indicates new failure modes that need explicit classification.
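
Steps 1 and 3 can be sketched together as follows. This is a minimal example that assumes the `prometheus_client` package is installed; the `init_error_series` helper and the `"demo-agent"` id are placeholders introduced here. It registers the counter on a dedicated registry, touches every label combination so each series is exported at zero, and renders the exposition text in-process instead of going through `curl`.

```python
from enum import Enum

from prometheus_client import CollectorRegistry, Counter, generate_latest


class ErrorClass(Enum):
    MODEL_TIMEOUT = "model_timeout"
    TOOL_FAILURE = "tool_failure"
    RATE_LIMIT = "rate_limit"
    INVALID_OUTPUT = "invalid_output"
    RETRY_EXHAUSTION = "retry_exhaustion"
    UNKNOWN = "unknown"


# A dedicated registry keeps this sketch independent of the process-wide default.
registry = CollectorRegistry()

agent_errors = Counter(
    "ai_agent_errors_total",
    "AI agent errors by classification",
    ["error_class", "agent_id"],
    registry=registry,
)


def init_error_series(agent_id: str) -> None:
    """Create every error_class series for one agent so each is exported as 0."""
    for error_class in ErrorClass:
        # Calling .labels() without .inc() is enough to create the child series.
        agent_errors.labels(error_class=error_class.value, agent_id=agent_id)


init_error_series("demo-agent")  # placeholder agent_id for illustration

# Render the same text a scrape of /metrics would return for this registry.
exposition = generate_latest(registry).decode("utf-8")
series = [
    line for line in exposition.splitlines()
    if line.startswith("ai_agent_errors_total{")
]
print(len(series))
```

To serve this registry over HTTP on the port used in this guide, `prometheus_client.start_http_server(8000, registry=registry)` is one option. Pre-initialising also helps `rate()`, since a counter that first appears with a non-zero value has no earlier sample to compare against.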

Verification

curl -s http://localhost:8000/metrics | grep 'ai_agent_errors_total{' | wc -l

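Expected output: 6 for a single agent_id (one series per error class in the taxonomy), provided every class was initialised with .labels() at startup. With several agents the count is 6 per agent_id.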
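
A line count cannot tell you which class is missing. As a stricter variant of the check above, the sketch below parses the metrics text and reports any taxonomy value with no exported series. The `missing_classes` helper and the regular expressions are illustrative assumptions; they handle the simple label values used in this guide, not the full exposition format.

```python
import re

EXPECTED_CLASSES = {
    "model_timeout",
    "tool_failure",
    "rate_limit",
    "invalid_output",
    "retry_exhaustion",
    "unknown",
}

# Matches sample lines such as: ai_agent_errors_total{agent_id="a",error_class="x"} 0.0
SERIES_RE = re.compile(r"^ai_agent_errors_total\{([^}]*)\}\s")
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def missing_classes(metrics_text: str) -> set:
    """Return the expected error classes that have no exported series."""
    seen = set()
    for line in metrics_text.splitlines():
        match = SERIES_RE.match(line)
        if match is None:
            continue  # skips # HELP / # TYPE lines and other metrics
        labels = dict(LABEL_RE.findall(match.group(1)))
        if "error_class" in labels:
            seen.add(labels["error_class"])
    return EXPECTED_CLASSES - seen
```

The text can be fetched with `urllib.request.urlopen("http://localhost:8000/metrics").read().decode()`; an empty set from `missing_classes` means every class is exported.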

Common failures

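  • Counters stuck at zero: confirm the exception handler wraps the entire agent run loop. Errors raised outside the try/except block are not counted. In a thread pool the exception is stored on the Future and surfaces only when result() is called; in a bare thread it goes to threading.excepthook rather than sys.excepthook. Use sys.excepthook only as a last-resort counter for unhandled exceptions on the main thread.
  • Duplicate error classes in output, or a "Duplicated timeseries" ValueError at startup: this usually means the counter is registered more than once in the same process (for example, errors.py imported under two module paths or re-executed by a hot reloader). Define it once at module level, or register it on a dedicated CollectorRegistry instance.
  • Grafana panel shows zero rate or no data: rate() needs at least two samples inside the range window. Trigger several errors and wait for at least two scrapes (the scrape interval is 15 seconds in the sample prometheus.yml; Prometheus's built-in default is 1 minute).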
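
For the thread pool case in the first item, one option is to attach a done-callback to each Future so that failures are counted even if nobody calls `result()`. The sketch below uses only the standard library; `error_tally` is a stand-in for the Prometheus counter, and `flaky_tool` is a made-up task used to trigger failures.

```python
import threading
from collections import Counter as TallyCounter
from concurrent.futures import Future, ThreadPoolExecutor

# Stand-in for agent_errors.labels(...).inc(); swap in the real counter.
error_tally: TallyCounter = TallyCounter()
_tally_lock = threading.Lock()


def record_failure(future: Future) -> None:
    """Done-callback: count the exception stored on a finished Future."""
    if future.cancelled():
        return
    exc = future.exception()  # returns immediately because the future is done
    if exc is not None:
        with _tally_lock:
            error_tally[type(exc).__name__] += 1


def flaky_tool(n: int) -> int:
    """Hypothetical tool call that fails on odd inputs."""
    if n % 2:
        raise RuntimeError(f"tool failed on input {n}")
    return n


with ThreadPoolExecutor(max_workers=2) as pool:
    for n in range(4):
        pool.submit(flaky_tool, n).add_done_callback(record_failure)

# Leaving the with-block waits for all workers, so every callback has run.
print(dict(error_tally))
```

In a real agent the callback would map the exception to an error class before incrementing the counter, rather than tallying by exception name.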

Related guides

  • Instrument a Python FastAPI AI service with Prometheus metrics
  • Set up Prometheus alerting rules for AI service degradation
  • Monitor AI agent token usage and cost in real-time using Prometheus counters