RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
HOW-TO · OPS

How to monitor AI agent token usage and cost in real-time using Prometheus counters

intermediate·20 min·By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

A running Prometheus server and an AI API endpoint that returns token usage metadata in its responses

What this does

This guide creates Prometheus counters that track total input tokens, output tokens, and estimated API cost for each AI model inference call. The counters accumulate across all agent interactions, which enables real-time cost dashboards and budget alerting. Token counts are read from the usage field of each API response and converted to monetary values using a per-model pricing table.
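
The token-to-cost conversion is plain arithmetic, and the unit of the price table decides the divisor. A minimal sketch, assuming prices are quoted in USD per 1 million tokens; the table name PRICES_USD_PER_1M and the function estimate_cost_cents are illustrative names, not part of any library:

```python
# Hypothetical price table. Assumed unit: USD per 1 million tokens.
PRICES_USD_PER_1M = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def estimate_cost_cents(input_tokens: int, output_tokens: int, price: dict) -> float:
    """Convert token counts to an estimated cost in US cents."""
    cost_usd = (
        input_tokens * price["input"] + output_tokens * price["output"]
    ) / 1_000_000  # prices are per 1 million tokens
    return cost_usd * 100  # USD to cents


if __name__ == "__main__":
    cents = estimate_cost_cents(1_000, 500, PRICES_USD_PER_1M["gpt-4o"])
    print(f"{cents:.4f} cents")
```

With these assumed prices, 1,000 input tokens cost $0.0025 and 500 output tokens cost $0.005, so the call adds 0.75 cents to the cost counter.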

Steps

  1. Install the Prometheus Python client:

    pip install prometheus-client
    

    Expected output: a final line such as Successfully installed prometheus-client-0.19.0 (the version number depends on the current release).

  2. Define token counters in a new metrics.py module:

    from prometheus_client import Counter, generate_latest, CollectorRegistry

    # Dedicated registry: only the metrics defined here are exposed.
    registry = CollectorRegistry()
    # One time series per value of the "model" label; counters only ever increase.
    token_input = Counter("ai_token_input_total", "Input tokens consumed", ["model"], registry=registry)
    token_output = Counter("ai_token_output_total", "Output tokens generated", ["model"], registry=registry)
    cost_total = Counter("ai_inference_cost_cents_total", "Estimated cost in cents", ["model"], registry=registry)
    
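  3. Hook the counters into the LLM call path. After each API response, extract the token counts from the usage field and increment the counters. The prices in PRICING (step 4) are USD per 1 million tokens, so divide by 1,000,000 to get dollars and multiply by 100 to get cents:

    usage = response.get("usage") or {}  # tolerate a missing or null usage field
    input_tokens = usage.get("prompt_tokens", 0)
    output_tokens = usage.get("completion_tokens", 0)
    token_input.labels(model=model_name).inc(input_tokens)
    token_output.labels(model=model_name).inc(output_tokens)
    price_per_1m = PRICING[model_name]  # raises KeyError for a model missing from PRICING
    cost_usd = (input_tokens * price_per_1m["input"] + output_tokens * price_per_1m["output"]) / 1_000_000
    cost_total.labels(model=model_name).inc(cost_usd * 100)  # USD to cents
    
  4. Define a per-model pricing dictionary at the top of the module. Values are USD per 1 million tokens; check them against each provider's current price list:

    # USD per 1 million tokens
    PRICING = {
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "claude-3.5-sonnet": {"input": 3.00, "output": 15.00},
    }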
    
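  5. Expose the metrics on a dedicated HTTP endpoint. In a FastAPI app, add a /metrics route (a Flask app needs the equivalent route built with Flask's own Response class):

    from fastapi import Response
    from prometheus_client import CONTENT_TYPE_LATEST

    @app.get("/metrics")
    def metrics():
        return Response(content=generate_latest(registry), media_type=CONTENT_TYPE_LATEST)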
    
  6. Add the metrics endpoint as a Prometheus scrape target in prometheus.yml:

    scrape_configs:
      - job_name: "ai-agent"
        static_configs:
          - targets: ["localhost:8000"]
    
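  7. Reload the Prometheus configuration. The reload endpoint is only enabled when Prometheus was started with --web.enable-lifecycle; otherwise restart Prometheus or send it SIGHUP:

    curl -X POST http://localhost:9090/-/reload
    

    Expected result: an HTTP 200 response. Plain curl prints nothing on success, so add -i to see the status line.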

  8. Verify token counters are being scraped by querying ai_token_input_total in the Prometheus expression browser at http://localhost:9090/graph.
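
Steps 2 to 4 can be combined into one module that runs without a web server or a real API call. A minimal sketch, assuming prices in USD per 1 million tokens and a response that is a plain dict with an OpenAI-style usage field; record_usage is a hypothetical helper name, not part of prometheus_client:

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# Assumed unit: USD per 1 million tokens.
PRICING = {"gpt-4o": {"input": 2.50, "output": 10.00}}

registry = CollectorRegistry()
token_input = Counter(
    "ai_token_input_total", "Input tokens consumed", ["model"], registry=registry
)
token_output = Counter(
    "ai_token_output_total", "Output tokens generated", ["model"], registry=registry
)
cost_total = Counter(
    "ai_inference_cost_cents_total", "Estimated cost in cents", ["model"], registry=registry
)


def record_usage(response: dict, model_name: str) -> None:
    """Hypothetical helper: update all three counters from one API response."""
    usage = response.get("usage") or {}
    input_tokens = usage.get("prompt_tokens", 0)
    output_tokens = usage.get("completion_tokens", 0)
    token_input.labels(model=model_name).inc(input_tokens)
    token_output.labels(model=model_name).inc(output_tokens)

    price = PRICING.get(model_name)
    if price is None:
        return  # unknown model: tokens are counted, cost is left untouched
    cost_usd = (
        input_tokens * price["input"] + output_tokens * price["output"]
    ) / 1_000_000
    cost_total.labels(model=model_name).inc(cost_usd * 100)


if __name__ == "__main__":
    fake_response = {"usage": {"prompt_tokens": 1000, "completion_tokens": 500}}
    record_usage(fake_response, "gpt-4o")
    # Print the same text Prometheus would scrape from /metrics.
    print(generate_latest(registry).decode())
```

Skipping the cost increment for an unpriced model, rather than raising, is a design choice in this sketch: token counts keep flowing while the missing price entry is fixed.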

Verification

curl -s http://localhost:8000/metrics | grep "ai_token_input_total"

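Expected output: the # HELP and # TYPE lines for the metric, followed by a sample line like ai_token_input_total{model="gpt-4o"} 12345.0. The sample line appears only after at least one inference call has been recorded for that model.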
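
The same check can be scripted, which helps distinguish "metric is defined" from "metric has samples". A minimal sketch that parses exposition text with the standard library only; counter_samples is a hypothetical helper, and the sample text below is illustrative rather than captured output. In practice the text would come from the /metrics endpoint.

```python
import re

# Matches lines such as: metric_name{label="value"} 12.0
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)


def counter_samples(exposition_text: str, metric_name: str) -> dict:
    """Return {label_string: value} for every sample line of one metric."""
    samples = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match and match.group("name") == metric_name:
            samples[match.group("labels") or ""] = float(match.group("value"))
    return samples


# Illustrative exposition text, not real scrape output.
EXAMPLE_TEXT = """\
# HELP ai_token_input_total Input tokens consumed
# TYPE ai_token_input_total counter
ai_token_input_total{model="gpt-4o"} 12345.0
"""

if __name__ == "__main__":
    found = counter_samples(EXAMPLE_TEXT, "ai_token_input_total")
    if not found:
        print("metric is defined but has no samples yet")
    for labels, value in found.items():
        print(labels, value)
```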

Common failures

  • Counters return empty: ensure the LLM response includes a usage field. Test with print(response.get("usage")) to confirm the structure matches expectations. Labeled counters also expose no sample lines until the first call has been recorded.
  • Prometheus scrape target down: verify the agent service is listening on the correct port with ss -tlnp | grep 8000.
  • Cost calculation fails or stays at zero: confirm the model name returned by the API exactly matches a key in the PRICING dictionary. A direct PRICING[model_name] lookup raises KeyError on a mismatch.
  • Duplicated metrics on reload: use a module-level CollectorRegistry rather than the default registry to avoid duplicate registration errors.
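
The first and third failures can be made visible in logs instead of surfacing as silent zeros. A minimal sketch using only the standard library; extract_usage and lookup_price are hypothetical helper names, and the dated model name in the usage note is only an example of a name that would not match a "gpt-4o" key:

```python
import logging

logger = logging.getLogger("token_metrics")


def extract_usage(response) -> tuple:
    """Return (input_tokens, output_tokens), or (0, 0) if usage is missing."""
    usage = response.get("usage") if isinstance(response, dict) else None
    if not isinstance(usage, dict):
        logger.warning("response has no usage field; token counters will not move")
        return 0, 0
    return (
        int(usage.get("prompt_tokens") or 0),
        int(usage.get("completion_tokens") or 0),
    )


def lookup_price(model_name: str, pricing: dict):
    """Return the price entry for a model, or None with a warning if absent."""
    price = pricing.get(model_name)
    if price is None:
        logger.warning(
            "no pricing entry for model %r; known models: %s",
            model_name,
            sorted(pricing),
        )
    return price
```

For example, lookup_price("gpt-4o-2024-08-06", {"gpt-4o": {"input": 2.50, "output": 10.00}}) returns None and logs the mismatch, which points directly at the key that needs adding.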

Related guides

  • Prometheus alerting rules for AI service degradation
  • Instrument a Python FastAPI AI service with Prometheus metrics
  • Create SLO burn rate alerts for AI agent availability