What this does

This guide creates Prometheus counter metrics that track total input tokens, output tokens, and estimated API cost for every AI model inference call. The counters accumulate across all agent interactions, enabling real-time cost dashboards and budget alerting. Token counts are read from the usage field of each API response body and converted to monetary values using a per-model pricing table.

Steps

Install the Prometheus Python client:
```
pip install prometheus-client
```
Expected output: a line such as Successfully installed prometheus-client-0.19.0 (the version number depends on when you install).

Define token counters in a new metrics.py module:

from prometheus_client import Counter, generate_latest, CollectorRegistry
registry = CollectorRegistry()
token_input = Counter("ai_token_input_total", "Input tokens consumed", ["model"], registry=registry)
token_output = Counter("ai_token_output_total", "Output tokens generated", ["model"], registry=registry)
cost_total = Counter("ai_inference_cost_cents_total", "Estimated cost in cents", ["model"], registry=registry)
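
To see how a labeled counter behaves before wiring it into the call path, the following is a minimal sketch. The metric name demo_tokens_total and the increment values are made up for illustration; only the prometheus_client calls already used above, plus CollectorRegistry.get_sample_value, are assumed.

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# Separate registry so this demo does not touch the real metrics.
registry = CollectorRegistry()
demo_tokens = Counter(
    "demo_tokens_total", "Demo token counter", ["model"], registry=registry
)

# Before any labels() call the metric exports only its HELP/TYPE lines.
before = generate_latest(registry).decode()

# Each labels() call selects one time series; inc() adds to its running total.
demo_tokens.labels(model="gpt-4o").inc(100)
demo_tokens.labels(model="gpt-4o").inc(50)
demo_tokens.labels(model="claude-3.5-sonnet").inc(7)

gpt_total = registry.get_sample_value("demo_tokens_total", {"model": "gpt-4o"})
print(gpt_total)  # → 150.0
```

Counters only go up: inc() rejects negative amounts, so corrections have to be handled at query time rather than by decrementing.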

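Hook the counters into the LLM call path. After each API response, read the token counts from the usage field and increment the counters. The prompt_tokens and completion_tokens keys follow the OpenAI response format; other providers name these fields differently (Anthropic, for example, uses input_tokens and output_tokens):

input_tokens = response.get("usage", {}).get("prompt_tokens", 0)
output_tokens = response.get("usage", {}).get("completion_tokens", 0)
token_input.labels(model=model_name).inc(input_tokens)
token_output.labels(model=model_name).inc(output_tokens)
rates = PRICING[model_name]  # USD per 1M tokens
cost_cents = (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000 * 100
cost_total.labels(model=model_name).inc(cost_cents)

Define a per-model pricing dictionary at the top of the module. The rates are in USD per 1M tokens and are illustrative; check each provider's current price list before relying on them:

# USD per 1M tokens
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-3.5-sonnet": {"input": 3.00, "output": 15.00},
}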
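
As a worked example of the cost arithmetic, here is a small sketch that reads the rates as USD per 1M tokens, which is how providers usually quote prices of this magnitude. The function name estimate_cost_cents and the token counts are hypothetical and chosen only for illustration.

```python
# Illustrative rates, read as USD per 1M tokens (an assumption; verify
# against the provider's current price list).
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-3.5-sonnet": {"input": 3.00, "output": 15.00},
}


def estimate_cost_cents(model_name, input_tokens, output_tokens):
    """Return the estimated cost of one call in US cents."""
    rates = PRICING[model_name]
    usd = (
        input_tokens * rates["input"] + output_tokens * rates["output"]
    ) / 1_000_000
    return usd * 100  # dollars -> cents


# 1,200 input and 350 output tokens on gpt-4o:
# (1200 * 2.50 + 350 * 10.00) / 1,000,000 = 0.0065 USD = 0.65 cents
cents = estimate_cost_cents("gpt-4o", 1200, 350)
print(f"{cents:.4f} cents")  # → 0.6500 cents
```

A single call costs a fraction of a cent at these rates, which is why the counter accepts float increments rather than whole cents.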

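Expose metrics on a dedicated HTTP endpoint by adding a /metrics route to the app. The example uses FastAPI; in Flask the decorator is @app.route("/metrics"), and Response comes from flask and takes content_type instead of media_type:

from fastapi import Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

@app.get("/metrics")
def metrics():
    return Response(content=generate_latest(registry), media_type=CONTENT_TYPE_LATEST)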

Add the metrics endpoint as a Prometheus scrape target in prometheus.yml:

scrape_configs:
  - job_name: "ai-agent"
    static_configs:
      - targets: ["localhost:8000"]

Reload Prometheus configuration:
```
curl -X POST http://localhost:9090/-/reload
```
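Expected output: none on success; the endpoint returns HTTP 200 with an empty body, which curl -i makes visible. The reload endpoint is only enabled when Prometheus is started with --web.enable-lifecycle; without that flag, send SIGHUP to the Prometheus process instead.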
Verify token counters are being scraped by querying ai_token_input_total in the Prometheus expression browser at http://localhost:9090/graph.
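
The same check can be scripted against the Prometheus HTTP API instead of the expression browser. The sketch below only builds the instant-query URL and parses a response body; the helper names build_query_url and tokens_by_model are hypothetical, and sample_body is a hand-written stand-in for what /api/v1/query returns, not real output.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def tokens_by_model(response_body):
    """Map each model label to its current counter value."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    # Each instant-vector sample carries value = [timestamp, "string value"].
    return {
        sample["metric"].get("model", ""): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


# Hand-written example body in the instant-vector response shape.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "ai_token_input_total", "model": "gpt-4o"},
                "value": [1700000000.0, "12345"],
            }
        ],
    },
})

url = build_query_url("http://localhost:9090", "ai_token_input_total")
print(url)  # → http://localhost:9090/api/v1/query?query=ai_token_input_total
print(tokens_by_model(sample_body))  # → {'gpt-4o': 12345.0}
```

Fetching the URL (for example with urllib.request.urlopen) and passing the decoded body to tokens_by_model turns the manual check into something a smoke test can run.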

Verification

curl -s http://localhost:8000/metrics | grep "ai_token_input_total"

Expected output: a line like ai_token_input_total{model="gpt-4o"} 12345.0 confirming the counter is exposed.
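
For an automated version of this check, the exposition text can be parsed rather than grepped. The following is a deliberately simplified sketch that only understands samples with a single model label, as produced by the counters in this guide; samples_for is a hypothetical helper, and the example text is hand-written. For anything more general, the parser module shipped with prometheus_client is likely the better tool.

```python
import re

# Matches lines such as: ai_token_input_total{model="gpt-4o"} 12345.0
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'\{model="(?P<model>[^"]*)"\}\s+(?P<value>\S+)$'
)


def samples_for(metric_name, exposition_text):
    """Return {model: value} for one metric; # HELP / # TYPE lines are skipped."""
    found = {}
    for line in exposition_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match and match.group("name") == metric_name:
            found[match.group("model")] = float(match.group("value"))
    return found


# Hand-written stand-in for the body returned by /metrics.
example = """# HELP ai_token_input_total Input tokens consumed
# TYPE ai_token_input_total counter
ai_token_input_total{model="gpt-4o"} 12345.0
ai_token_input_total{model="claude-3.5-sonnet"} 678.0
ai_token_output_total{model="gpt-4o"} 999.0
"""

input_by_model = samples_for("ai_token_input_total", example)
print(input_by_model)  # → {'gpt-4o': 12345.0, 'claude-3.5-sonnet': 678.0}
```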

Common failures

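Counters return empty: a labeled counter exports no samples until labels(...) has been called at least once, so the series appear only after the first inference call. If calls are being made, ensure the LLM response includes a usage field; print(response.get("usage")) shows whether the structure matches expectations.
Prometheus scrape target down: verify the agent service is listening on the correct port with ss -tlnp | grep 8000.
Cost calculation fails or stays at zero: PRICING[model_name] raises KeyError when the model name returned by the API does not exactly match a key in the PRICING dictionary, and the cost stays at zero when the usage field is missing. Check both.
Duplicated metrics on reload: registering the same metric name twice in one registry raises ValueError: Duplicated timeseries in CollectorRegistry. Use a module-level CollectorRegistry rather than the default registry so that a re-executed module gets a fresh registry.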
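
One way to make the model-name mismatch explicit is to resolve names through an alias table before the pricing lookup. This is a sketch under assumptions: MODEL_ALIASES, its single entry, and lookup_rates are hypothetical and only illustrate the idea; the real names depend on what your provider returns.

```python
# Illustrative alias table (hypothetical): name returned by the API -> PRICING key.
MODEL_ALIASES = {
    "gpt-4o-2024-08-06": "gpt-4o",
}

PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-3.5-sonnet": {"input": 3.00, "output": 15.00},
}


def lookup_rates(model_name):
    """Return the rate entry for model_name, or None when it is unknown."""
    key = MODEL_ALIASES.get(model_name, model_name)
    return PRICING.get(key)


print(lookup_rates("gpt-4o-2024-08-06"))  # → {'input': 2.5, 'output': 10.0}
print(lookup_rates("unknown-model"))      # → None
```

Returning None for unknown models lets the caller decide whether to skip the cost increment or log a warning. An explicit table is used here instead of prefix matching so that a differently priced variant of a model is never silently billed at the wrong rate.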

How to monitor AI agent token usage and cost in real-time using Prometheus counters
