How to monitor AI agent token usage and cost in real-time using Prometheus counters
Prerequisites: a running Prometheus server and an AI API endpoint whose responses include token usage metadata.
What this does
This guide creates Prometheus counter metrics that track total input tokens, output tokens, and estimated API cost for every AI model inference call. The counters accumulate across all agent interactions, enabling real-time cost dashboards and budget alerting. Token counts are read from the usage field of each API response body and converted to monetary values using a per-model pricing table.
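The conversion from token counts to money is simple arithmetic, and a worked example may make the units easier to check. The sketch below assumes prices are quoted in USD per 1,000,000 tokens; the function name cost_in_cents and the prices passed to it are illustrative, not part of the guide's code.

```python
def cost_in_cents(input_tokens, output_tokens, usd_per_million_in, usd_per_million_out):
    """Convert token counts to an estimated cost in cents.

    Assumption: both prices are USD per 1,000,000 tokens.
    """
    usd = (input_tokens * usd_per_million_in
           + output_tokens * usd_per_million_out) / 1_000_000
    return usd * 100  # dollars to cents


# 1,200 input tokens and 300 output tokens at illustrative prices of
# $2.50 and $10.00 per million tokens.
estimate = cost_in_cents(1200, 300, 2.50, 10.00)
print(f"{estimate:.2f} cents")  # prints "0.60 cents"
```

If a provider quotes prices per 1,000 tokens instead, the divisor changes accordingly. Getting this unit wrong shifts every cost figure by a factor of 1,000.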
Steps
1. Install the Prometheus Python client:

       pip install prometheus-client

   Expected output (the version number may differ):

       Successfully installed prometheus-client-0.19.0

2. Define token counters in a new metrics.py module:

       from prometheus_client import Counter, generate_latest, CollectorRegistry

       # A dedicated registry keeps these metrics separate from the default one.
       registry = CollectorRegistry()
       token_input = Counter("ai_token_input_total", "Input tokens consumed", ["model"], registry=registry)
       token_output = Counter("ai_token_output_total", "Output tokens generated", ["model"], registry=registry)
       cost_total = Counter("ai_inference_cost_cents_total", "Estimated cost in cents", ["model"], registry=registry)

3. Hook the counters into the LLM call path. After each API response, extract token counts and increment:

       usage = response.get("usage", {})
       input_tokens = usage.get("prompt_tokens", 0)
       output_tokens = usage.get("completion_tokens", 0)
       token_input.labels(model=model_name).inc(input_tokens)
       token_output.labels(model=model_name).inc(output_tokens)

       # PRICING values are USD per 1,000,000 tokens; multiply by 100 for cents.
       price = PRICING[model_name]
       cost_cents = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000 * 100
       cost_total.labels(model=model_name).inc(cost_cents)

4. Define a per-model pricing dictionary at the top of the module, with prices in USD per 1,000,000 tokens:

       # Check these values against your provider's current price list.
       PRICING = {
           "gpt-4o": {"input": 2.50, "output": 10.00},
           "claude-3.5-sonnet": {"input": 3.00, "output": 15.00},
       }

5. Expose metrics on a dedicated HTTP endpoint. In the FastAPI or Flask app, add a /metrics route (FastAPI shown):

       from fastapi import Response
       from prometheus_client import CONTENT_TYPE_LATEST

       @app.get("/metrics")
       def metrics():
           # Serve the current counter values in the Prometheus text format.
           return Response(content=generate_latest(registry), media_type=CONTENT_TYPE_LATEST)

6. Add the metrics endpoint as a Prometheus scrape target in prometheus.yml:

       scrape_configs:
         - job_name: "ai-agent"
           static_configs:
             - targets: ["localhost:8000"]

7. Reload the Prometheus configuration. The reload endpoint is only available when Prometheus was started with --web.enable-lifecycle:

       curl -X POST http://localhost:9090/-/reload

   Expected result: an HTTP 200 OK response. Add -i to the curl command to see the status line.

8. Verify token counters are being scraped by querying ai_token_input_total in the Prometheus expression browser at http://localhost:9090/graph.
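The per-response bookkeeping described in the steps can be gathered into one helper so that every call site updates the counters the same way. This is a minimal sketch, not the guide's own implementation: the name record_usage and its signature are hypothetical, it assumes prices in USD per 1,000,000 tokens, and it assumes OpenAI-style usage field names. The counters are passed in as arguments, so the function only relies on their labels(...).inc(...) interface.

```python
def record_usage(response, model_name, token_input, token_output, cost_total, pricing):
    """Update token and cost counters for one API response.

    Returns the estimated cost in cents, or None if the model has no
    entry in the pricing table.
    """
    usage = response.get("usage") or {}
    input_tokens = int(usage.get("prompt_tokens") or 0)
    output_tokens = int(usage.get("completion_tokens") or 0)

    # Token counters are updated even when the price is unknown.
    token_input.labels(model=model_name).inc(input_tokens)
    token_output.labels(model=model_name).inc(output_tokens)

    price = pricing.get(model_name)
    if price is None:
        # Unknown model: skip the cost counter instead of raising KeyError.
        return None

    # Assumption: prices are USD per 1,000,000 tokens; result is in cents.
    cost_cents = (input_tokens * price["input"]
                  + output_tokens * price["output"]) / 1_000_000 * 100
    cost_total.labels(model=model_name).inc(cost_cents)
    return cost_cents
```

With the objects defined in metrics.py, a call might look like record_usage(response, model_name, token_input, token_output, cost_total, PRICING). Returning None for an unpriced model is one possible design choice: it keeps the agent running, at the cost of silently under-reporting spend unless the caller logs the case.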
Verification
curl -s http://localhost:8000/metrics | grep "ai_token_input_total"
Expected output: a line like ai_token_input_total{model="gpt-4o"} 12345.0 confirming the counter is exposed.
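The same check can be scripted when a test or health probe needs the value rather than a human reading grep output. The sketch below parses the Prometheus text exposition format with a deliberately simple regular expression; read_counter is a hypothetical helper, and it assumes label values contain no escaped quotes or closing braces.

```python
import re

# Matches lines such as: ai_token_input_total{model="gpt-4o"} 12345.0
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def read_counter(metrics_text, metric_name):
    """Return {model: value} for one labelled counter in the scraped text."""
    values = {}
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP and TYPE comment lines
        match = SAMPLE_LINE.match(line)
        if not match or match.group("name") != metric_name:
            continue
        labels = dict(LABEL.findall(match.group("labels")))
        values[labels.get("model", "")] = float(match.group("value"))
    return values


sample = '''# HELP ai_token_input_total Input tokens consumed
# TYPE ai_token_input_total counter
ai_token_input_total{model="gpt-4o"} 12345.0
ai_token_input_created{model="gpt-4o"} 1.7e+09
'''
print(read_counter(sample, "ai_token_input_total"))  # {'gpt-4o': 12345.0}
```

To run it against the live endpoint, the text can be fetched with urllib.request.urlopen("http://localhost:8000/metrics").read().decode() and passed in place of sample.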
Common failures
- Counters return empty: ensure the LLM response includes a usage field. Test with print(response.get("usage")) to confirm the structure matches expectations. Field names vary by provider: OpenAI-style responses use prompt_tokens and completion_tokens, while Anthropic responses use input_tokens and output_tokens.
- Prometheus scrape target down: verify the agent service is listening on the correct port with ss -tlnp | grep 8000.
- Cost counter stays at zero or the call raises KeyError: confirm the model name returned by the API exactly matches a key in the PRICING dictionary.
- Duplicated metrics on reload: use a module-level CollectorRegistry rather than the default registry to avoid duplicate registration errors.
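Model-name mismatches are easier to diagnose when the lookup is explicit about what it could not find. The following sketch normalizes the name, applies an alias table, and logs a warning when no price exists. The names lookup_price and MODEL_ALIASES are hypothetical, and the single alias entry is only an illustration of a dated model name that an API may return in place of the short one.

```python
import logging

logger = logging.getLogger("ai_metrics")

# Hypothetical alias table: names an API might return, mapped to PRICING keys.
MODEL_ALIASES = {"gpt-4o-2024-08-06": "gpt-4o"}


def lookup_price(model_name, pricing, aliases=MODEL_ALIASES):
    """Return the price entry for a model, or None if it is not priced."""
    key = model_name.strip().lower()
    key = aliases.get(key, key)  # fall back to the name itself
    price = pricing.get(key)
    if price is None:
        logger.warning("No pricing entry for model %r; cost not recorded", model_name)
    return price
```

An explicit alias table is used here instead of prefix matching because a prefix rule could price one model as another whose name merely starts the same way.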