What this does

A near-real-time monitoring dashboard surfaces operational metrics for AI systems: request throughput, latency percentiles, error rates, and token consumption per model. The dashboard helps on-call engineers detect regressions within a few scrape intervals and gives product teams visibility into usage patterns.

Steps

Step 1 — Instrument the application with metrics.

Add a metrics SDK to the application. For Python, use the prometheus-client library (imported as prometheus_client). Emit four core metrics: ai_requests_total (Counter; labels: model, status), ai_request_duration_seconds (Histogram; labels: model), ai_tokens_total (Counter; labels: model, type, where type is prompt or completion), and ai_errors_total (Counter; labels: model, error_type).
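A minimal sketch of the four metrics with prometheus-client. The record_request helper, the status values success and error, the dedicated registry, and the bucket boundaries are assumptions made for illustration, not part of the guide's required setup.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram

# A dedicated registry keeps the example isolated; the library's default
# registry works the same way.
METRICS_REGISTRY = CollectorRegistry()

REQUESTS = Counter(
    "ai_requests_total", "AI requests by model and outcome",
    ["model", "status"], registry=METRICS_REGISTRY,
)
DURATION = Histogram(
    "ai_request_duration_seconds", "End-to-end request latency in seconds",
    ["model"],
    buckets=(0.1, 0.5, 1, 2, 5, 10, 30),  # assumed range; tune to your latencies
    registry=METRICS_REGISTRY,
)
TOKENS = Counter(
    "ai_tokens_total", "Tokens consumed, split by prompt and completion",
    ["model", "type"], registry=METRICS_REGISTRY,
)
ERRORS = Counter(
    "ai_errors_total", "Failed AI requests by error type",
    ["model", "error_type"], registry=METRICS_REGISTRY,
)


def record_request(model, duration_s, prompt_tokens, completion_tokens,
                   error_type=None):
    """Record one finished request (hypothetical helper)."""
    status = "error" if error_type else "success"
    REQUESTS.labels(model=model, status=status).inc()
    DURATION.labels(model=model).observe(duration_s)
    TOKENS.labels(model=model, type="prompt").inc(prompt_tokens)
    TOKENS.labels(model=model, type="completion").inc(completion_tokens)
    if error_type:
        ERRORS.labels(model=model, error_type=error_type).inc()
```

Call record_request once per completed request, for example from a wrapper around the model client, so that every code path updates all four metrics consistently.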

Step 2 — Expose a /metrics endpoint.

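Mount the Prometheus scrape endpoint on an HTTP port (e.g., port 8000). Prometheus polls this endpoint, so it must be reachable from the Prometheus server. It does not have to be unauthenticated: either restrict it to an internal network, or protect it and add matching credentials (basic_auth, authorization, or tls_config) to the scrape job. Verify the endpoint returns metrics in the Prometheus text format by curling it directly.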
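One way to expose and check the endpoint from Python, sketched with prometheus-client's built-in HTTP server. The function names, the single example counter, and the localhost URL are assumptions; fetch_metrics is simply a Python equivalent of the curl check.

```python
import urllib.request

from prometheus_client import (
    CollectorRegistry, Counter, generate_latest, start_http_server,
)

registry = CollectorRegistry()
requests_total = Counter(
    "ai_requests_total", "AI requests by model and outcome",
    ["model", "status"], registry=registry,
)


def render_metrics() -> str:
    """Return the text-format payload that the scrape endpoint serves."""
    return generate_latest(registry).decode("utf-8")


def serve_metrics(port: int = 8000) -> None:
    """Serve the registry over HTTP from a background daemon thread."""
    start_http_server(port, registry=registry)


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> str:
    """Fetch the endpoint the way curl would (the URL is an assumption)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")
```

After calling serve_metrics(), the output of fetch_metrics() should contain # HELP and # TYPE comment lines followed by one sample line per label combination.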

Step 3 — Configure Prometheus to scrape the application.

Add a scrape job to prometheus.yml targeting the application's /metrics endpoint. Set the scrape interval to 15 seconds for near-real-time visibility. Validate the scrape succeeds in the Prometheus targets UI.
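The scrape job and the target check can be sketched as follows. The YAML fragment is held in a Python string only for reference; the job name ai-app, the target localhost:8000, and the Prometheus address localhost:9090 are assumptions. The check reads the /api/v1/targets endpoint of the Prometheus HTTP API, which backs the targets UI.

```python
import json
import urllib.request

# Fragment to merge into prometheus.yml (job name and target are assumptions).
SCRAPE_JOB_YAML = """\
scrape_configs:
  - job_name: ai-app
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:8000"]
"""


def fetch_targets(prometheus_url: str = "http://localhost:9090") -> dict:
    """Download the current target list from the Prometheus HTTP API."""
    with urllib.request.urlopen(f"{prometheus_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def unhealthy_targets(payload: dict, job: str) -> list:
    """Return 'scrapeUrl: lastError' for each active target of `job` that is not up."""
    problems = []
    for target in payload["data"]["activeTargets"]:
        if target["labels"].get("job") != job:
            continue
        if target["health"] != "up":
            problems.append(f'{target["scrapeUrl"]}: {target.get("lastError", "")}')
    return problems
```

An empty list from unhealthy_targets(fetch_targets(), "ai-app") corresponds to a green row in the targets UI.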

Step 4 — Build the dashboard in Grafana or Streamlit.

Option A — Grafana: Add Prometheus as a data source, create a new dashboard, and add a panel for each metric. Write the panel queries in PromQL: rate() for throughput, histogram_quantile(0.95, ...) over the _bucket series for p95 latency, and sum(rate(...)) for token rates.
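Candidate PromQL for each panel, collected in a Python dict so the queries can be tried against the Prometheus HTTP API before pasting them into Grafana. The 5m rate window, the panel keys, and the grouping labels are assumptions; adjust them to the dashboard's needs.

```python
import json
import urllib.parse
import urllib.request

WINDOW = "5m"  # assumed rate window

PANEL_QUERIES = {
    "throughput_rps": f"sum by (model) (rate(ai_requests_total[{WINDOW}]))",
    "p95_latency_seconds": (
        "histogram_quantile(0.95, "
        f"sum by (le, model) (rate(ai_request_duration_seconds_bucket[{WINDOW}])))"
    ),
    "tokens_per_second": f"sum by (model, type) (rate(ai_tokens_total[{WINDOW}]))",
    "error_ratio": (
        f"sum(rate(ai_errors_total[{WINDOW}])) "
        f"/ sum(rate(ai_requests_total[{WINDOW}]))"
    ),
}


def query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def instant_query(base_url: str, promql: str) -> list:
    """Run one instant query and return the result vector."""
    with urllib.request.urlopen(query_url(base_url, promql), timeout=10) as resp:
        body = json.load(resp)
    if body["status"] != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]
```

The le label must survive the aggregation inside histogram_quantile, which is why the p95 query groups by (le, model) rather than by model alone.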

Option B — Streamlit: Query the Prometheus HTTP API every 10 seconds using prometheus-api-client. With a 15-second scrape interval, some refreshes will show no new samples. Render the metrics as line charts with st.line_chart. Use st.metric widgets for current values and st.columns to arrange panels side by side.
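A rough sketch of the Streamlit option for a single throughput panel. It assumes Prometheus at localhost:9090, a 30-minute lookback, and a Streamlit version recent enough to provide st.rerun. The series_by_model helper is hypothetical, and the third-party imports sit inside main() so the helper can be exercised without them.

```python
import time
from datetime import datetime, timedelta

PROMETHEUS_URL = "http://localhost:9090"  # assumption
REFRESH_SECONDS = 10
THROUGHPUT_QUERY = "sum by (model) (rate(ai_requests_total[5m]))"


def series_by_model(result: list) -> dict:
    """Convert a range-query result into {model: [values]} of equal length."""
    raw = {
        series["metric"].get("model", "all"): [float(v) for _, v in series["values"]]
        for series in result
    }
    if not raw:
        return {}
    # Keep only the most recent points that every series has in common.
    n = min(len(values) for values in raw.values())
    return {model: values[-n:] for model, values in raw.items()}


def main() -> None:
    import streamlit as st
    from prometheus_api_client import PrometheusConnect

    prom = PrometheusConnect(url=PROMETHEUS_URL)
    end = datetime.now()
    result = prom.custom_query_range(
        query=THROUGHPUT_QUERY,
        start_time=end - timedelta(minutes=30),
        end_time=end,
        step="15s",
    )
    series = series_by_model(result)

    st.title("AI request throughput")
    if series:
        columns = st.columns(len(series))
        for column, (model, values) in zip(columns, series.items()):
            column.metric(label=f"{model} req/s", value=f"{values[-1]:.2f}")
        st.line_chart(series)
    else:
        st.info("No samples yet.")

    # Wait, then rerun the script so the panels refresh.
    time.sleep(REFRESH_SECONDS)
    st.rerun()
```

To try it, add a call to main() at the end of the file and launch it with streamlit run. Latency and token panels would follow the same pattern with their own queries.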

Step 5 — Add alerting rules.

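Create Prometheus alerting rules that fire when p95 latency exceeds 5 seconds, the error rate (errors divided by total requests) exceeds 5%, or token usage exceeds a daily quota. Prometheus only evaluates the rules; configure Alertmanager to route the resulting alerts to a notification channel (email, Slack, PagerDuty).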
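The three conditions could be written as a rules file along these lines. The fragment is kept in a Python string and written out by a small hypothetical helper. The alert names, the 5m windows and for durations, the severity labels, and the 5,000,000-token quota are all assumptions. Note that increase(...[1d]) measures a rolling 24 hours, not a calendar day.

```python
from pathlib import Path

DAILY_TOKEN_QUOTA = 5_000_000  # hypothetical quota

ALERT_RULES_YAML = f"""\
groups:
  - name: ai-dashboard-alerts
    rules:
      - alert: AIHighP95Latency
        expr: histogram_quantile(0.95, sum by (le) (rate(ai_request_duration_seconds_bucket[5m]))) > 5
        for: 5m
        labels:
          severity: page
        annotations:
          summary: p95 latency above 5 seconds
      - alert: AIHighErrorRate
        expr: sum(rate(ai_errors_total[5m])) / sum(rate(ai_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: error rate above 5 percent
      - alert: AITokenQuotaExceeded
        expr: sum(increase(ai_tokens_total[1d])) > {DAILY_TOKEN_QUOTA}
        labels:
          severity: ticket
        annotations:
          summary: token usage above the daily quota
"""


def write_rules(path: str) -> Path:
    """Write the rules fragment to `path` and return it as a Path."""
    target = Path(path)
    target.write_text(ALERT_RULES_YAML, encoding="utf-8")
    return target
```

List the written file under rule_files in prometheus.yml, and validate it with promtool check rules before reloading Prometheus.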

Step 6 — Validate with synthetic traffic.

Use a load test script to send a fixed rate of requests (e.g., 50 requests/minute) to the AI application. Confirm the dashboard reflects the correct throughput, latency distribution, and token counts within two scrape cycles.
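A small fixed-rate load generator using only the standard library. The target URL and the plain GET request are placeholders (a real AI endpoint will likely need a POST body). The clock and sleep parameters exist so the pacing logic can be checked without waiting.

```python
import time
import urllib.request


def run_fixed_rate(send, requests_per_minute, total_requests,
                   clock=time.monotonic, sleep=time.sleep) -> int:
    """Call send() at a steady rate and return the number of failed calls."""
    interval = 60.0 / requests_per_minute
    start = clock()
    failures = 0
    for i in range(total_requests):
        # Schedule against the start time so slow requests do not cause drift.
        delay = start + i * interval - clock()
        if delay > 0:
            sleep(delay)
        if not send():
            failures += 1
    return failures


def send_once(url: str = "http://localhost:8080/generate") -> bool:
    """Send one request to a hypothetical endpoint; True on a 2xx response."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False
```

For example, run_fixed_rate(send_once, requests_per_minute=50, total_requests=250) sends one request every 1.2 seconds for about five minutes, long enough for a 5m rate window to settle near 0.83 requests per second.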

Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.
Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.
Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.
Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.

Verification

Generate 1,000 requests with 1% random failures. Confirm the dashboard shows approximately 1,000 requests, ~10 errors, and a p95 latency within the expected range.
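Stop the application. Confirm the dashboard shows throughput dropping to zero or to no data, and that the Prometheus targets UI marks the target as down (up == 0). A stopped application cannot increment its own error counter, so do not expect ai_errors_total to spike.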
Check the token counter. Confirm prompt and completion tokens are reported separately and sum to the expected volume.
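To make "approximately" concrete: if failures are independent, the error count over 1,000 requests at a 1% failure rate is binomial with mean 10 and standard deviation of about 3.1. The sketch below computes an acceptance band from that. The three-sigma width and the 1% token tolerance are assumptions, as are the helper names.

```python
import math


def expected_error_band(total_requests: int, failure_rate: float,
                        sigmas: float = 3.0) -> tuple:
    """Return (low, high) bounds for the error count under a binomial model."""
    mean = total_requests * failure_rate
    std = math.sqrt(total_requests * failure_rate * (1 - failure_rate))
    return max(0.0, mean - sigmas * std), mean + sigmas * std


def tokens_consistent(prompt_tokens: float, completion_tokens: float,
                      expected_total: float, rel_tol: float = 0.01) -> bool:
    """Check that prompt + completion tokens match the expected volume."""
    return math.isclose(prompt_tokens + completion_tokens, expected_total,
                        rel_tol=rel_tol)
```

With the defaults, expected_error_band(1000, 0.01) spans roughly 0.6 to 19.4, so an observed error count anywhere from 1 to 19 is consistent with a 1% failure rate.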

Common failures

Cardinality explosion: Adding high-cardinality labels (e.g., user_id or request_id) to counters causes memory pressure in Prometheus. Keep label sets small and static.
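Missing histogram buckets: If latency buckets do not cover the expected range, observations above the largest finite bucket fall into +Inf and histogram_quantile returns the largest finite bound, so p95 flatlines at that value instead of tracking real latency. (NaN indicates that no observations were recorded in the query window.) Set explicit bucket boundaries (e.g., buckets=[0.1, 0.5, 1, 2, 5, 10, 30] in prometheus-client).
Scrape timeout: If the application takes longer than Prometheus's scrape_timeout (10 seconds by default) to respond, the scrape fails, the target is marked down, and the dashboard shows gaps. Ensure the /metrics endpoint responds in under 5 seconds.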
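The bucket-coverage failure is easier to see with a simplified re-implementation of the quantile estimate for classic histograms. This is an illustrative sketch of the interpolation logic, not Prometheus's own code, and it assumes cumulative bucket counts sorted by upper bound with +Inf last.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile lies beyond the largest finite bound: clamp to it.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    previous = buckets[index - 1][1] if index > 0 else 0
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - previous) / (count - previous)


# 100 requests that all took about 20 seconds.
too_narrow = [(0.1, 0), (0.5, 0), (1, 0), (2, 0), (5, 0), (10, 0), (math.inf, 100)]
wide_enough = [(0.1, 0), (0.5, 0), (1, 0), (2, 0), (5, 0), (10, 0), (30, 100),
               (math.inf, 100)]
```

With buckets that stop at 10 seconds, histogram_quantile(0.95, too_narrow) is pinned at 10 even though every request took about 20 seconds. Adding the 30-second bucket moves the estimate into the 10 to 30 range, and the result is NaN only when the +Inf count is zero.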

Related guides

How to Set Up Model Fallback Chains (Local to Cloud) — monitor latency on both local and cloud paths
How to Implement AI Agent Logging and Audit Trails — provides structured log data that complements metrics