How to Build a Real-Time AI Monitoring Dashboard
Grafana or Streamlit, Prometheus/metrics source
What this does
A real-time monitoring dashboard surfaces operational metrics for AI systems: request throughput, latency percentiles, error rates, and token consumption per model. The dashboard helps on-call engineers detect regressions within a few scrape intervals and gives product teams visibility into usage patterns.
Steps
Step 1 — Instrument the application with metrics.
Add a metrics SDK to the application. For Python, use the prometheus-client library. Emit four core metrics: ai_requests_total (Counter, labels: model, status), ai_request_duration_seconds (Histogram, labels: model), ai_tokens_total (Counter, labels: model, type where type is prompt or completion), and ai_errors_total (Counter, labels: model, error_type).
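The four metrics above could be declared and updated roughly as follows. This is a minimal sketch using prometheus-client; the record_call helper, the "ok"/"error" status values, and the assumption that the wrapped call returns a (prompt_tokens, completion_tokens) pair are illustrative and not part of the library.

```python
import time

from prometheus_client import CollectorRegistry, Counter, Histogram

# A dedicated registry keeps the sketch self-contained; the library's
# default global registry works the same way.
REGISTRY = CollectorRegistry()

AI_REQUESTS = Counter(
    "ai_requests_total", "AI requests handled",
    ["model", "status"], registry=REGISTRY,
)
AI_DURATION = Histogram(
    "ai_request_duration_seconds", "AI request latency in seconds",
    ["model"], buckets=(0.1, 0.5, 1, 2, 5, 10, 30), registry=REGISTRY,
)
AI_TOKENS = Counter(
    "ai_tokens_total", "Tokens consumed",
    ["model", "type"], registry=REGISTRY,
)
AI_ERRORS = Counter(
    "ai_errors_total", "Failed AI requests",
    ["model", "error_type"], registry=REGISTRY,
)


def record_call(model, call):
    """Run call() and record metrics around it.

    call is a hypothetical function that performs one model request and
    returns (prompt_tokens, completion_tokens).
    """
    start = time.perf_counter()
    try:
        prompt_tokens, completion_tokens = call()
    except Exception as exc:
        AI_REQUESTS.labels(model=model, status="error").inc()
        AI_ERRORS.labels(model=model, error_type=type(exc).__name__).inc()
        raise
    finally:
        # Latency is observed for successful and failed calls alike.
        AI_DURATION.labels(model=model).observe(time.perf_counter() - start)
    AI_REQUESTS.labels(model=model, status="ok").inc()
    AI_TOKENS.labels(model=model, type="prompt").inc(prompt_tokens)
    AI_TOKENS.labels(model=model, type="completion").inc(completion_tokens)
    return prompt_tokens, completion_tokens
```

Using the exception class name as error_type keeps that label bounded as long as the set of exception types is small.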
Step 2 — Expose a /metrics endpoint.
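Mount the Prometheus scrape endpoint on an HTTP port (e.g., port 8000). Prometheus polls this endpoint, so it must be reachable from the Prometheus server. If the endpoint is protected, configure matching credentials (basic auth, bearer token, or TLS) in the scrape job rather than exposing it unauthenticated on a public interface. Verify the endpoint returns metrics in the Prometheus text format by curling it directly.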
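Beyond eyeballing the curl output, a small script can confirm that the four core metrics are declared in the response. The sketch below uses only the standard library; the URL is an assumption based on the port 8000 example, and the helper names are hypothetical.

```python
import urllib.request

REQUIRED = (
    "ai_requests_total",
    "ai_request_duration_seconds",
    "ai_tokens_total",
    "ai_errors_total",
)


def declared_metrics(exposition_text):
    """Return metric names declared on '# TYPE <name> <type>' lines."""
    names = set()
    for line in exposition_text.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[0] == "#" and parts[1] == "TYPE":
            names.add(parts[2])
    return names


def missing_metrics(exposition_text, required=REQUIRED):
    """List required metrics that the endpoint does not declare."""
    declared = declared_metrics(exposition_text)
    missing = []
    for name in required:
        # Some exposition formats declare counters without the _total suffix.
        base = name[: -len("_total")] if name.endswith("_total") else name
        if name not in declared and base not in declared:
            missing.append(name)
    return missing


def fetch_metrics(url="http://localhost:8000/metrics", timeout=5):
    """Fetch the raw exposition text (the URL is an assumed local address)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling missing_metrics(fetch_metrics()) against a running application should return an empty list once all four metrics are registered.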
Step 3 — Configure Prometheus to scrape the application.
Add a scrape job to prometheus.yml targeting the application's /metrics endpoint. Set the scrape interval to 15 seconds for near-real-time visibility. Validate the scrape succeeds in the Prometheus targets UI.
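To make the shape of the scrape job concrete, the following sketch renders a prometheus.yml fragment as text. The job name ai-app and the target localhost:8000 are placeholder assumptions; the keys (scrape_configs, job_name, scrape_interval, metrics_path, static_configs, targets) are standard Prometheus configuration fields.

```python
def render_scrape_job(job_name, target, interval="15s", metrics_path="/metrics"):
    """Return a prometheus.yml fragment containing one static scrape job."""
    return (
        "scrape_configs:\n"
        f"  - job_name: {job_name}\n"
        f"    scrape_interval: {interval}\n"
        f"    metrics_path: {metrics_path}\n"
        "    static_configs:\n"
        f"      - targets: ['{target}']\n"
    )


# Placeholder job name and target; adjust both to the real deployment.
print(render_scrape_job("ai-app", "localhost:8000"))
```

If prometheus.yml already has a scrape_configs key, only the list item belongs under it; a second top-level scrape_configs key would make the file invalid.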
Step 4 — Build the dashboard in Grafana or Streamlit.
Option A — Grafana: Add Prometheus as a data source, create a new dashboard, and add panels for each metric. Use PromQL's rate() function for throughput, histogram_quantile(0.95, ...) for p95 latency, and sum(rate(...)) for token rates.
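The panel queries for Option A might look like the PromQL strings below, collected in a Python mapping so they can be reused by scripts. The 5-minute rate window and the panel titles are assumptions; the metric names come from Step 1.

```python
WINDOW = "5m"  # assumed rate window; it should span several scrape intervals

PANEL_QUERIES = {
    "Throughput (req/s) by model":
        f"sum by (model) (rate(ai_requests_total[{WINDOW}]))",
    "p95 latency (s) by model":
        "histogram_quantile(0.95, sum by (le, model) "
        f"(rate(ai_request_duration_seconds_bucket[{WINDOW}])))",
    "Token rate (tokens/s) by model and type":
        f"sum by (model, type) (rate(ai_tokens_total[{WINDOW}]))",
    "Error ratio":
        f"sum(rate(ai_errors_total[{WINDOW}])) "
        f"/ sum(rate(ai_requests_total[{WINDOW}]))",
}

for title, query in PANEL_QUERIES.items():
    print(f"{title}: {query}")
```

The p95 query keeps the le label inside the aggregation because histogram_quantile needs the bucket boundaries to interpolate.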
Option B — Streamlit: Query Prometheus via the REST API every 10 seconds using prometheus-api-client. Render the metrics as line charts with st.line_chart. Use st.metric widgets for current values and st.columns to arrange panels side by side.
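A minimal sketch of Option B follows. To keep it dependency-light it calls the Prometheus HTTP API (/api/v1/query_range) through the standard library instead of prometheus-api-client, which is a deliberate substitution. The Prometheus address, the query, and the helper names are assumptions, and st.rerun requires a reasonably recent Streamlit release.

```python
import json
import time
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumed Prometheus address
THROUGHPUT_QUERY = "sum(rate(ai_requests_total[5m]))"


def series_values(payload):
    """Extract the float values of the first series in a query_range response."""
    result = payload["data"]["result"]
    if not result:
        return []
    return [float(value) for _timestamp, value in result[0]["values"]]


def query_range(query, minutes=15, step="15s"):
    """Fetch the last `minutes` of data for a PromQL query."""
    end = time.time()
    params = urllib.parse.urlencode(
        {"query": query, "start": end - minutes * 60, "end": end, "step": step}
    )
    url = f"{PROMETHEUS_URL}/api/v1/query_range?{params}"
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def render():
    """Draw one metric widget and one chart, then refresh after 10 seconds."""
    import streamlit as st  # imported lazily so the helpers work without it

    st.title("AI monitoring")
    values = series_values(query_range(THROUGHPUT_QUERY))
    current, chart = st.columns(2)
    current.metric("Requests/s", f"{values[-1]:.2f}" if values else "n/a")
    chart.line_chart({"requests_per_second": values})
    time.sleep(10)
    st.rerun()


# Uncomment when launching with: streamlit run <this file>
# render()
```

The same pattern extends to latency and token panels by adding more queries and columns.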
Step 5 — Add alerting rules.
Create Prometheus alerting rules that fire when p95 latency exceeds 5 seconds, error rate exceeds 5%, or token usage exceeds a daily quota. Route alerts through Alertmanager to a notification channel (email, Slack, PagerDuty).
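The three conditions can be expressed as PromQL and rendered into a rule file, as in the sketch below. The alert names, the group name, the 5-minute hold period, and the token quota value are hypothetical; the rule-file keys (groups, name, rules, alert, expr, for) are the standard Prometheus ones.

```python
DAILY_TOKEN_QUOTA = 5_000_000  # hypothetical quota; replace with the real limit

ALERTS = [
    (
        "AIHighP95Latency",
        "histogram_quantile(0.95, sum by (le) "
        "(rate(ai_request_duration_seconds_bucket[5m]))) > 5",
    ),
    (
        "AIHighErrorRate",
        "sum(rate(ai_errors_total[5m])) / sum(rate(ai_requests_total[5m])) > 0.05",
    ),
    (
        # increase(...[1d]) covers a rolling 24 hours, not a calendar day.
        "AITokenQuotaExceeded",
        f"sum(increase(ai_tokens_total[1d])) > {DAILY_TOKEN_QUOTA}",
    ),
]


def render_rules(alerts=ALERTS, hold="5m"):
    """Return the text of a Prometheus rule file with one rule per alert."""
    lines = ["groups:", "  - name: ai-monitoring", "    rules:"]
    for name, expr in alerts:
        lines += [
            f"      - alert: {name}",
            f"        expr: {expr}",
            f"        for: {hold}",
        ]
    return "\n".join(lines) + "\n"


print(render_rules())
```

The "for" clause delays firing until the condition has held for the whole period, which reduces noise from short spikes.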
Step 6 — Validate with synthetic traffic.
Use a load test script to send a fixed rate of requests (e.g., 50 requests/minute) to the AI application. Confirm the dashboard reflects the correct throughput, latency distribution, and token counts within two scrape cycles.
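A fixed-rate load script can be as small as the sketch below, which paces calls so that 50 requests/minute means one request every 1.2 seconds. The target URL is a hypothetical endpoint, and the sleep and clock parameters exist only so the pacing can be exercised without waiting.

```python
import time
import urllib.request


def send_request(url="http://localhost:8080/generate"):  # hypothetical endpoint
    """Send one request to the AI application and return the HTTP status."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.status


def run_load(send, rate_per_minute=50, total=100,
             sleep=time.sleep, clock=time.monotonic):
    """Call send() `total` times, paced at `rate_per_minute`."""
    interval = 60.0 / rate_per_minute
    outcomes = {"ok": 0, "failed": 0}
    next_at = clock()
    for _ in range(total):
        delay = next_at - clock()
        if delay > 0:
            sleep(delay)
        try:
            send()
            outcomes["ok"] += 1
        except Exception:
            outcomes["failed"] += 1
        # Scheduling against the planned time avoids drift from slow requests.
        next_at += interval
    return outcomes
```

Running run_load(send_request) for a few minutes should produce a throughput panel that settles near 50/60, or about 0.83 requests per second.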
Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.
Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.
Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.
Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
Verification
- Generate 1,000 requests with 1% random failures. Confirm the dashboard shows approximately 1,000 requests, ~10 errors, and a p95 latency within the expected range.
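- Stop the application. Confirm the throughput panels drop to zero or show no data and that Prometheus reports the scrape target as down (the up metric for the job becomes 0). The application's own error counter cannot increase while the process is stopped.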
- Check the token counter. Confirm prompt and completion tokens are reported separately and sum to the expected volume.
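How close to 10 the error count should land depends on chance when failures are injected randomly. As a rough guide, the sketch below treats each request as an independent trial (a binomial model, which is an assumption about how the load script injects failures) and reports the range within two standard deviations of the mean.

```python
import math


def expected_error_range(total_requests, failure_rate, z=2.0):
    """Return (low, high) error counts within z standard deviations of the mean."""
    mean = total_requests * failure_rate
    std_dev = math.sqrt(total_requests * failure_rate * (1 - failure_rate))
    low = max(0, math.floor(mean - z * std_dev))
    high = min(total_requests, math.ceil(mean + z * std_dev))
    return low, high


print(expected_error_range(1000, 0.01))  # prints (3, 17)
```

For 1,000 requests at 1% the standard deviation is about 3.1, so a dashboard reading anywhere from 3 to 17 errors is consistent with the test setup.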
Common failures
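- Cardinality explosion: Adding high-cardinality labels (e.g., user_id or request_id) to any metric creates a new time series for every label value and causes memory pressure in Prometheus. Keep label sets small and bounded.
- Missing histogram buckets: If latency buckets do not cover the expected range, histogram_quantile cannot resolve values above the highest finite bucket and reports that bucket's upper bound instead of the true latency. It returns NaN only when the histogram has no observations in the query window. Set explicit bucket boundaries (e.g., buckets: [0.1, 0.5, 1, 2, 5, 10, 30]).
- Scrape timeout: If the /metrics endpoint takes longer than Prometheus's scrape_timeout (10 seconds by default) to respond, the scrape fails, the target is marked down, and the samples for that interval are missing. Ensure the /metrics endpoint responds in under 5 seconds.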
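The bucket-coverage failure is easier to see with numbers. The function below is a simplified model of how histogram_quantile interpolates inside cumulative buckets; it is a sketch for intuition, not the Prometheus implementation, and it omits several edge cases the real function handles.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    bound and ending with (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies above every finite bucket, so the
                # estimate is capped at the highest finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# 100 requests that all took between 2 and 5 seconds.
too_narrow = [(0.1, 0), (0.5, 0), (1.0, 0), (math.inf, 100)]
covering = [(0.1, 0), (0.5, 0), (1.0, 0), (2.0, 0), (5.0, 100),
            (10.0, 100), (30.0, 100), (math.inf, 100)]
print(estimate_quantile(0.95, too_narrow))
print(estimate_quantile(0.95, covering))
```

With buckets that stop at 1 second the estimate is pinned to 1.0 even though every request was slower, while the wider bucket set places p95 between 2 and 5 seconds.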
Related guides
- How to Set Up Model Fallback Chains (Local to Cloud) — monitor latency on both local and cloud paths
- How to Implement AI Agent Logging and Audit Trails — provides structured log data that complements metrics