Model Monitoring

Model monitoring continuously tracks the health and performance of deployed ML models by measuring: (1) prediction quality: accuracy/F1 on labeled feedback (when available, on 0.1-5% of traffic), proxy metrics (prediction confidence, entropy), and human review rates; (2) data quality and drift: feature distribution changes measured by the Population Stability Index (PSI; >0.1 warning, >0.25 alert), missing-value rates (>5% triggers investigation), and schema violations; and (3) operational metrics: latency (p50, p95, p99), throughput (requests/second), error rates, and GPU memory utilization. Without monitoring, model degradation from data drift can go undetected for 3-8 weeks, during which a model making
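The PSI and missing-value checks mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (`psi`, `drift_status`, `missing_rate`), the choice of 10 quantile bins, and the `eps` smoothing constant are assumptions made for the example; only the 0.1 / 0.25 thresholds come from the text.

```python
import math
from bisect import bisect_right


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a baseline sample and a live sample."""
    # Bin edges are quantiles of the baseline (expected) sample.
    ordered = sorted(expected)
    edges = [ordered[int(len(ordered) * i / n_bins)] for i in range(1, n_bins)]

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # eps (an assumed smoothing constant) avoids log(0) for empty bins.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


def drift_status(value, warn=0.1, alert=0.25):
    """Map a PSI value to the warning/alert levels quoted in the text."""
    if value > alert:
        return "alert"
    if value > warn:
        return "warning"
    return "ok"


def missing_rate(values):
    """Fraction of entries that are None (missing)."""
    return sum(v is None for v in values) / len(values)


# Baseline feature spread over [0, 1); live feature squeezed into [0.5, 1).
baseline = [i / 1000 for i in range(1000)]
live = [0.5 + i / 2000 for i in range(1000)]
print(drift_status(psi(baseline, live)))
```

Binning on baseline quantiles keeps every expected bin populated, so the index is driven by how the live sample moves rather than by sparse baseline bins.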

Model monitoring tracks production model health — is the model still working as expected? Key signals: prediction drift (are outputs changing?), data drift (are inputs changing?), performance degradation (does accuracy still meet requirements?), and operational metrics (latency, throughput, error rate).
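One label-free way to watch for prediction drift is to compare the average entropy of the model's output probabilities between a baseline window and a recent window. The sketch below assumes a classifier that returns per-class probabilities; the function names and the example batches are hypothetical, and what counts as a meaningful shift depends on the model.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy(batch):
    """Average prediction entropy over a batch of probability vectors."""
    return sum(entropy(p) for p in batch) / len(batch)


def entropy_shift(baseline_batch, live_batch):
    """Positive values mean the model has become less certain on live traffic."""
    return mean_entropy(live_batch) - mean_entropy(baseline_batch)


# Hypothetical batches: confident predictions at baseline, hesitant ones live.
baseline_batch = [[0.95, 0.05], [0.9, 0.1], [0.02, 0.98]]
live_batch = [[0.6, 0.4], [0.55, 0.45], [0.5, 0.5]]
print(round(entropy_shift(baseline_batch, live_batch), 3))
```

A rising mean entropy does not prove accuracy has dropped, but it is a cheap signal for deciding when to pull labeled samples and check performance directly.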

Model monitoring dashboard:
(1) Latency: p50, p95, p99 per endpoint. Is the model getting slower?
(2) Throughput: requests/second, tokens/second. Is demand exceeding capacity?
(3) Error rate: failed requests / total requests. Is the model crashing?
(4) Data drift: input distribution vs. training distribution. Has the world changed?
(5) Output drift: prediction distribution over time. Is the model's behavior changing?
(6) Alerts: if any metric crosses its threshold, notify on-call.
(7) For LLMs: also monitor refusal rate, hallucination rate, and toxicity score.
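The latency, error-rate, and alerting items above can be sketched as a small summary-and-check routine. This is an illustrative example only: the metric names, the nearest-rank percentile method, and the threshold values are assumptions, and a production setup would normally read these numbers from a metrics system rather than from in-memory lists.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, failed, total):
    """Collect the latency percentiles and error rate for one endpoint."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "error_rate": failed / total if total else 0.0,
    }


def breached(metrics, thresholds):
    """Names of metrics that exceed their threshold; these would page on-call."""
    return sorted(
        name for name, limit in thresholds.items()
        if metrics.get(name, 0) > limit
    )


# Hypothetical window: 100 requests with latencies of 1..100 ms, 3 failures.
metrics = summarize(list(range(1, 101)), failed=3, total=100)
# Hypothetical thresholds chosen only for this example.
thresholds = {"p99_ms": 90, "error_rate": 0.05}
print(breached(metrics, thresholds))
```

Keeping thresholds in a plain mapping makes it easy to add the drift and LLM-specific metrics from the list without changing the check itself.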

Reviewed by Fredoline Eruo.
