15. Model Drift

Chapter 15 of 24 · 20 min

KEY INSIGHT

Model drift, the decay of predictive performance over time, is distinct from data drift. Even with stable input distributions, a model can degrade through concept drift, adversarial adaptation, and stochastic degradation. Monitoring actual prediction quality is essential, not optional.

### Concept Drift Versus Data Drift

Data drift describes changes in input distributions. Concept drift describes changes in the relationship between inputs and outputs. Your model learns an approximation of `P(Y|X)`. Concept drift occurs when the true underlying `P(Y|X)` evolves even if `P(X)` remains stable.

A spam classifier trained in 2024 learns patterns that worked against the spammer techniques of 2023. By 2025, spammers have adapted. The model sees the same distribution of email features but must now map them to different spam/no-spam labels. Concept drift is occurring despite stable input distributions.

### Measuring Model Performance Drift

Without ground truth labels, measuring model drift requires proxy signals:

```python
# Python: Proxy-based model drift detection
from collections import deque
from dataclasses import dataclass

import numpy as np


@dataclass
class PredictionRecord:
    """Compact record for prediction logging."""
    timestamp: int
    prediction: float
    confidence: float  # Model's confidence score
    input_hash: str    # For deduplication


class ModelDriftMonitor:
    """
    Monitors model behavior over time using proxy signals.
    Requires: confidence scores and prediction values.
    """

    def __init__(self, window_size: int = 1000):
        self.window_size = window_size
        self.predictions = deque(maxlen=window_size)
        self.confidences = deque(maxlen=window_size)
        # In practice: you'd persist these windows to disk

    def record(self, prediction: float, confidence: float):
        """Record a prediction and its confidence score."""
        self.predictions.append(prediction)
        self.confidences.append(confidence)

    def compute_drift_metrics(self, reference_confidence: float) -> dict:
        """
        Compute drift indicators from prediction behavior.
        Compares the recent window against a known-good reference confidence.
        """
        if len(self.confidences) < 100:
            return {"status": "insufficient_data"}

        recent_confidence = float(np.mean(list(self.confidences)))
        # Positive shift means recent confidence is below the reference
        confidence_shift = reference_confidence - recent_confidence

        # Summary statistics of the recent prediction distribution
        recent_mean = float(np.mean(list(self.predictions)))
        recent_std = float(np.std(list(self.predictions)))

        return {
            "confidence_degradation": confidence_shift,
            "confidence_degraded": confidence_shift > 0.1,
            "prediction_mean": recent_mean,
            "prediction_std": recent_std,
            "data_window": len(self.confidences),
        }
```

### Calibration Drift

Model calibration, meaning how well confidence scores match actual accuracy, can degrade over time even when raw accuracy remains stable. A well-calibrated model is one where 90% confidence corresponds to 90% actual accuracy. Over time, distribution shifts cause calibration to drift.

Track calibration curves periodically. If your model's 0.9 confidence bucket corresponds to only 0.7 actual accuracy, user-facing confidence scores require post-hoc calibration adjustment.

### Ground Truth-Based Monitoring

Where ground truth labels become available (user corrections, outcome events, delayed feedback), build explicit performance monitoring:

```python
# Python: Delayed ground truth monitoring
from collections import deque
from dataclasses import dataclass

import numpy as np


@dataclass
class LabeledObservation:
    prediction: float
    ground_truth: float
    timestamp: int
    lag_days: int  # days between prediction and label availability


class GroundTruthMonitor:
    """
    Monitor model performance where ground truth eventually becomes available.
    Example: recommendation model sees clicks days later; loan model sees
    defaults months later.
    """

    def __init__(self, evaluation_window_min: int = 500):
        self.pending = deque()  # (prediction, feature_vector, timestamp)
        self.ground_truth = deque(maxlen=10000)  # Matched labeled observations
        self.evaluation_window_min = evaluation_window_min

    def store_pending(self, prediction: float, features: np.ndarray, timestamp: int):
        """Store prediction awaiting eventual ground truth."""
        self.pending.append((prediction, features, timestamp))

    def receive_ground_truth(self, features: np.ndarray, ground_truth: float):
        """Match ground truth to stored predictions and store for evaluation."""
        # Simplification: labels are matched to the oldest pending prediction (FIFO).
        # In production: implement proper matching (e.g., by feature similarity or ID)
        if self.pending:
            prediction, _features, timestamp = self.pending.popleft()
            self.ground_truth.append(LabeledObservation(
                prediction=prediction,
                ground_truth=ground_truth,
                timestamp=timestamp,
                lag_days=0,  # Placeholder: calculate from the label's arrival time
            ))

    def compute_metrics(self) -> dict:
        """Compute current performance metrics from labeled observations."""
        if len(self.ground_truth) < self.evaluation_window_min:
            return {"status": "insufficient_labels", "n": len(self.ground_truth)}

        recent = list(self.ground_truth)[-self.evaluation_window_min:]
        predictions = np.array([o.prediction for o in recent])
        truths = np.array([o.ground_truth for o in recent])

        # Mean Absolute Error for regression
        mae = np.mean(np.abs(predictions - truths))

        # For classification: accuracy, AUC, etc.
        # Simplified example: binary classification accuracy at 0.5 threshold
        binary_preds = (predictions > 0.5).astype(int)
        binary_truths = (truths > 0.5).astype(int)
        accuracy = np.mean(binary_preds == binary_truths)

        return {
            "mae": float(mae),
            "accuracy": float(accuracy),
            "n_evaluations": len(recent),
            "drifted": bool(accuracy < 0.85),  # Trigger threshold
        }
```

### Response Strategies

Detecting model drift without responding to it is just expensive observation. Decide on your drift response protocol in advance: automatic retraining, rollback to a simpler fallback model, or escalation to human review once a threshold is crossed.
The protocol depends on your operational context and acceptable downtime.
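
One way to make such a protocol explicit is to map the size of an accuracy drop onto a response tier. The sketch below is illustrative only: the `DriftAction` enum, the `choose_drift_response` function, and all three drop thresholds are hypothetical names and values, not recommendations, and would need tuning to your own operational context.

```python
# Python: Tiered drift response (illustrative sketch, hypothetical thresholds)
from enum import Enum


class DriftAction(Enum):
    NONE = "none"
    HUMAN_REVIEW = "human_review"
    RETRAIN = "retrain"
    ROLLBACK = "rollback"


def choose_drift_response(
    accuracy: float,
    baseline_accuracy: float,
    review_drop: float = 0.02,    # assumed: small drop, ask a human to look
    retrain_drop: float = 0.05,   # assumed: moderate drop, trigger retraining
    rollback_drop: float = 0.15,  # assumed: severe drop, switch to fallback model
) -> DriftAction:
    """Map an accuracy drop relative to baseline onto a response tier."""
    drop = baseline_accuracy - accuracy
    # Check the most severe tier first so larger drops are never downgraded
    if drop >= rollback_drop:
        return DriftAction.ROLLBACK
    if drop >= retrain_drop:
        return DriftAction.RETRAIN
    if drop >= review_drop:
        return DriftAction.HUMAN_REVIEW
    return DriftAction.NONE


if __name__ == "__main__":
    for current in (0.90, 0.88, 0.84, 0.70):
        print(current, choose_drift_response(current, baseline_accuracy=0.91).value)
```

Ordering the checks from most to least severe keeps the tiers mutually exclusive. A system with low tolerance for downtime might prefer rollback at a much smaller drop, while one with expensive retraining might widen the human review band.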

EXERCISE

Implement a confidence-tracking system that logs prediction confidence scores over time. Compute weekly averages and identify sustained confidence degradation (3+ consecutive weeks below a threshold). Establish a baseline from the initial deployment and validate your detection against synthetic drift scenarios.
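
A possible starting point for the exercise, not a reference solution. It assumes timestamps are Unix seconds and that weeks are fixed 7-day buckets counted from the epoch. The function names `weekly_confidence_means` and `sustained_degradation`, and the choice of "baseline minus 0.1" as the threshold, are hypothetical.

```python
# Python: Weekly confidence tracking (starter sketch, assumes Unix-second timestamps)
from collections import defaultdict
from statistics import mean

SECONDS_PER_WEEK = 7 * 24 * 3600


def weekly_confidence_means(records):
    """Group (timestamp, confidence) pairs into week buckets and average each."""
    buckets = defaultdict(list)
    for timestamp, confidence in records:
        buckets[timestamp // SECONDS_PER_WEEK].append(confidence)
    return [(week, mean(values)) for week, values in sorted(buckets.items())]


def sustained_degradation(weekly_means, threshold, min_weeks=3):
    """Return True if min_weeks adjacent weeks in a row fall below threshold."""
    streak = 0
    previous_week = None
    for week, value in weekly_means:
        adjacent = previous_week is not None and week == previous_week + 1
        if value < threshold:
            # A gap in the data (a missing week) restarts the streak
            streak = streak + 1 if (adjacent and streak > 0) else 1
        else:
            streak = 0
        if streak >= min_weeks:
            return True
        previous_week = week
    return False


if __name__ == "__main__":
    # Synthetic drift scenario: three healthy weeks, then three degraded weeks
    records = [
        (week * SECONDS_PER_WEEK + i, 0.9 if week < 3 else 0.7)
        for week in range(6)
        for i in range(10)
    ]
    weekly = weekly_confidence_means(records)
    baseline = weekly[0][1]  # first week after deployment
    print(sustained_degradation(weekly, threshold=baseline - 0.1))
```

To complete the exercise, extend the synthetic scenarios to include gradual decline, a single bad week, and missing weeks, and check that detection fires only on the sustained case.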