14. Data Drift

Chapter 14 of 24 · 20 min

KEY INSIGHT

Data drift occurs when the statistical distribution of your input features changes over time. Detecting it is the foundation of proactive model retraining: it catches distribution shifts before they cascade into degraded prediction quality.

### The Mechanics of Feature Drift

Every model makes an implicit assumption: the future will resemble the past. This assumption lives in the training data distribution. When users generate data that diverges from this distribution, predictions suffer.

Consider a local AI system screening support tickets. Over months, your product evolves. New features attract different user segments. Query language shifts as cultural references change. The ticket-category distribution you trained on no longer matches reality, and your model increasingly sees out-of-distribution inputs it cannot reliably process.

### Detection Implementation

Memory-efficient feature drift detection for edge deployment requires careful resource management. The detector below keeps one fixed-size rolling buffer per feature plus the baseline mean and standard deviation. It does not store reference samples, so instead of a full two-sample KS test it scores how far each window mean has moved from the baseline mean, in units of the baseline standard deviation. That catches large location shifts but not changes in variance or shape.

```python
# Python: Efficient feature drift detection for edge deployment
from collections import deque

import numpy as np


class FeatureDriftDetector:
    """
    Monitors individual feature distributions for large mean shifts.
    Designed for resource-constrained edge deployment.
    """

    def __init__(self, n_features: int, window_size: int = 500, z_threshold: float = 3.0):
        self.n_features = n_features
        self.window_size = window_size
        self.z_threshold = z_threshold  # Flag shifts beyond this many baseline stds
        # Rolling buffers per feature (bounded memory via deque maxlen)
        self.buffers = [deque(maxlen=window_size) for _ in range(n_features)]
        self.reference_means = None
        self.reference_stds = None
        self.drift_counts = 0  # Number of assessments that flagged drift

    def capture_baseline(self, baseline_data: np.ndarray):
        """Capture statistical baseline from training or known-good data."""
        self.reference_means = np.mean(baseline_data, axis=0)
        self.reference_stds = np.std(baseline_data, axis=0)
        # Pre-populate buffers with baseline for warm start. Until window_size
        # live observations arrive, the window still holds baseline rows,
        # which dampens early drift scores.
        for row in baseline_data[: self.window_size]:
            for feat_idx in range(self.n_features):
                self.buffers[feat_idx].append(row[feat_idx])

    def ingest(self, features: np.ndarray):
        """Ingest a single observation's features."""
        if features.shape[0] != self.n_features:
            raise ValueError(f"Expected {self.n_features} features, got {features.shape[0]}")
        for feat_idx, value in enumerate(features):
            self.buffers[feat_idx].append(value)

    def assess(self) -> dict:
        """
        Assess drift across all features using a mean-shift z-score.
        Returns dict with drift status per feature and overall status.
        """
        if self.reference_means is None:
            return {"drifted": False, "error": "No baseline established"}
        results = {"drifted": False, "features": {}}
        for feat_idx in range(self.n_features):
            feature_data = np.array(self.buffers[feat_idx])
            current_mean = np.mean(feature_data)
            # Simplified check: distance of the window mean from the baseline
            # mean, in units of the baseline standard deviation. A full
            # two-sample KS test (e.g. scipy.stats.ks_2samp) would need
            # stored reference samples.
            drift_score = abs(current_mean - self.reference_means[feat_idx]) / (
                self.reference_stds[feat_idx] + 1e-8
            )
            feature_drifted = bool(drift_score > self.z_threshold)  # 3-sigma rule by default
            results["features"][feat_idx] = {
                "drifted": feature_drifted,
                "score": float(drift_score),
            }
            if feature_drifted:
                results["drifted"] = True
        if results["drifted"]:
            self.drift_counts += 1
        return results
```

### Categorical Feature Drift

Numerical features yield to statistical tests. Categorical features require different treatment. Monitor category frequency distributions, alert on emerging categories with zero training frequency, and track categories that disappear from live traffic. The function below uses Total Variation Distance: half the sum of absolute differences between the two category probability vectors, ranging from 0 (identical distributions) to 1 (no shared categories).

```python
# Python: Categorical distribution drift detection
from collections import Counter


def categorical_drift_score(
    current: Counter,
    reference: Counter,
    total_current: int,
    total_reference: int,
) -> dict:
    """
    Compute drift metrics for categorical features.
    Uses Total Variation Distance as primary metric.
    """
    if total_current <= 0 or total_reference <= 0:
        raise ValueError("total_current and total_reference must be positive")
    # Get union of all categories
    all_categories = set(current.keys()) | set(reference.keys())
    # Compute probability distributions
    current_probs = {cat: current.get(cat, 0) / total_current for cat in all_categories}
    ref_probs = {cat: reference.get(cat, 0) / total_reference for cat in all_categories}
    # Total Variation Distance
    tvd = 0.5 * sum(abs(current_probs[cat] - ref_probs[cat]) for cat in all_categories)
    # Flag novel categories (in current but not in reference)
    novel = set(current.keys()) - set(reference.keys())
    # Flag atrophied categories (in reference but not current)
    atrophied = set(reference.keys()) - set(current.keys())
    return {
        "tvd": tvd,
        "max_tvd": 1.0,  # Normalized scale
        "novel_categories": list(novel),
        "atrophied_categories": list(atrophied),
        "drifted": tvd > 0.1 or len(novel) > 0 or len(atrophied) > 0,
    }
```

### Operational Implications

Data drift detection without automated response is observation without action. Build alerting thresholds that trigger retraining workflows in your MLOps pipeline. Distinguish minor fluctuations (expected noise) from systematic shifts (requiring intervention).

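One way to separate expected noise from a systematic shift is to act only when drift persists across several consecutive assessments. The sketch below is an illustration, not a prescribed design: the class name `DriftAlertPolicy`, the `patience` parameter, and the `"ok"`/`"warn"`/`"retrain"` labels are all hypothetical. It assumes only that each assessment is a dict with a boolean `"drifted"` key, as the detection functions in this chapter return.

```python
from collections import deque


class DriftAlertPolicy:
    """Hypothetical debounce policy: escalate only on persistent drift."""

    def __init__(self, patience: int = 3):
        if patience < 1:
            raise ValueError("patience must be at least 1")
        self.patience = patience
        # Keep only the most recent `patience` drift flags
        self.history = deque(maxlen=patience)

    def update(self, assessment: dict) -> str:
        """Record one assessment and return 'ok', 'warn', or 'retrain'."""
        self.history.append(bool(assessment.get("drifted", False)))
        # Systematic shift: every one of the last `patience` checks drifted
        if len(self.history) == self.patience and all(self.history):
            return "retrain"
        # Isolated drift flag: log it, but treat it as possible noise
        if self.history[-1]:
            return "warn"
        return "ok"


policy = DriftAlertPolicy(patience=3)
stream = [{"drifted": d} for d in (False, True, False, True, True, True)]
actions = [policy.update(a) for a in stream]
print(actions)  # ['ok', 'warn', 'ok', 'warn', 'warn', 'retrain']
```

A single drifted window only produces a warning; the retraining signal fires once three assessments in a row agree. The right `patience` depends on how often you assess and how costly a retraining run is.
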
EXERCISE

Collect a baseline dataset from your model's initial serving period. Store per-feature statistics. Build a scheduled task that samples current traffic, computes drift scores, and emits logs when scores exceed thresholds. Include both numerical and categorical feature handling.
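
As a possible starting point for the "store per-feature statistics" step, the sketch below summarizes a baseline into a JSON-serializable dict. It is only one way to structure this: the function name `build_baseline`, the dict layout, and the feature names `ticket_length` and `category` are assumptions made for illustration. The scheduling, traffic sampling, and threshold logic are left to you.

```python
import json
import statistics
from collections import Counter


def build_baseline(numeric_columns: dict, categorical_columns: dict) -> dict:
    """Summarize baseline data into a JSON-serializable dict (hypothetical layout)."""
    baseline = {"numeric": {}, "categorical": {}}
    for name, values in numeric_columns.items():
        baseline["numeric"][name] = {
            "mean": statistics.fmean(values),
            "std": statistics.pstdev(values),  # population std, matching np.std's default
            "n": len(values),
        }
    for name, values in categorical_columns.items():
        baseline["categorical"][name] = {
            "counts": dict(Counter(values)),
            "n": len(values),
        }
    return baseline


# Example with made-up values; real baselines come from your serving logs
baseline = build_baseline(
    {"ticket_length": [10.0, 20.0, 30.0]},
    {"category": ["billing", "billing", "login"]},
)
# Round-trip through JSON to confirm the summary can be persisted and reloaded
restored = json.loads(json.dumps(baseline))
```

Storing counts rather than probabilities keeps the categorical baseline usable with count-based drift metrics and lets you merge baselines later.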