16. CI/CD for ML

Chapter 16 of 24 · 20 min

KEY INSIGHT

Continuous Integration and Continuous Deployment principles apply to machine learning, but ML CI/CD differs fundamentally from software CI/CD. Model artifacts cannot be unit tested in isolation: they must be evaluated against data, which makes your test suite dependent on distribution assumptions that may shift.

### The ML CI/CD Challenge

Traditional software CI/CD tests code behavior: function X with input Y produces output Z. The same inputs always produce the same outputs. ML models don't guarantee this. The same input can produce different outputs across model versions, training runs, or random seeds.

ML CI/CD must also evaluate whether a new model variant improves or degrades performance on relevant data. This requires maintaining evaluation datasets, defining acceptable performance bounds, and implementing comparison frameworks.

### Pipeline Architecture

```yaml
# YAML: ML CI/CD pipeline structure (conceptual; not a specific tool's schema)
# In practice: Airflow, Prefect, Metaflow, or similar orchestration

name: ml_model_training_pipeline
schedule: "0 2 * * *"  # Daily at 2 AM

stages:
  # Stage 1: Data validation
  - name: validate_data
    tasks:
      - check_schema:
          dataset: training_data
          expected_columns: [feature_1, feature_2, feature_3]
      - check_distribution:
          dataset: training_data
          reference: baseline_distribution.json
          threshold: 0.1  # Max KL divergence

  # Stage 2: Training
  - name: train_model
    tasks:
      - train:
          model_type: gradient_boosted
          hyperparameters: config/hyperparameters.yaml
          data: validated_training_data
      - register:
          artifact: trained_model_v{version}
          metrics: training_metrics.json

  # Stage 3: Evaluation
  - name: evaluate_model
    tasks:
      - unit_tests:
          runs: 50  # Fast smoke tests
          performance_threshold: 0.85
      - integration_tests:
          runs: full_validation_set
          performance_threshold: 0.82
          compared_to: production_model
      - shadow_tests:
          traffic_percentage: 5
          duration_hours: 48

  # Stage 4: Deployment
  - name: deploy_model
    condition: all_previous_stages_passed
    tasks:
      - promote:
          source: staging
          target: production
      - rollback_plan:
          previous_version: retained
          auto_rollback_threshold: degradation > 0.05
```

### Testing Levels for ML Systems

**Unit tests** verify that model components function correctly in isolation: preprocessing transformations, feature engineering logic, loss function implementations. These tests run quickly and catch obvious bugs.

**Model tests** verify that the trained model behaves correctly: output shapes match expectations, confidence scores fall in valid ranges, and batching produces consistent results. These tests are relatively fast (seconds to minutes).

**Performance tests** evaluate model quality and cost: accuracy, latency, throughput, memory usage. These slower tests run against full validation sets and must pass before deployment.

**Shadow tests** deploy new models alongside production models, routing an isolated share of traffic to the new model without affecting users. This lets you compare predictions and performance without live risk.

### Integration with Local Deployment

Local AI deployments complicate CI/CD because development and production environments are rarely identical. A model trained on your build server must perform identically across your deployed edge devices, so differences in library versions or hardware need to be detected before they reach users.

```python
# Python: Environment consistency validation
import platform
import time
from importlib import metadata


class EnvironmentValidator:
    """
    Validates that training and serving environments are consistent.
    Critical for local AI deployment where hardware varies.
    """

    def __init__(self, expected_dependencies: list[str]):
        self.expected_dependencies = expected_dependencies

    def capture_environment_hash(self) -> dict:
        """Capture environment characteristics for comparison."""
        # Core library versions, read from installed package metadata
        deps = {}
        for dep in self.expected_dependencies:
            try:
                deps[dep] = metadata.version(dep)
            except metadata.PackageNotFoundError:
                deps[dep] = "not_found"

        return {
            "dependencies": deps,
            "python_version": platform.python_version(),
            # Hardware info (for non-trivial hardware dependencies)
            "cpu_info": self._get_cpu_info(),
            "cuda_available": self._check_cuda(),
            "timestamp": int(time.time()),
        }

    def validate_consistency(
        self, training_env: dict, serving_env: dict
    ) -> tuple[bool, list[str]]:
        """Compare environments, flagging inconsistencies."""
        issues = []

        # Check dependency alignment
        for dep, expected_version in training_env["dependencies"].items():
            serving_version = serving_env["dependencies"].get(dep, "not_found")
            if serving_version != expected_version:
                issues.append(
                    f"Dependency mismatch: {dep} "
                    f"expected {expected_version}, got {serving_version}"
                )

        # Critical checks based on your deployment requirements
        if training_env.get("cuda_available") != serving_env.get("cuda_available"):
            issues.append("CUDA availability mismatch between environments")

        return len(issues) == 0, issues

    def _get_cpu_info(self) -> str:
        """Return the CPU model name on Linux, or "unknown" elsewhere."""
        try:
            with open("/proc/cpuinfo") as f:
                for line in f:
                    if line.startswith("model name"):
                        return line.split(":", 1)[1].strip()
        except OSError:
            pass
        return "unknown"

    def _check_cuda(self) -> bool:
        """Report CUDA availability via PyTorch; False if PyTorch is absent."""
        try:
            import torch
        except ImportError:
            return False
        return torch.cuda.is_available()
```

### CI/CD Anti-Patterns

Avoid testing models only against training data. Performance on training data says little about how a model generalizes. Test against held-out validation data and, ideally, against production traffic patterns.

Avoid deploying without rollback capability. Auto-rollback on degradation is not optional for production ML systems. You will occasionally ship models that perform worse than their predecessors.

Avoid manual promotion gates. If humans must approve every deployment, approval tends to become a rubber stamp. Automate quality gates and route only the exceptions to human review, starting with automated notification.
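
To make the idea of an automated quality gate concrete, here is a minimal sketch under stated assumptions. The names `Decision`, `GateConfig`, `promotion_gate`, and `should_rollback` are hypothetical, the thresholds are illustrative values loosely mirroring the pipeline configuration, and both scores are assumed to be "higher is better" metrics measured on the same held-out set.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"
    REJECT = "reject"
    HUMAN_REVIEW = "human_review"


@dataclass(frozen=True)
class GateConfig:
    # Hypothetical thresholds; tune these for your own metric and risk tolerance.
    min_score: float = 0.82      # absolute floor on the validation metric
    review_margin: float = 0.01  # differences smaller than this are ambiguous


def promotion_gate(
    candidate: float, production: float, cfg: GateConfig = GateConfig()
) -> Decision:
    """Decide whether a candidate model may replace the production model."""
    if candidate < cfg.min_score:
        return Decision.REJECT          # fails the absolute floor
    delta = candidate - production
    if delta >= cfg.review_margin:
        return Decision.PROMOTE         # clear improvement
    if delta <= -cfg.review_margin:
        return Decision.REJECT          # clear regression
    return Decision.HUMAN_REVIEW        # too close to call automatically


def should_rollback(live: float, baseline: float, threshold: float = 0.05) -> bool:
    """Trigger rollback when the live metric drops more than `threshold` below baseline."""
    return (baseline - live) > threshold


if __name__ == "__main__":
    print(promotion_gate(0.90, 0.85))   # clear improvement
    print(promotion_gate(0.855, 0.85))  # within the review margin
    print(should_rollback(0.78, 0.85))  # degradation above threshold
```

Only the ambiguous middle band reaches a human, which keeps review meaningful instead of routine. The rollback check is kept separate because it runs after deployment, against live metrics rather than validation scores.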

EXERCISE

Build a minimal CI pipeline for your model. Include: (1) data validation checks, (2) training run with performance logging, (3) evaluation against a held-out test set with pass/fail thresholds, (4) artifact registration on success. Run the pipeline and verify it fails appropriately when your test data quality degrades.
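
As one possible starting point for step (1), the sketch below checks a schema and compares a feature's distribution against a reference using KL divergence over a fixed histogram. It is an illustration, not a prescribed solution: the function names, the bin edges, the add-one smoothing, and the 0.1 threshold are all assumptions, and values outside the bin edges are simply ignored.

```python
import math


def check_schema(rows: list[dict], expected_columns: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the schema check passed."""
    problems = []
    for i, row in enumerate(rows):
        missing = [c for c in expected_columns if c not in row]
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
    return problems


def histogram(values: list[float], edges: list[float]) -> list[float]:
    """Bin values into len(edges) - 1 buckets and return smoothed proportions."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for b in range(len(counts)):
            is_last = b == len(counts) - 1
            # Buckets are half-open, except the last, which includes its upper edge.
            if edges[b] <= v < edges[b + 1] or (is_last and v == edges[-1]):
                counts[b] += 1
                break
    # Add-one smoothing keeps every proportion non-zero so the log is defined.
    total = sum(counts) + len(counts)
    return [(c + 1) / total for c in counts]


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two discrete distributions over the same buckets."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def check_distribution(
    current: list[float],
    reference: list[float],
    edges: list[float],
    threshold: float = 0.1,
) -> bool:
    """Pass when the current data stays within `threshold` KL divergence of the reference."""
    p = histogram(current, edges)
    q = histogram(reference, edges)
    return kl_divergence(p, q) <= threshold
```

To confirm the pipeline fails when data quality degrades, feed `check_distribution` a deliberately shifted sample and verify that it returns `False`.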