17. Pipeline as Code

Chapter 17 of 24 · 25 min

KEY INSIGHT

Defining ML pipelines as source code ensures reproducibility, version control, and auditability. Pipeline as code treats data processing, feature engineering, training, and evaluation as declarative specifications rather than manual procedures. ### Why Pipeline as Code Matters Manual pipeline execution creates invisible dependencies on operator knowledge. When the person who knows how to run training leaves, institutional knowledge leaves with them. Worse, manual execution is not reproducible—running the same steps with the same inputs does not guarantee identical outputs without explicit version control. Pipeline as code encodes the complete workflow: data sources, preprocessing steps, model configuration, evaluation criteria, and deployment targets. Anyone with code access can inspect, reproduce, or modify the workflow. ### Declarative Pipeline Definition ```python # Python: Declarative pipeline definition # Using a conceptual framework (implementation depends on tooling) @dataclass class PipelineStage: name: str component: str # Reference to component module config: dict dependencies: list[str] # Stage names this depends on @dataclass class Pipeline: name: str version: str stages: list[PipelineStage] metadata: dict # Owner, description, tags def render(self) -> dict: """Render pipeline definition to execution format.""" return { "pipeline": self.name, "version": self.version, "stages": [ { "name": s.name, "component": s.component, "config": s.config, "upstream": s.dependencies } for s in self.stages ] } # Example: Declarative training pipeline training_pipeline = Pipeline( name="recommender_training", version="v3.2.1", stages=[ PipelineStage( name="extract", component="data.extract.database", config={ "query": "SELECT * FROM user_events WHERE date > :cutoff", "connection": "analytics_db", "batch_size": 10000 }, dependencies=[] ), PipelineStage( name="transform_features", component="features.user_engagement", config={ "window_days": 30, "aggregation": ["mean", "std", "count"] }, dependencies=["extract"] ), PipelineStage( name="train", component="model.matrix_factorization", config={ "factors": 64, "regularization": 0.01, "epochs": 50, "early_stopping_patience": 5 }, dependencies=["transform_features"] ), PipelineStage( name="evaluate", component="evaluation.recommender", config={ "metrics": ["recall@10", "ndcg@10", "coverage"], "test_split": 0.2, "thresholds": { "recall@10": 0.15, "ndcg@10": 0.12 } }, dependencies=["train"] ), PipelineStage( name="register", component="registry.register_model", config={ "registry": "s3://models/acme/", "promote_if": "all_metrics_passed" }, dependencies=["evaluate"] ) ], metadata={ "owner": "recommendations-team", "description": "Collaborative filtering for content recommendations", "schedule": "0 3 * * *", "notify_on_failure": ["[email protected]"] } ) # Render to execution format (e.g., Airflow, Kubeflow, etc.) pipeline_def = training_pipeline.render() print(pipeline_def) ``` ### Config-Driven Variation Effective pipeline as code supports configuration-driven execution without code modification. Training hyperparameters, data sources, evaluation thresholds—these should be parameters, not hardcoded values. ```python # Python: Configuration-driven pipeline execution from dataclasses import dataclass, field from typing import Any import yaml @dataclass class PipelineConfig: """Configuration for pipeline execution.""" name: str environment: str # dev, staging, production data_source: str model_config: dict evaluation_config: dict deployment_config: dict @classmethod def from_yaml(cls, path: str) -> "PipelineConfig": """Load configuration from YAML file.""" with open(path) as f: config_dict = yaml.safe_load(f) return cls(**config_dict) @classmethod def for_environment(cls, base_config: str, env: str) -> "PipelineConfig": """Load base config, override with environment-specific settings.""" base = cls.from_yaml(base_config) env_overrides = cls.from_yaml(f"{base_config.replace('.yaml', '')}.{env}.yaml") # Merge overrides (env values take precedence) return cls( name=base.name, environment=env, data_source=env_overrides.data_source or base.data_source, model_config={**base.model_config, **env_overrides.model_config}, evaluation_config={**base.evaluation_config, **env_overrides.evaluation_config}, deployment_config={**base.deployment_config, **env_overrides.deployment_config} ) # Example YAML configuration # config/base.yaml # name: sentiment_training # data_source: s3://data/sentiment/normalized # model_config: # architecture: transformer # max_length: 512 # learning_rate: 0.0001 # evaluation_config: # test_size: 0.2 # metrics: [accuracy, f1, latency_p95] # deployment_config: # target: local_edge # replicas: 1 ``` ### Versioning Considerations Every pipeline execution should be traceable to its source code version and configuration version. Store pipeline definitions in version control alongside model artifacts. Tag releases with semantic versioning that indicates breaking changes versus incremental improvements. ```python # Python: Pipeline execution metadata tracking @dataclass class ExecutionRecord: """Record of pipeline execution for auditability.""" pipeline_version: str config_version: str execution_id: str # Unique per execution timestamp: str git_commit: str triggered_by: str # schedule, manual, api stages_completed: list[str] stages_failed: list[str] outputs: dict # Artifact locations, metrics def record_execution(pipeline: Pipeline, execution_id: str) -> ExecutionRecord: """Create execution record for tracking.""" import subprocess git_commit = subprocess.run( ["git", "rev-parse", "HEAD"], capture_output=True, text=True ).stdout.strip() return ExecutionRecord( pipeline_version=pipeline.version, config_version="N/A", # Populate from loaded config execution_id=execution_id, timestamp=subprocess.run( ["date", "-Iseconds"], capture_output=True, text=True ).stdout.strip(), git_commit=git_commit, triggered_by="unknown", # Populate from execution context stages_completed=[], stages_failed=[], outputs={} ) ```

EXERCISE

Define your model training pipeline declaratively. Include data validation, feature transformation, training, evaluation, and registration stages. Create separate configuration files for development and production environments. Verify that pipeline definition is valid YAML and can be parsed by your execution framework.