07. Pipeline Orchestration
Pipeline orchestration turns ad-hoc training scripts into automated, reproducible workflows. Instead of "run this script when you remember," you define pipelines that trigger on a schedule, on data arrival, or on completion of an upstream pipeline.
A pipeline is a directed acyclic graph (DAG) of tasks. Each task is a discrete unit: fetch data, validate data, train model, evaluate model, deploy model. Orchestrators handle execution order, failure recovery, and logging.
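To make the DAG idea concrete, here is a minimal sketch that uses only the Python standard library (graphlib, available from Python 3.9). The task names and the run_pipeline helper are illustrative assumptions, not the API of any particular orchestrator. The sketch shows only the core idea: derive an execution order from the dependencies, run each task once, log the result, and stop when a task fails.

```python
from graphlib import TopologicalSorter

# Hypothetical task names mirroring the steps described above.
# Each key maps to the set of tasks that must finish before it can start.
DEPENDENCIES = {
    "fetch_data": set(),
    "validate_data": {"fetch_data"},
    "train_model": {"validate_data"},
    "evaluate_model": {"train_model"},
    "deploy_model": {"evaluate_model"},
}


def run_pipeline(dependencies, tasks):
    """Run each task once, in an order that respects the dependencies.

    Returns a list of (task_name, status) tuples. Execution stops at the
    first failure, so downstream tasks never run on bad inputs.
    """
    log = []
    for name in TopologicalSorter(dependencies).static_order():
        try:
            tasks[name]()
            log.append((name, "ok"))
        except Exception as exc:
            log.append((name, f"failed: {exc}"))
            break
    return log


def make_placeholder_task(name):
    """Build a stand-in task; a real task would do the actual work."""
    def task():
        print(f"running {name}")
    return task


TASKS = {name: make_placeholder_task(name) for name in DEPENDENCIES}
execution_log = run_pipeline(DEPENDENCIES, TASKS)
```

A production orchestrator layers retries, parallel execution of independent tasks, scheduling, and persistent logs on top of this ordering step. TopologicalSorter also raises CycleError if the dependencies contain a cycle, which is why the graph must be acyclic.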
Why orchestrate ML pipelines specifically? ML pipelines have unique characteristics: data-dependent execution times, resource-intensive training steps, evaluation gates that can halt progression, and retraining triggered by drift detection.
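Two of these characteristics, evaluation gates and drift-triggered retraining, come down to simple decisions that a pipeline makes between steps. The sketch below is one possible shape for them. The function names, the "higher metric is better" convention, and the threshold values are all assumptions chosen for illustration, not recommended defaults.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    reason: str


def evaluation_gate(candidate_metric, baseline_metric, min_improvement=0.0):
    """Decide whether a newly trained model may progress to deployment.

    Assumes a higher metric is better (for example accuracy). The candidate
    passes only if it beats the baseline by at least min_improvement.
    """
    delta = candidate_metric - baseline_metric
    if delta >= min_improvement:
        return GateResult(True, f"improved by {delta:.3f}")
    return GateResult(False, f"improvement {delta:.3f} below {min_improvement:.3f}")


def should_retrain(drift_score, threshold=0.2):
    """Trigger retraining when a drift score exceeds a threshold.

    The score and the threshold are placeholders; how drift is measured
    depends on the monitoring setup.
    """
    return drift_score > threshold


# Example decisions with made-up numbers.
gate = evaluation_gate(candidate_metric=0.90, baseline_metric=0.85, min_improvement=0.01)
blocked = evaluation_gate(candidate_metric=0.84, baseline_metric=0.85, min_improvement=0.01)
print(gate.passed, blocked.passed, should_retrain(0.35))
```

In an orchestrated pipeline, a failed gate halts the downstream deploy task, and a positive drift check starts a new pipeline run.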
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
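One way to capture this record is a short script that writes the details out as JSON. This is a sketch that uses only the standard library. The build_run_record helper, the package list, the data path, and the output string are placeholders to replace with your own values.

```python
import json
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or a marker if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def build_run_record(packages, data_path, observed_output, notes=""):
    """Collect the details the checkpoint asks for into one dictionary."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
        "data_path": data_path,
        "observed_output": observed_output,
        "notes": notes,
    }


# Placeholder values: substitute the packages, path, and output from your run.
record = build_run_record(
    packages=["pip"],
    data_path="data/sample.csv",
    observed_output="pipeline finished: 5 tasks ok",
    notes="CPU only; result may differ on GPU",
)
print(json.dumps(record, indent=2))
```

Saving the printed JSON next to the exercise keeps the constraints (backend, memory, versions) attached to the result they explain.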
Identify three training workflows you currently run manually. For each, document its inputs, outputs, execution time, failure modes, and dependencies. These workflows become your candidates for orchestration.