01. Training Pipeline Overview
A training pipeline is not a script; it is a system. Treating it as a script leads to irreproducible results, undebuggable failures, and production incidents at 3 AM. This chapter establishes the mental model for everything that follows.
The Pipeline Abstraction
A training pipeline has five logical stages:
- Data Ingestion – Fetching raw data from storage (S3, GCS, local filesystem)
- Data Processing – Transformations, cleaning, feature engineering
- Batching – Grouping samples into mini-batches with proper collation
- Training Loop – Forward pass, loss computation, backward pass, weight updates
- Checkpointing and Logging – Saving state and tracking metrics
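Each stage has distinct resource characteristics. Data ingestion is I/O-bound; the training loop is compute-bound. If the two run serially in the same process, the GPU sits idle while each batch is read from storage. That idle time is GPU starvation.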
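To make the separation of I/O-bound and compute-bound work concrete, here is a minimal sketch of prefetching with a background thread, using only the standard library. The names `prefetch` and `buffer_size` are illustrative assumptions, not part of any library. In PyTorch the same idea is available through the `num_workers` and `prefetch_factor` arguments of `DataLoader`, so treat this as an illustration of the mechanism rather than something to ship.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the stream


def prefetch(iterable, buffer_size=4):
    """Yield items from `iterable` while a background thread reads ahead.

    Hypothetical helper: the consumer (the training loop) only blocks when
    the buffer is empty, so slow reads overlap with compute.
    """
    buf = queue.Queue(maxsize=buffer_size)
    errors = []

    def producer():
        try:
            for item in iterable:
                buf.put(item)  # blocks when the buffer is full
        except Exception as exc:  # hand reader failures to the consumer
            errors.append(exc)
        finally:
            buf.put(_SENTINEL)

    # Daemon thread: it will not keep the process alive if the consumer
    # stops iterating early.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buf.get()
        if item is _SENTINEL:
            if errors:
                raise errors[0]
            return
        yield item


# Usage: wrap any iterable of batches.
batches = list(prefetch(range(5), buffer_size=2))
```

A bounded queue matters here: an unbounded one would let a fast reader fill memory with batches the training loop has not consumed yet.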
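# A minimal but complete pipeline skeleton.
# build_model, build_dataloader, train_epoch, validate, and checkpoint
# are helpers this skeleton assumes exist; each one implements a stage.
import torch
from torch.utils.data import DataLoader


def training_pipeline(config):
    # Build every component from the same config object.
    model = build_model(config)
    optimizer = torch.optim.AdamW(model.parameters(), lr=config.lr)
    train_loader = build_dataloader(config, split="train")
    val_loader = build_dataloader(config, split="val")

    for epoch in range(config.epochs):
        train_epoch(model, optimizer, train_loader)
        val_loss = validate(model, val_loader)
        # Save only after validation, so val_loss describes the saved weights.
        checkpoint(model, optimizer, epoch, val_loss)
    return model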
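The skeleton leaves `checkpoint` undefined. One possible sketch is shown here, assuming checkpoints go to a local directory. The `ckpt_dir` parameter, the file naming scheme, and the `resume` helper are assumptions added for illustration. The state is written to a temporary file and then renamed, so a crash mid-save does not leave a truncated checkpoint behind.

```python
import os

import torch


def checkpoint(model, optimizer, epoch, val_loss, ckpt_dir="checkpoints"):
    """Save the state needed to resume training after `epoch`."""
    os.makedirs(ckpt_dir, exist_ok=True)
    state = {
        "epoch": epoch,
        "val_loss": float(val_loss),
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    final_path = os.path.join(ckpt_dir, f"epoch_{epoch:04d}.pt")
    tmp_path = final_path + ".tmp"
    torch.save(state, tmp_path)
    # os.replace is atomic on POSIX when both paths share a filesystem.
    os.replace(tmp_path, final_path)
    return final_path


def resume(model, optimizer, path):
    """Restore model and optimizer in place; return the next epoch to run."""
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1
```

Saving the optimizer state alongside the weights is what makes a resumed run continue rather than restart: AdamW keeps per-parameter moment estimates that would otherwise be reset to zero.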
Why Structure Matters
Jupyter notebooks fail as pipelines because they carry hidden state, allow cells to run in any order, and keep no checkpoints between runs. Monolithic scripts fail because they mix concerns: when data loading, training, and logging are tangled together, a failure in one is hard to isolate from the others.
The answer is modular code with explicit interfaces between stages. Pass a config object between stages. Never use global variables.
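A minimal sketch of such a config object, using a frozen dataclass from the standard library. The class name `TrainConfig`, its fields other than `lr` and `epochs`, and all default values are assumptions chosen for illustration. Freezing the dataclass means no stage can silently mutate settings that another stage depends on.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    # Field names and defaults are illustrative, not prescribed values.
    data_path: str = "data/train"
    batch_size: int = 32
    lr: float = 3e-4
    epochs: int = 10
    seed: int = 0


def steps_per_epoch(config: TrainConfig, num_samples: int) -> int:
    """Example stage helper: everything it needs arrives as arguments."""
    return math.ceil(num_samples / config.batch_size)


base = TrainConfig()
# Variants are new objects; the base config is never edited in place.
sweep = replace(base, lr=1e-3)
```

Because each stage receives the config explicitly, a run can be reproduced by logging one object instead of hunting for values scattered across modules.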
Failure Modes to Expect
- GPU utilization drops to 0% because the data loader is too slow (single-threaded reading, no prefetching)
- OOM errors from batch sizes too large for available memory
- Stale checkpoints because the save logic runs before validation completes
- Metric confusion when training and validation metrics are computed differently
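The last failure mode often comes down to averaging. One possible cause, shown here as an assumption rather than a diagnosis of any specific codebase, is reporting the mean of per-batch means for training while validation averages over samples. The two disagree whenever the final batch is smaller. A sketch of a single accumulator that both loops could share (the name `MeanMeter` is hypothetical):

```python
class MeanMeter:
    """Sample-weighted running mean of a per-batch metric."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, batch_mean, batch_size):
        # Weight each batch by its size so partial batches count correctly.
        self.total += float(batch_mean) * batch_size
        self.count += batch_size

    @property
    def mean(self):
        if self.count == 0:
            raise ValueError("no samples recorded")
        return self.total / self.count


# Two batches: 32 samples with mean loss 1.0, then 8 samples with mean 2.0.
meter = MeanMeter()
meter.update(1.0, 32)
meter.update(2.0, 8)

naive = (1.0 + 2.0) / 2  # mean of batch means: 1.5
weighted = meter.mean    # per-sample mean: (32*1.0 + 8*2.0) / 40 = 1.2
```

Using the same accumulator in `train_epoch` and `validate` removes one source of disagreement between the two curves.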
Draw your current training setup as five boxes connected by arrows. Identify which box is the bottleneck by asking: "What is the GPU waiting for?"
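To answer that question with numbers, one option is to time how long each iteration waits for data versus how long it spends in the training step. The function below is a rough sketch; the name `time_data_vs_compute` and its parameters are assumptions. On a GPU, `step_fn` should finish with `torch.cuda.synchronize()`, because CUDA kernels launch asynchronously and the step would otherwise appear faster than it is.

```python
import time


def time_data_vs_compute(loader, step_fn, max_steps=50):
    """Split wall-clock time into waiting-for-data and running-the-step."""
    data_s = 0.0
    compute_s = 0.0
    it = iter(loader)
    for _ in range(max_steps):
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is the data loader's fault
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)  # forward, backward, optimizer step
        t2 = time.perf_counter()
        data_s += t1 - t0
        compute_s += t2 - t1
    total = data_s + compute_s
    return {
        "data_s": data_s,
        "compute_s": compute_s,
        "data_fraction": data_s / total if total > 0 else 0.0,
    }


def slow_loader(n, delay=0.01):
    """Stand-in for an I/O-bound reader; sleeps before each batch."""
    for i in range(n):
        time.sleep(delay)
        yield i


stats = time_data_vs_compute(slow_loader(5), step_fn=lambda batch: None)
```

A `data_fraction` close to 1 points at ingestion or processing as the bottleneck; a value close to 0 means the GPU is the limiting stage and the data side is keeping up.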