Training Pipeline Overview — Custom Training Pipelines (Chapter 1)

A training pipeline is not a script—it is a system. Treating it as a script leads to irreproducible results, undebuggable failures, and production incidents at 3 AM. This chapter establishes the mental model for everything that follows.

The Pipeline Abstraction

A training pipeline has five logical stages:

Data Ingestion – Fetching raw data from storage (S3, GCS, local filesystem)
Data Processing – Transformations, cleaning, feature engineering
Batching – Grouping samples into mini-batches with proper collation
Training Loop – Forward pass, loss computation, backward pass, weight updates
Checkpointing and Logging – Saving state and tracking metrics

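Each stage has distinct resource characteristics. Data ingestion is I/O-bound; the training loop is compute-bound. Running them in lockstep on a single thread leaves the GPU idle while it waits for the next batch, a condition known as GPU starvation.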
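
To make the starvation problem concrete, here is a minimal, framework-agnostic sketch of prefetching that uses only the Python standard library. The names prefetch, slow_reader, and run are hypothetical, and the sleep calls only simulate storage latency and a training step. In PyTorch, the DataLoader's num_workers and prefetch_factor arguments serve a similar purpose using worker processes rather than a thread.

```python
import queue
import threading
import time

_SENTINEL = object()  # marks the end of the stream


def prefetch(batches, buffer_size=2):
    """Yield items from `batches`, loading up to `buffer_size` ahead in a background thread."""
    buf = queue.Queue(maxsize=buffer_size)
    error = []  # filled by the producer if the source iterator raises

    def producer():
        try:
            for batch in batches:
                buf.put(batch)  # blocks while the buffer is full
        except Exception as exc:
            error.append(exc)
        finally:
            buf.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()  # blocks only when the producer has fallen behind
        if item is _SENTINEL:
            if error:
                raise error[0]  # surface ingestion failures in the consumer
            return
        yield item


def slow_reader(n, delay=0.01):
    """Hypothetical stand-in for an I/O-bound ingestion stage."""
    for i in range(n):
        time.sleep(delay)  # simulated storage latency
        yield i


def run(n=5):
    """Consume prefetched batches; the sleep stands in for a compute-bound training step."""
    seen = []
    for batch in prefetch(slow_reader(n)):
        time.sleep(0.01)  # simulated compute
        seen.append(batch)
    return seen
```

With the buffer in place, the reader fetches upcoming batches while the consumer is busy, so the two delays overlap instead of adding up. A consumer that stops early leaves the daemon thread blocked until the process exits; a production version would need an explicit shutdown path.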

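# A minimal but complete pipeline skeleton.
# build_model, build_dataloader, train_epoch, validate, and checkpoint are
# placeholders for the stage implementations; they are not defined here.
import torch
from torch.utils.data import DataLoader

def training_pipeline(config):
    # Model and optimizer are built from the config, not from globals.
    model = build_model(config)
    optimizer = torch.optim.AdamW(model.parameters(), lr=config.lr)

    # Ingestion, processing, and batching live behind the loaders.
    train_loader = build_dataloader(config, split="train")
    val_loader = build_dataloader(config, split="val")

    for epoch in range(config.epochs):
        train_epoch(model, optimizer, train_loader)
        val_loss = validate(model, val_loader)
        # Checkpoint only after validation, so val_loss matches the saved weights.
        checkpoint(model, optimizer, epoch, val_loss)

    return model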

Why Structure Matters

Jupyter notebooks make poor pipelines: they carry hidden state, their cells can be run in any order so results are hard to reproduce, and nothing is checkpointed between runs. Monolithic scripts fail for a different reason: they mix concerns. When data loading, training, and logging are tangled together, a failure in one cannot be isolated from the others.

The answer is modular code with explicit interfaces between stages. Pass a config object between stages. Never use global variables.
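
One way to make the config-passing rule concrete is an immutable dataclass that every stage receives as an argument. This is a minimal sketch, not a prescribed layout: PipelineConfig, describe_loader, and describe_optimizer are hypothetical names, and every field other than lr and epochs is an assumption added for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical config; only lr and epochs appear in the skeleton above."""
    lr: float = 3e-4
    epochs: int = 10
    batch_size: int = 32          # assumed field
    data_path: str = "data/train"  # assumed field
    seed: int = 0                  # assumed field


def describe_loader(config: PipelineConfig, split: str) -> dict:
    """Stand-in for a data stage: it reads only the fields it needs."""
    return {"split": split, "batch_size": config.batch_size, "path": config.data_path}


def describe_optimizer(config: PipelineConfig) -> dict:
    """Stand-in for optimizer construction, driven entirely by the config."""
    return {"name": "AdamW", "lr": config.lr}


base = PipelineConfig()
# replace() returns a new object, so `base` stays untouched during a sweep.
sweep = replace(base, lr=1e-3)
```

Because the dataclass is frozen, no stage can silently mutate a setting that another stage depends on; any variation has to be created explicitly and passed along.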

Failure Modes to Expect

GPU utilization drops toward 0% because the data loader cannot keep up (single-threaded reads, no prefetching)
OOM errors from batch sizes too large for available memory
Stale checkpoints because the save logic runs before validation completes
Metric confusion when training and validation metrics are computed differently
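
The stale-checkpoint failure can be avoided by saving only after validation has returned and by writing the file atomically. The sketch below assumes a JSON file as a stand-in for a real checkpoint (in PyTorch this would typically be torch.save on the model and optimizer state); save_checkpoint and run_epochs are hypothetical names, and the list of validation losses stands in for calling validate() each epoch.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write `state` to `path` atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # Rename within one directory; atomic on POSIX filesystems.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # never leave a half-written file behind
        raise


def run_epochs(val_losses, ckpt_dir):
    """Hypothetical loop: each entry of `val_losses` is one epoch's validation result."""
    best = float("inf")
    for epoch, val_loss in enumerate(val_losses):
        # Validation has finished by this point, so the recorded val_loss
        # describes the same weights that are being saved.
        state = {"epoch": epoch, "val_loss": val_loss}
        save_checkpoint(state, os.path.join(ckpt_dir, "last.json"))
        if val_loss < best:
            best = val_loss
            save_checkpoint(state, os.path.join(ckpt_dir, "best.json"))
    return best
```

Keeping a "last" and a "best" file separates resuming from model selection: a crashed run restarts from the most recent epoch, while evaluation uses the epoch with the lowest validation loss.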