Data Pipeline Design — Custom Training Pipelines (Chapter 2)

The data pipeline is often the bottleneck in ML training. A 32-TFLOPS GPU sitting idle waiting for data is a $30,000 paperweight.

The DataLoader Architecture

PyTorch's DataLoader handles batching, shuffling, multiprocessing, and collation. Understanding its internals prevents common mistakes.

from torch.utils.data import DataLoader, Dataset

class MyDataset(Dataset):
    def __init__(self, data_path, transform=None):
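        # load_from_disk is a placeholder that is not defined in this snippet;
        # substitute a loader that returns an indexable collection.
        self.data = load_from_disk(data_path)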
        self.transform = transform
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        sample = self.data[idx]
        if self.transform:
            sample = self.transform(sample)
        return sample

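# Example configuration, with common pitfalls noted.
# `dataset` is an instance of MyDataset (or any other Dataset).
loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,              # Set to False for sequential data where order matters
    num_workers=4,             # Often too few for fast NVMe storage; benchmark higher values
    pin_memory=True,           # Page-locked host memory speeds up CPU-to-GPU copies
    prefetch_factor=2,         # Batches prefetched per worker; 2 is the default, raise it if the GPU still waits
    persistent_workers=True,   # Avoid worker restarts between epochs
    drop_last=True             # Drops the smaller final batch so every batch has the same size
)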

num_workers: The Hidden Lever

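num_workers=0 means the main process loads data synchronously, so the GPU waits while each batch is prepared. If loading a batch takes 50 ms and the GPU processes it in 10 ms, each step takes 60 ms and the GPU is idle for 50 ms of it, about 83% of the time.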
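
The arithmetic above can be checked with a small back-of-the-envelope model. This is a sketch under simplifying assumptions: load and compute times are constant, and "overlapped" means a single background loader that hides compute time perfectly. The function name gpu_idle_fraction is illustrative, not part of PyTorch.

```python
def gpu_idle_fraction(load_ms: float, compute_ms: float, overlapped: bool = False) -> float:
    """Fraction of each training step during which the GPU waits for data."""
    if overlapped:
        # Loading runs in the background: a step lasts as long as the slower stage.
        step_ms = max(load_ms, compute_ms)
    else:
        # Synchronous loading (num_workers=0): load and compute run back to back.
        step_ms = load_ms + compute_ms
    return (step_ms - compute_ms) / step_ms


sync_idle = gpu_idle_fraction(50, 10)
overlap_idle = gpu_idle_fraction(50, 10, overlapped=True)
print(f"synchronous loading:    {sync_idle:.1%} idle")
print(f"one overlapped loader:  {overlap_idle:.1%} idle")
```

Under this model, overlapping a single loader with compute barely helps when loading is five times slower than compute. Keeping the GPU busy would take roughly five loaders working in parallel, assuming loading scales linearly with worker count.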

Rule of thumb: start with num_workers = num_cpus / num_gpus (CPU cores divided by the number of GPUs on the machine), then benchmark. More workers help when:

Data lives on network storage (S3, GCS)
Transformations are CPU-heavy (image decoding, tokenization)
Storage is NVMe with high IOPS
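
The rule of thumb can be turned into a starting value plus a few neighbours to benchmark. This is a minimal sketch: suggest_num_workers and candidate_worker_counts are hypothetical helper names, and trying half and double the starting value is an assumption made here, not a PyTorch recommendation.

```python
import os
from typing import List, Optional


def suggest_num_workers(num_gpus: int, num_cpus: Optional[int] = None) -> int:
    """Starting point for num_workers: CPU cores divided by GPUs, at least 1."""
    if num_cpus is None:
        num_cpus = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, num_cpus // max(1, num_gpus))


def candidate_worker_counts(start: int) -> List[int]:
    """Values to benchmark around the starting point: half, same, double."""
    return sorted({max(1, start // 2), start, start * 2})


start = suggest_num_workers(num_gpus=4, num_cpus=32)
print(start, candidate_worker_counts(start))
```

Each candidate would then be timed against the real dataset, since the best value depends on storage, transforms, and batch size.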

The pin_memory Trap

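pin_memory=True places each batch in page-locked (pinned) host memory, which accelerates CPU-to-GPU transfers and allows them to run asynchronously. Pinning costs an extra copy and ties up memory that cannot be paged out, so it adds overhead on CPU-bound pipelines. It helps most when GPU computation is fast enough that transfer time is a noticeable share of each step. Profile before assuming it helps.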
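
Pinned memory is usually paired with non_blocking=True on the host-to-device copy, so the transfer can overlap with other work. The sketch below uses a toy in-memory TensorDataset and a hypothetical make_loader helper, both chosen for illustration. It enables pinning only when CUDA is available, so it also runs on CPU-only machines.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(pin: bool) -> DataLoader:
    """Build a loader over a toy dataset of 256 random samples."""
    features = torch.randn(256, 16)
    labels = torch.randint(0, 2, (256,))
    return DataLoader(TensorDataset(features, labels), batch_size=32, pin_memory=pin)


use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
loader = make_loader(pin=use_cuda)  # pinning only matters when copying to a GPU

num_batches = 0
for features, labels in loader:
    # With a pinned source tensor, non_blocking=True lets the copy run asynchronously.
    features = features.to(device, non_blocking=use_cuda)
    labels = labels.to(device, non_blocking=use_cuda)
    num_batches += 1

print(f"moved {num_batches} batches to {device}")
```

To judge whether pinning helps on a given pipeline, time the same loop with pin set to True and to False on the real dataset.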

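# Benchmark data loading in isolation: iterate one epoch with no model work.
# The total includes worker startup, so use enough batches to amortize it.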
import time

loader = DataLoader(dataset, batch_size=32, num_workers=4)

start = time.perf_counter()
for batch in loader:
    pass
elapsed = time.perf_counter() - start
print(f"Total time: {elapsed:.2f}s")
print(f"Time per batch: {elapsed / len(loader):.4f}s")
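
Worker startup can dominate short runs, so it may be worth excluding the first few batches from the measurement. The helper below is a sketch that works on any iterable, including a DataLoader. The names time_batches and slow_batches are hypothetical, and the default of two warmup batches is an arbitrary choice.

```python
import time
from typing import Iterable, Iterator


def time_batches(batches: Iterable, warmup: int = 2) -> dict:
    """Iterate once over `batches`, timing only the batches after `warmup`."""
    count = 0
    start = time.perf_counter()
    for count, _batch in enumerate(batches, start=1):
        if count == warmup:
            # Restart the clock once the warmup batches have arrived.
            start = time.perf_counter()
    elapsed = time.perf_counter() - start
    timed = max(count - warmup, 0)
    return {
        "batches": timed,
        "seconds": elapsed,
        "seconds_per_batch": elapsed / timed if timed else None,
    }


def slow_batches(n: int, delay_s: float) -> Iterator[int]:
    """Stand-in for a data loader: yields n items, sleeping before each one."""
    for i in range(n):
        time.sleep(delay_s)
        yield i


stats = time_batches(slow_batches(10, 0.01), warmup=2)
print(f"{stats['batches']} batches, {stats['seconds_per_batch']:.4f}s per batch")
```

Passing a real loader, for example time_batches(loader), gives a per-batch figure that can be compared across num_workers, prefetch_factor, and pin_memory settings.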

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.