02. Data Pipeline Design
The data pipeline is often the bottleneck in ML training. A 32-TFLOPS GPU sitting idle waiting for data is a $30,000 paperweight.
The DataLoader Architecture
PyTorch's DataLoader handles batching, shuffling, multiprocessing, and collation. Understanding its internals prevents common mistakes.
from torch.utils.data import DataLoader, Dataset

class MyDataset(Dataset):
    def __init__(self, data_path, transform=None):
        # load_from_disk is not defined in this chapter; substitute your own loader
        self.data = load_from_disk(data_path)
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        if self.transform:
            sample = self.transform(sample)
        return sample
# Common configuration mistakes
loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,             # DON'T shuffle here for sequential data
    num_workers=4,            # Under-configured for NVMe storage
    pin_memory=True,          # Critical for GPU training
    prefetch_factor=2,        # This is the default (2 batches per worker); often too low
    persistent_workers=True,  # Avoid worker restarts between epochs
    drop_last=True            # Prevents batch-size variance at epoch end
)
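To make the batching, shuffling, and drop_last behavior concrete, here is a minimal pure-Python sketch of the flow a loader follows: pick an index order, group indices into batches, fetch each sample, and collate. It is a simplified illustration, not PyTorch's implementation; the function name iter_batches and its seed parameter are hypothetical, and collation is reduced to building a list.

```python
import random
from typing import Any, Iterator, List, Sequence


def iter_batches(
    dataset: Sequence[Any],
    batch_size: int,
    shuffle: bool = False,
    drop_last: bool = False,
    seed: int = 0,
) -> Iterator[List[Any]]:
    """Yield lists of samples, mimicking the sampler -> fetch -> collate flow."""
    indices = list(range(len(dataset)))
    if shuffle:
        # Sampler step: decide the order in which samples are visited.
        random.Random(seed).shuffle(indices)

    for start in range(0, len(indices), batch_size):
        batch_indices = indices[start:start + batch_size]  # batch sampler step
        if drop_last and len(batch_indices) < batch_size:
            break  # skip the short final batch
        # Fetch step: one __getitem__ call per index.
        # Collate step: here simply a list (a real collate stacks tensors).
        yield [dataset[i] for i in batch_indices]


if __name__ == "__main__":
    data = list(range(10))
    print([len(b) for b in iter_batches(data, batch_size=3)])
    print([len(b) for b in iter_batches(data, batch_size=3, drop_last=True)])
```

With 10 samples and a batch size of 3, the first call yields batch sizes [3, 3, 3, 1] and the second yields [3, 3, 3], which is the batch-size variance that drop_last=True removes.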
num_workers: The Hidden Lever
num_workers=0 means the main process loads data synchronously, so loading and GPU compute never overlap. If loading a batch takes 50ms and the GPU processes it in 10ms, each step takes 60ms and the GPU is idle for 50ms of it, roughly 83% of the time.
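The arithmetic can be checked with a small estimator. This is a sketch under an idealized assumption: with workers, loading throughput scales linearly and overlaps perfectly with GPU compute. Real pipelines fall short of that, so treat the numbers as an upper bound on the benefit. The function name gpu_idle_fraction is hypothetical.

```python
def gpu_idle_fraction(load_ms: float, compute_ms: float, num_workers: int = 0) -> float:
    """Estimate the share of each training step the GPU spends waiting for data.

    Idealized model (an assumption): workers scale loading throughput linearly
    and loading overlaps perfectly with GPU compute when num_workers > 0.
    """
    if num_workers == 0:
        # Loading and compute run back to back in the main process.
        step_ms = load_ms + compute_ms
    else:
        # Workers deliver one batch every load_ms / num_workers on average;
        # the slower of loading and compute sets the step time.
        step_ms = max(compute_ms, load_ms / num_workers)
    return 1.0 - compute_ms / step_ms


if __name__ == "__main__":
    for workers in (0, 1, 2, 5):
        idle = gpu_idle_fraction(load_ms=50, compute_ms=10, num_workers=workers)
        print(f"num_workers={workers}: GPU idle {idle:.0%}")
```

Under this model, the 50ms/10ms example needs about five workers before loading stops being the limiting factor.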
Rule of thumb: start with num_workers = num_cpus / num_gpus (the CPU cores available per GPU), then benchmark. More workers help when:
- Data lives on network storage (S3, GCS)
- Transformations are CPU-heavy (image decoding, tokenization)
- Storage is NVMe with high IOPS
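The rule of thumb above can be turned into a small helper. This is only a starting value, not a tuned setting, and the function name suggest_num_workers is hypothetical. Note that os.cpu_count() reports the machine's logical CPUs, which may be more than a container or job scheduler actually lets the process use, so pass num_cpus explicitly when you know the real allocation.

```python
import os
from typing import Optional


def suggest_num_workers(num_gpus: int, num_cpus: Optional[int] = None) -> int:
    """Return a starting value for num_workers per GPU process."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if num_cpus is None:
        # os.cpu_count() can return None on some platforms; fall back to 1.
        num_cpus = os.cpu_count() or 1
    # Integer division, but never fewer than one worker.
    return max(1, num_cpus // num_gpus)


if __name__ == "__main__":
    print(suggest_num_workers(num_gpus=4, num_cpus=16))
```

A 16-core machine with 4 GPUs gives 4 workers per GPU process as the first value to benchmark.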
The pin_memory Trap
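pin_memory=True places batches in page-locked (pinned) host memory, which speeds up CPU-to-GPU copies and lets them run asynchronously when the copy uses non_blocking=True. Pinning is not free: the extra copy into pinned memory adds overhead on CPU-bound pipelines. It helps most when host-to-device transfer is a meaningful share of step time, for example large batches feeding a fast GPU. Profile before assuming it helps.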
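One way to profile rather than assume is to time the same pass under two configurations. The harness below is a generic sketch: compare_configs, time_one_pass, and the consume callback are hypothetical names, and the loader factories are supplied by the caller. The PyTorch usage in the trailing comment assumes a dataset that yields single tensors and a CUDA device.

```python
import time
from typing import Any, Callable, Dict, Iterable, Optional

LoaderFactory = Callable[[], Iterable[Any]]


def time_one_pass(make_loader: LoaderFactory,
                  consume: Optional[Callable[[Any], None]] = None) -> float:
    """Build a loader with the factory and time one full pass over it."""
    loader = make_loader()
    start = time.perf_counter()
    for batch in loader:
        if consume is not None:
            consume(batch)  # e.g. copy the batch to the GPU
    return time.perf_counter() - start


def compare_configs(factories: Dict[str, LoaderFactory],
                    consume: Optional[Callable[[Any], None]] = None,
                    repeats: int = 3) -> Dict[str, float]:
    """Return the best-of-`repeats` pass time in seconds for each named config."""
    return {
        name: min(time_one_pass(factory, consume) for _ in range(repeats))
        for name, factory in factories.items()
    }


# Hypothetical PyTorch usage (assumes `dataset`, `torch`, and `DataLoader` exist):
#
# def to_gpu(batch):
#     batch.to("cuda", non_blocking=True)
#     torch.cuda.synchronize()  # wait for the copy so it counts in the timing
#
# results = compare_configs({
#     "pinned":   lambda: DataLoader(dataset, batch_size=32, num_workers=4, pin_memory=True),
#     "unpinned": lambda: DataLoader(dataset, batch_size=32, num_workers=4, pin_memory=False),
# }, consume=to_gpu)
```

The consume callback matters here: pinning affects the host-to-device copy, so a pass that never moves batches to the GPU would not show its effect.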
# Benchmark data loading performance
import time
loader = DataLoader(dataset, batch_size=32, num_workers=4)
start = time.perf_counter()
for batch in loader:
    pass
elapsed = time.perf_counter() - start
print(f"Total time: {elapsed:.2f}s")
print(f"Time per batch: {elapsed / len(loader):.4f}s")
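Timing the loader alone shows how fast it can deliver batches, but not whether training actually waits on it. A complementary sketch is to measure, inside the training loop, what fraction of wall-clock time is spent blocked on the next batch. The names data_wait_fraction, slow_loader, and fake_step are hypothetical; the demo uses sleeps as stand-ins for loading and compute. On a GPU, the step function would need to synchronize (for example with torch.cuda.synchronize()) for the split to be meaningful, because CUDA kernels launch asynchronously.

```python
import time
from typing import Any, Callable, Iterable


def data_wait_fraction(loader: Iterable[Any], train_step: Callable[[Any], None]) -> float:
    """Return the fraction of wall-clock time spent waiting for the next batch."""
    wait_s = 0.0
    total_start = time.perf_counter()
    batches = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(batches)  # blocks until the loader has a batch ready
        except StopIteration:
            break
        wait_s += time.perf_counter() - t0
        train_step(batch)
    total_s = time.perf_counter() - total_start
    return wait_s / total_s if total_s > 0 else 0.0


if __name__ == "__main__":
    def slow_loader(n=20, delay_s=0.02):
        for i in range(n):
            time.sleep(delay_s)  # stand-in for disk reads and decoding
            yield i

    def fake_step(batch, delay_s=0.005):
        time.sleep(delay_s)  # stand-in for the forward/backward pass

    frac = data_wait_fraction(slow_loader(), fake_step)
    print(f"Waiting on data: {frac:.0%} of wall-clock time")
```

A high fraction points at the pipeline; a low one suggests that adding workers will not help much.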
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Run nvidia-smi during training for 60 seconds and calculate the average GPU utilization. If it is below 80%, the data pipeline is likely starving the GPU. Double num_workers and re-benchmark.
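The exercise can be scripted instead of read off the screen. The sketch below polls nvidia-smi once per second using its CSV query mode and averages the readings; the function names are hypothetical, and the code assumes nvidia-smi is on the PATH and prints one utilization value per GPU. Lines that are not plain integers (some GPUs report the field as unsupported) are skipped.

```python
import subprocess
import time
from typing import List

QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(output: str) -> List[int]:
    """Parse nvidia-smi output: one utilization percentage per GPU, per line."""
    return [int(line.strip()) for line in output.splitlines() if line.strip().isdigit()]


def mean_utilization(samples: List[List[int]]) -> float:
    """Average utilization across all samples and all GPUs."""
    values = [v for sample in samples for v in sample]
    return sum(values) / len(values) if values else 0.0


def sample_gpu_utilization(duration_s: float = 60.0, interval_s: float = 1.0) -> float:
    """Poll nvidia-smi for duration_s seconds and return the mean utilization (%)."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
        samples.append(parse_utilization(result.stdout))
        time.sleep(interval_s)
    return mean_utilization(samples)
```

Call sample_gpu_utilization() from a second terminal while training runs, record the result, then repeat after changing num_workers so the two numbers can be compared.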