Data Parallelism — Custom Training Pipelines (Chapter 7)

Data parallelism places a full replica of the model on every GPU and gives each replica a different slice of every batch. It's the most common distributed strategy because it works with any model that fits on a single GPU.
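
To make the batch split concrete, here is a minimal sketch of how sample indices might be divided among ranks. The helper name shard_indices is hypothetical; the strided, padded split is loosely modeled on what torch.utils.data.DistributedSampler does with shuffling disabled, and is not a copy of its implementation.

```python
import math


def shard_indices(num_samples, world_size, rank):
    """Return the sample indices one rank would process (hypothetical helper)."""
    if num_samples <= 0 or world_size <= 0:
        raise ValueError("num_samples and world_size must be positive")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Round the dataset size up so every rank receives the same number of samples.
    total = math.ceil(num_samples / world_size) * world_size
    # Pad by wrapping around to the start of the dataset.
    padded = [i % num_samples for i in range(total)]
    # Strided split: rank r takes positions r, r + world_size, r + 2 * world_size, ...
    return padded[rank::world_size]


if __name__ == "__main__":
    for r in range(4):
        print(r, shard_indices(10, 4, r))
```

Equal shard sizes matter because every rank must run the same number of steps. If one rank ran out of batches early, the others would block waiting for it in the next gradient all-reduce.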

DDP Fundamentals

DistributedDataParallel (DDP) keeps one model replica per process and averages gradients across replicas with all-reduce operations during the backward pass:

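import os

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# setup_distributed, cleanup_distributed, build_model, log_metrics, loss_fn,
# optimizer, and train_loader are assumed to be defined elsewhere.
# train_loader is assumed to use a DistributedSampler so that each rank
# receives a different shard of the data.

def train_with_ddp(config):
    setup_distributed()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # Bind this process to its own GPU

    model = build_model(config).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Construction: DDP broadcasts rank 0's parameters, so all replicas start identical
    # Forward pass: each GPU processes its own shard of the global batch
    # Backward pass: gradients are averaged across GPUs via all-reduce
    # Optimizer step: identical on all GPUs because the gradients are identical

    for epoch in range(config.epochs):
        for batch in train_loader:
            inputs = batch["input"].cuda(local_rank)
            targets = batch["target"].cuda(local_rank)

            outputs = model(inputs)
            loss = loss_fn(outputs, targets)

            loss.backward()  # Gradients synchronized automatically
            optimizer.step()
            optimizer.zero_grad()

            # Only global rank 0 logs to avoid duplicate entries.
            # LOCAL_RANK is 0 once per node, so it would log once per node.
            if dist.get_rank() == 0:
                log_metrics({"loss": loss.item()})

    cleanup_distributed()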
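
The gradient synchronization in the backward pass can be illustrated without any GPUs. The sketch below simulates an averaging all-reduce over plain Python lists; all_reduce_mean is a hypothetical name, and the code shows only the arithmetic result, not how NCCL actually moves data between devices.

```python
def all_reduce_mean(per_rank_grads):
    """Simulate an averaging all-reduce (hypothetical helper).

    per_rank_grads holds one gradient vector per rank. Every rank ends up
    with the same element-wise mean, which is why the subsequent optimizer
    step is identical on all replicas.
    """
    if not per_rank_grads:
        raise ValueError("need at least one rank")
    world_size = len(per_rank_grads)
    length = len(per_rank_grads[0])
    if any(len(g) != length for g in per_rank_grads):
        raise ValueError("all ranks must hold gradients of the same shape")
    # Sum each element across ranks, then divide by the number of ranks.
    mean = [sum(g[i] for g in per_rank_grads) / world_size for i in range(length)]
    # Each rank receives its own copy of the averaged gradient.
    return [list(mean) for _ in range(world_size)]


if __name__ == "__main__":
    print(all_reduce_mean([[1.0, 2.0], [3.0, 6.0]]))
```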

Gradient Bucketing

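DDP groups gradients into buckets so that the all-reduce for one bucket can overlap with the rest of the backward pass. The bucket size affects performance: buckets that are too small launch many small all-reduce calls with high per-call overhead, while buckets that are too large delay the start of communication and reduce the overlap with computation: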

# bucket_cap_mb defaults to 25 (MiB); tune it for your model size and interconnect bandwidth
model = DDP(model, device_ids=[local_rank], bucket_cap_mb=50)
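
To show how the bucket cap shapes communication, the following sketch greedily packs parameters into buckets in reverse order, which is roughly the order their gradients become ready during backward. The function assign_buckets and its inputs are hypothetical, and real DDP bucketing has further details that this simplification ignores.

```python
def assign_buckets(param_sizes_mb, bucket_cap_mb=25.0):
    """Greedily group parameter indices into buckets (hypothetical helper).

    param_sizes_mb lists parameter sizes in model-definition order.
    Returns a list of buckets, each a list of parameter indices.
    """
    buckets = []
    current, current_size = [], 0.0
    # Walk in reverse: gradients of the last layers are produced first.
    for index in reversed(range(len(param_sizes_mb))):
        size = param_sizes_mb[index]
        # Close the current bucket if adding this parameter would exceed the cap.
        if current and current_size + size > bucket_cap_mb:
            buckets.append(current)
            current, current_size = [], 0.0
        current.append(index)
        current_size += size
    if current:
        buckets.append(current)
    return buckets


if __name__ == "__main__":
    sizes = [10, 10, 10, 10, 30]
    print(assign_buckets(sizes, bucket_cap_mb=25))
    print(assign_buckets(sizes, bucket_cap_mb=50))
```

With these made-up sizes, raising the cap from 25 to 50 reduces three buckets to two, so fewer all-reduce calls are issued but the first one cannot start until more gradients are ready.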

Performance Metrics

Monitor these to diagnose DDP issues:

# GPU and memory utilization (a proxy for throughput, not a direct measurement)
nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv

# Communication time: as a rule of thumb, it should stay under 10% of step time
# Also check the logs for "NCCL timeout" errors
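
Once per-step timings are available, for example from a profiler trace, the 10% guideline can be checked with a few lines of arithmetic. This is a minimal sketch: comm_fraction and exceeds_budget are hypothetical helpers, and the timings are assumed to have been collected separately, in seconds.

```python
def comm_fraction(step_times_s, comm_times_s):
    """Fraction of total step time spent in communication (hypothetical helper)."""
    if not step_times_s or len(step_times_s) != len(comm_times_s):
        raise ValueError("need one communication time per step time")
    total_step = sum(step_times_s)
    if total_step <= 0:
        raise ValueError("total step time must be positive")
    return sum(comm_times_s) / total_step


def exceeds_budget(step_times_s, comm_times_s, budget=0.10):
    """True if communication takes more than `budget` of step time."""
    return comm_fraction(step_times_s, comm_times_s) > budget


if __name__ == "__main__":
    steps = [0.50, 0.52, 0.49]   # made-up step times
    comms = [0.03, 0.04, 0.03]   # made-up communication times
    print(f"{comm_fraction(steps, comms):.1%}", exceeds_budget(steps, comms))
```

Aggregating over many steps rather than judging a single one helps smooth out warm-up iterations and occasional stragglers.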

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.