07. Data Parallelism
Data parallelism replicates the model on every GPU and splits each batch across the replicas. It's the most common distributed strategy because it works with any model that fits on a single GPU.
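To make "splitting batches" concrete, here is a minimal pure-Python sketch of how the sample indices of one global batch could be divided between ranks. The helper name `shard_indices` is hypothetical, and the strided assignment is an assumption chosen to resemble what PyTorch's `DistributedSampler` does; the sketch also assumes the batch divides evenly across ranks.

```python
def shard_indices(indices, rank, world_size):
    """Return the strided slice of `indices` that `rank` would process."""
    if len(indices) % world_size != 0:
        # Real samplers pad or drop samples; this sketch simply refuses.
        raise ValueError("this sketch assumes the batch divides evenly across ranks")
    return indices[rank::world_size]


global_batch = list(range(8))  # indices of 8 samples in one global batch
world_size = 2                 # number of GPUs / processes

shards = [shard_indices(global_batch, r, world_size) for r in range(world_size)]
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard}")
# rank 0: [0, 2, 4, 6]
# rank 1: [1, 3, 5, 7]
```

Every sample lands on exactly one rank, so the replicas together still cover the whole batch.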
DDP Fundamentals
DistributedDataParallel (DDP) keeps one model replica per GPU and synchronizes gradients across the replicas with all-reduce operations during the backward pass:
import os

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# setup_distributed, cleanup_distributed, build_model, train_loader, loss_fn,
# optimizer, and log_metrics are assumed to be defined elsewhere in the project.
# The optimizer must be built over the parameters of the model created below.

def train_with_ddp(config):
    setup_distributed()  # expected to initialize the process group
    local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher, e.g. torchrun

    model = build_model(config).cuda(local_rank)
    # DDP broadcasts rank 0's parameters at construction, so
    # all processes see the same initialization
    model = DDP(model, device_ids=[local_rank])

    # Forward pass: each GPU processes its own shard of the global batch
    # Backward pass: gradients are averaged across GPUs via all-reduce
    # Optimizer step: identical on all GPUs
    for epoch in range(config.epochs):
        for batch in train_loader:
            inputs = batch["input"].cuda(local_rank)
            targets = batch["target"].cuda(local_rank)

            outputs = model(inputs)
            loss = loss_fn(outputs, targets)
            loss.backward()  # Gradients synchronized automatically
            optimizer.step()
            optimizer.zero_grad()

            # Only global rank 0 logs to avoid duplicate entries
            # (local_rank is 0 on every node in a multi-node job)
            if dist.get_rank() == 0:
                log_metrics({"loss": loss.item()})

    cleanup_distributed()
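Why the gradient all-reduce keeps the replicas identical can be checked without any GPUs. The sketch below simulates two ranks in plain Python on a toy one-parameter model. The names `grad_mse` and `all_reduce_mean` are illustrative stand-ins, not DDP internals, and the result relies on two assumptions: every rank holds an equal-sized shard, and the loss is a mean over samples.

```python
# Toy model: y_hat = w * x, loss = mean((w * x - y) ** 2) over the samples.

def grad_mse(w, xs, ys):
    """Gradient of the mean squared error with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def all_reduce_mean(values):
    """Stand-in for an averaging all-reduce: every rank receives the mean."""
    mean = sum(values) / len(values)
    return [mean for _ in values]


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Two simulated ranks, each holding an equal-sized shard of the global batch.
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]
local_grads = [grad_mse(w, sx, sy) for sx, sy in shards]
synced_grads = all_reduce_mean(local_grads)

# Reference: the gradient a single process would compute on the full batch.
full_batch_grad = grad_mse(w, xs, ys)

print("local gradients: ", local_grads)
print("after all-reduce:", synced_grads)
print("full-batch grad: ", full_batch_grad)
```

After the simulated all-reduce, both ranks hold the same gradient, and it matches the full-batch gradient. Since every replica then applies the same update to the same weights, the optimizer step is identical on all GPUs.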
Gradient Bucketing
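DDP groups gradients into buckets and starts the all-reduce for a bucket as soon as its gradients are ready, which overlaps communication with the rest of the backward pass. The bucket size affects performance: buckets that are too small incur excessive per-call communication overhead, while buckets that are too large waste memory and delay the first all-reduce, which reduces the overlap: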
# bucket_cap_mb defaults to 25 (MiB); tune for your network bandwidth
model = DDP(model, device_ids=[local_rank], bucket_cap_mb=50)
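To get a feel for what a given `bucket_cap_mb` means for a model, a rough bucket count can be estimated from the parameter count. This is a back-of-the-envelope sketch, not DDP's actual bucketing logic: the helper `estimate_bucket_count` is hypothetical, it assumes one gradient element per parameter at a fixed width (4 bytes for fp32), and it ignores per-parameter boundaries and any specially sized first bucket.

```python
import math


def estimate_bucket_count(num_params, bytes_per_param=4, bucket_cap_mb=25):
    """Rough number of gradient buckets for a model (approximation only)."""
    grad_bytes = num_params * bytes_per_param
    cap_bytes = bucket_cap_mb * 1024 * 1024
    return max(1, math.ceil(grad_bytes / cap_bytes))


# Example: a 100M-parameter fp32 model (about 400 MB of gradients).
for cap in (10, 25, 50, 100):
    count = estimate_bucket_count(100_000_000, bucket_cap_mb=cap)
    print(f"bucket_cap_mb={cap:>3}: about {count} buckets")
```

Fewer buckets mean fewer all-reduce calls per step, but each call starts later in the backward pass. The right setting depends on the interconnect and is best confirmed by measuring step time.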
Performance Metrics
Monitor these to diagnose DDP issues:
# GPU utilization, as a rough proxy for effective throughput
nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv
# Communication time (should be < 10% of step time)
# Check for "NCCL timeout" errors in logs
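One rough way to check the 10% guideline is to time training steps twice: once normally, and once with gradient synchronization skipped (for example inside DDP's `no_sync()` context manager). The difference approximates the communication time that is not hidden behind computation. The helper `exposed_comm_fraction` is hypothetical and the timings below are placeholders, not measurements.

```python
def exposed_comm_fraction(step_time_synced, step_time_unsynced):
    """Fraction of a synced step spent on non-overlapped communication.

    Both arguments are average step times in seconds, measured after warm-up.
    """
    if step_time_synced <= 0:
        raise ValueError("step_time_synced must be positive")
    # Clamp at zero: measurement noise can make the unsynced step look slower.
    exposed = max(0.0, step_time_synced - step_time_unsynced)
    return exposed / step_time_synced


# Placeholder timings in seconds; substitute your own measurements.
fraction = exposed_comm_fraction(step_time_synced=0.110, step_time_unsynced=0.100)
print(f"exposed communication: {fraction:.1%} of step time")
if fraction > 0.10:
    print("above the 10% guideline; consider tuning bucket size or checking the interconnect")
```

This only captures communication that stalls the step. Communication that overlaps fully with the backward pass does not show up, which is the quantity that matters for throughput.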
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Train a model with DDP on 2 GPUs and measure throughput per GPU. Compare it to single-GPU throughput. The ratio of total 2-GPU throughput to single-GPU throughput should approach 2x; a significantly lower ratio indicates a communication bottleneck.
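The comparison in this exercise reduces to a few ratios. Here is a minimal sketch for recording them, assuming throughput is measured in samples per second. The helper `scaling_report` is hypothetical and the numbers are placeholders, not measurements.

```python
def scaling_report(single_gpu_sps, multi_gpu_sps, num_gpus):
    """Summarize DDP scaling from measured throughputs (samples per second)."""
    speedup = multi_gpu_sps / single_gpu_sps
    return {
        "per_gpu_throughput": multi_gpu_sps / num_gpus,
        "speedup": speedup,                # ideal value: num_gpus
        "efficiency": speedup / num_gpus,  # ideal value: 1.0
    }


# Placeholder numbers; replace with your own measurements.
report = scaling_report(single_gpu_sps=1000.0, multi_gpu_sps=1800.0, num_gpus=2)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

With these placeholder inputs the speedup is 1.8x and the efficiency is 0.9, which means each GPU delivers 90% of its single-GPU throughput.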