Data Parallelism
Data parallelism is a distributed training strategy in which a model is replicated across multiple devices (GPUs or nodes) and each replica processes a different subset of the training data in parallel. Gradients from all replicas are averaged after each step so that every replica applies the same weight update. For operators running local AI, data parallelism matters when training or fine-tuning models on multiple GPUs: throughput can scale close to linearly with device count, but only if the interconnect (e.g., NVLink) is fast enough to keep gradient synchronization from becoming a bottleneck. It does not reduce the per-device memory footprint, because each GPU holds a full copy of the model.
Deeper dive
In data parallelism, each device maintains a complete copy of the model parameters. During training, each global batch is split into equal per-device shards (local batches), one per device. After the forward and backward passes, gradients are synchronized (e.g., via all-reduce) and averaged. The optimizer then applies the same update on every device, so the replicas stay identical. Variants include synchronous data parallelism (the standard approach) and asynchronous data parallelism, in which replicas update without waiting for each other and may therefore apply stale gradients. For operators, the key trade-off is between compute scaling and communication overhead. On a multi-GPU rig with fast interconnects (e.g., two RTX 3090s bridged with NVLink), data parallelism can achieve near-linear speedup. The RTX 4090 has no NVLink connector, so a 4× RTX 4090 rig synchronizes over PCIe. With slower links (e.g., Ethernet between nodes), communication can dominate each step and make scaling inefficient. Tools such as PyTorch DDP (DistributedDataParallel) and Hugging Face Accelerate implement data parallelism for training, and vLLM supports it for inference. It is distinct from model parallelism, which splits the model itself across devices.
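The synchronous step described above can be illustrated with a toy simulation in plain Python. This is a minimal sketch, not how PyTorch DDP is implemented: the one-parameter model, the data, and the helper names (grad, all_reduce_mean, data_parallel_step) are all made up for illustration, and the "devices" are just list entries. It shows that averaging gradients from equal-sized shards gives the same update as computing the gradient over the whole batch on one device.

```python
# Toy simulation of one synchronous data-parallel step (no GPUs involved).
# Model: y = w * x with mean squared error, so dL/dw = mean(2 * (w*x - y) * x).

def grad(w, batch):
    """Mean-squared-error gradient w.r.t. w over a list of (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every replica receives the same average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)

def data_parallel_step(weights, global_batch, lr):
    """Shard the global batch evenly, compute local gradients, average, update."""
    n = len(weights)                   # number of replicas ("devices")
    shard = len(global_batch) // n     # assumes the batch divides evenly
    local = [grad(weights[r], global_batch[r * shard:(r + 1) * shard])
             for r in range(n)]
    synced = all_reduce_mean(local)
    return [w - lr * g for w, g in zip(weights, synced)]

# Hypothetical data: 8 samples, split 4 + 4 across two replicas.
global_batch = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5),
                (5.0, 9.0), (6.0, 12.5), (7.0, 13.0), (8.0, 16.5)]

replicas = data_parallel_step([0.5, 0.5], global_batch, lr=0.01)

# Reference: the same step computed on a single device over the full batch.
single = 0.5 - 0.01 * grad(0.5, global_batch)
```

After the step, both entries of replicas hold the same weight, and it agrees with single up to floating-point rounding. The equivalence relies on the shards being the same size; with uneven shards a plain average of local gradients would weight samples unequally.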
Practical example
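Fine-tuning Llama 3.1 8B on two RTX 3090s (24 GB each) using PyTorch DDP: each GPU holds the full 8B model (~16 GB of weights in FP16). That leaves roughly 8 GB per GPU for gradients, optimizer state, and activations, so in practice this setup calls for a parameter-efficient method such as LoRA rather than full fine-tuning. With a global batch size of 8, each GPU processes 4 samples per step. After the backward pass, gradients are all-reduced across the GPUs. Throughput roughly doubles compared to a single GPU, while the loss curve matches a single-GPU run with the same global batch size and data order, up to floating-point differences. If the GPUs are linked by PCIe instead of an NVLink bridge, or sit in separate machines connected by Ethernet, communication overhead may reduce the speedup to ~1.5×.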
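The numbers in this example can be checked with a few lines of arithmetic. The sketch below assumes a rounded parameter count of 8 billion and decimal gigabytes. The ring all-reduce traffic formula describes one common all-reduce algorithm and is included as an assumption about how synchronization might be carried out, not as a statement of what any particular backend does. The last figure applies only if full FP16 gradients are synchronized.

```python
# Back-of-the-envelope numbers for the two-GPU example above.
PARAMS = 8e9          # ~8 billion parameters (rounded; assumption)
BYTES_FP16 = 2        # bytes per FP16 value
GPU_MEMORY_GB = 24    # RTX 3090
WORLD_SIZE = 2        # number of GPUs, one replica each
GLOBAL_BATCH = 8

def gb(n_bytes):
    """Convert bytes to decimal gigabytes."""
    return n_bytes / 1e9

weights_gb = gb(PARAMS * BYTES_FP16)         # FP16 weights per replica
headroom_gb = GPU_MEMORY_GB - weights_gb     # left for everything else
per_gpu_batch = GLOBAL_BATCH // WORLD_SIZE   # samples per GPU per step

def ring_all_reduce_gb(payload_gb, world_size):
    """Data each GPU sends in a ring all-reduce: 2 * (N - 1) / N * payload."""
    return 2 * (world_size - 1) / world_size * payload_gb

# Traffic per step if a full set of FP16 gradients were synchronized.
full_grad_traffic_gb = ring_all_reduce_gb(weights_gb, WORLD_SIZE)

print(weights_gb, headroom_gb, per_gpu_batch, full_grad_traffic_gb)
```

With these inputs the weights come to 16 GB, the headroom to 8 GB, and the per-GPU batch to 4. The traffic estimate shows why interconnect bandwidth matters: the amount of data exchanged per step is on the order of the gradient size itself.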
Workflow example
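With Hugging Face Accelerate, enable data parallelism by passing --num_processes 2 to accelerate launch (for example, accelerate launch --num_processes 2 train.py, where train.py stands for your own training script). Accelerate starts one process per GPU, each process works on a different slice of the data, and gradients are synchronized automatically. In vLLM, data parallelism is used to raise serving throughput when the model fits on a single GPU: each GPU runs a full model replica and handles a portion of incoming requests. Operators monitor GPU utilization with nvidia-smi and measure the time spent in torch.distributed communication with the PyTorch profiler (torch.profiler).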
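The serving case, where each GPU runs a full replica and handles a portion of incoming requests, can be pictured with a small dispatcher. The sketch below is purely illustrative: ReplicaPool and its round-robin policy are hypothetical and are not vLLM's scheduler or API. It only shows the idea of spreading independent requests across identical replicas.

```python
from itertools import cycle

class ReplicaPool:
    """Hypothetical dispatcher; each replica stands in for a full model copy on one GPU."""

    def __init__(self, num_replicas):
        # Requests routed to each replica, keyed by replica index.
        self.assigned = {r: [] for r in range(num_replicas)}
        self._next = cycle(range(num_replicas))

    def submit(self, request_id):
        """Route the request to the next replica in round-robin order."""
        replica = next(self._next)
        self.assigned[replica].append(request_id)
        return replica

pool = ReplicaPool(num_replicas=2)
routes = [pool.submit(f"req-{i}") for i in range(5)]
```

Because the replicas share no state, requests never need to communicate with each other, which is why data-parallel serving adds throughput without the gradient synchronization cost seen in training. A real server would more likely route by current load than by strict rotation.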
Reviewed by Fredoline Eruo.