Distributed Training
Distributed training splits the work of training a neural network across multiple GPUs or machines, using techniques like data parallelism (each GPU trains on a subset of data and syncs gradients) or model parallelism (each GPU holds a slice of the model). For local AI operators, this matters because training large models (e.g., Llama 3.1 70B) on a single consumer GPU is impossible due to VRAM limits—distributed training across multiple GPUs or machines is the only way to fit the model and data. However, it requires high-speed interconnects (NVLink, InfiniBand) and orchestration software (DeepSpeed, FSDP, torch.distributed) that most consumer rigs lack.
Deeper dive
Distributed training is the standard method for training models too large for one GPU. The most common form is data parallel: each GPU holds a full copy of the model, processes a different batch of data, then averages gradients. This scales well but requires each GPU to have enough VRAM for the entire model. For models that don't fit, model parallelism splits layers across GPUs (pipeline parallelism) or even splits individual tensors (tensor parallelism). ZeRO (Zero Redundancy Optimizer) from DeepSpeed reduces memory by partitioning optimizer states, gradients, and parameters across GPUs. All these methods require frequent communication—gradient sync after every step—so network bandwidth is critical. Consumer setups with Ethernet (1-10 Gbps) are far slower than NVLink (600 GB/s) or InfiniBand (400 Gbps), making distributed training on a home cluster impractical for large models. For small models (e.g., fine-tuning a 7B model on two RTX 3090s via data parallelism), it can work with acceptable overhead.
Practical example
Training Llama 3.1 70B from scratch requires 140 GB of VRAM for the model alone (FP16). A single RTX 4090 has 24 GB. To train it, you'd need at least 6 RTX 4090s with data parallelism (each holds a full copy, so 6×24=144 GB total VRAM, but each copy still needs 140 GB—doesn't fit). Instead, model parallelism splits the 70B model across 6 GPUs: each holds ~23 GB of parameters, plus gradients and optimizer states (46 GB per GPU). That fits. But the GPUs must communicate every layer—NVLink reduces latency; Ethernet would be too slow.
Workflow example
In practice, an operator fine-tuning a 7B model on two RTX 3090s might use Hugging Face Transformers with torchrun --nproc_per_node=2 train.py. The script uses DistributedDataParallel (DDP) to replicate the model on each GPU, split the batch, and sync gradients. The operator sets per_device_train_batch_size=1 to fit VRAM (each GPU holds the full 7B model ~14 GB in FP16). They monitor GPU utilization with nvidia-smi—if one GPU is idle, the bottleneck is gradient sync over PCIe. For larger models, they'd switch to DeepSpeed ZeRO-3: deepspeed --num_gpus=4 train.py --deepspeed ds_config.json.
Reviewed by Fredoline Eruo. See our editorial policy.