ZeRO optimizer

ZeRO (Zero Redundancy Optimizer) is a memory optimization technique for distributed training of large models. Instead of replicating all training state on every GPU, it partitions optimizer states, gradients, and (at its highest stage) parameters across the GPUs, which makes it possible to train models with billions of parameters on clusters with limited per-GPU memory. Operators typically encounter ZeRO through DeepSpeed, either directly or via the Hugging Face Transformers integration, when training models like Llama 2 70B across multiple GPUs, because no single GPU then has to hold a full copy of every parameter.

Deeper dive

ZeRO operates in three stages. Stage 1 partitions optimizer states (e.g., Adam momentum and variance) across GPUs, reducing model-state memory per GPU by up to 4x. Stage 2 additionally partitions gradients, for a reduction of up to 8x. Stage 3 partitions the model parameters themselves, so each GPU holds only a fraction of the parameters at any time and fetches the rest on demand during the forward and backward passes; at this stage the saving grows in proportion to the number of GPUs. This enables training models with hundreds of billions of parameters on hundreds of GPUs. For operators, ZeRO is typically used via DeepSpeed; PyTorch FSDP (Fully Sharded Data Parallel) is a separate implementation of the same sharding idea. The trade-off is increased communication overhead, which can slow training if network bandwidth is limited. On a single GPU there is nothing to partition across, so ZeRO's sharding brings no benefit there; only its offloading variants, which move state to CPU memory, remain useful. The technique is designed for multi-GPU setups.
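
The effect of each stage can be estimated with simple arithmetic. The sketch below is a rough model, not DeepSpeed's own accounting: it assumes mixed-precision Adam at 16 bytes per parameter (2 for FP16 parameters, 2 for FP16 gradients, 12 for FP32 optimizer state) and ignores activations, buffers, and fragmentation. The function name model_state_gb is hypothetical.

```python
# Rough per-GPU memory for model states under mixed-precision Adam.
# Assumption: 2 bytes FP16 params + 2 bytes FP16 grads + 12 bytes FP32
# optimizer state (master weights, momentum, variance) per parameter.
# Activations, buffers, and fragmentation are ignored.

PARAM_BYTES = 2
GRAD_BYTES = 2
OPTIM_BYTES = 12


def model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Return estimated model-state memory per GPU in GB (1 GB = 1e9 bytes)."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = PARAM_BYTES * num_params
    grads = GRAD_BYTES * num_params
    optim = OPTIM_BYTES * num_params
    if stage >= 1:  # Stage 1 shards optimizer states
        optim /= num_gpus
    if stage >= 2:  # Stage 2 also shards gradients
        grads /= num_gpus
    if stage >= 3:  # Stage 3 also shards parameters
        params /= num_gpus
    return (params + grads + optim) / 1e9


if __name__ == "__main__":
    for stage in range(4):
        gb = model_state_gb(7.5e9, 64, stage)
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For a 7.5-billion-parameter model on 64 GPUs, this gives roughly 120, 31.4, 16.6, and 1.9 GB per GPU for stages 0 through 3. The "up to 4x" and "up to 8x" figures are the limits that Stages 1 and 2 approach as the GPU count grows.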

Practical example

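Consider training Llama 2 70B (140 GB of parameters in FP16) on 8× RTX 4090 (24 GB each). Without ZeRO, each GPU would need to hold the full 140 GB of parameters, plus gradients and optimizer states, which is impossible. With ZeRO Stage 3, each GPU stores ~17.5 GB of parameters (140/8). That figure is not the whole budget, though: full fine-tuning with mixed-precision Adam also needs FP16 gradients (another ~17.5 GB per GPU) and FP32 optimizer states (~105 GB per GPU), for roughly 140 GB per GPU in total, far more than 24 GB. In practice this hardware needs ZeRO combined with offloading to CPU RAM or NVMe, more GPUs, or a parameter-efficient method that trains only a small subset of weights.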
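
The per-GPU budget for this example can be checked with a few lines of arithmetic. The sketch assumes full fine-tuning with mixed-precision Adam (about 16 bytes per parameter across parameters, gradients, and optimizer states) and ignores activations; the helper names are hypothetical.

```python
import math

NUM_PARAMS = 70e9       # Llama 2 70B
NUM_GPUS = 8
GPU_MEMORY_GB = 24      # RTX 4090

# Assumed bytes per parameter for full fine-tuning with mixed-precision Adam.
BYTES = {"parameters": 2, "gradients": 2, "optimizer states": 12}


def stage3_budget_gb(num_params, num_gpus):
    """Per-GPU GB for each model state when everything is sharded (Stage 3)."""
    return {name: b * num_params / num_gpus / 1e9 for name, b in BYTES.items()}


def min_gpus_to_fit(num_params, gpu_memory_gb):
    """Fewest GPUs whose combined memory holds all model states (no offload)."""
    total_gb = sum(BYTES.values()) * num_params / 1e9
    return math.ceil(total_gb / gpu_memory_gb)


if __name__ == "__main__":
    budget = stage3_budget_gb(NUM_PARAMS, NUM_GPUS)
    for name, gb in budget.items():
        print(f"{name}: {gb:.1f} GB per GPU")
    total = sum(budget.values())
    print(f"total: {total:.1f} GB per GPU (limit {GPU_MEMORY_GB} GB)")
    print(f"GPUs needed without offload: {min_gpus_to_fit(NUM_PARAMS, GPU_MEMORY_GB)}")
```

Under these assumptions the sharded parameters take 17.5 GB per GPU, but gradients and optimizer states bring the total to about 140 GB per GPU, and at least 47 such 24 GB GPUs would be needed to hold the model states without offloading.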

Workflow example

When using Hugging Face Transformers with DeepSpeed, an operator configures ZeRO in a JSON file (e.g., zero_config.json) and passes it to the training script: deepspeed --num_gpus=8 train.py --deepspeed zero_config.json. The config specifies the ZeRO stage (e.g., "zero_optimization": {"stage": 3}). During training, the runtime automatically partitions model states across GPUs, and operators monitor GPU memory usage via nvidia-smi to verify reduced per-GPU consumption.
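
As a starting point, a config file like the one described above could be generated with a short script. This is a minimal sketch with illustrative values, not a tuned setup: the "auto" entries rely on the Hugging Face Trainer integration filling them in from its own arguments, the CPU offload entries are optional, and the write_config helper is hypothetical.

```python
import json

# Illustrative values only; "auto" lets the Hugging Face Trainer fill in
# the value from its own training arguments.
zero_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        # Optional: offload to CPU RAM when sharded states do not fit on the GPU.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}


def write_config(path="zero_config.json"):
    """Write the config dict as JSON and return the path."""
    with open(path, "w") as f:
        json.dump(zero_config, f, indent=2)
    return path


if __name__ == "__main__":
    print("wrote", write_config())
```

Running the script writes zero_config.json to the current directory, ready to pass via the --deepspeed flag. Offloading trades GPU memory for slower steps, so it is worth removing those two entries when the sharded states already fit.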

Reviewed by Fredoline Eruo. See our editorial policy.

When it doesn't work