RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

© 2026 runlocalai.co · Independently operated
Hardware & infrastructure

ZeRO optimizer

ZeRO (Zero Redundancy Optimizer) is a memory optimization technique for distributed training of large models. It partitions optimizer states, gradients, and parameters across multiple GPUs instead of replicating them on every GPU, which makes it possible to train models with billions of parameters on clusters with limited per-GPU memory. Operators typically encounter ZeRO through DeepSpeed, often via its Hugging Face Transformers integration, when training or fine-tuning models such as Llama 2 70B across multiple GPUs: it lets the model fit without every GPU holding a full copy of the parameters, gradients, and optimizer states.

Deeper dive

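ZeRO operates in three cumulative stages. Stage 1 partitions optimizer states (e.g., Adam momentum and variance, plus the FP32 master weights in mixed-precision training) across GPUs, reducing model-state memory per GPU by up to 4x. Stage 2 additionally partitions gradients, for up to an 8x reduction. Stage 3 also partitions the model parameters themselves, so each GPU holds only a fraction of the parameters at any time and gathers the rest on demand during the forward and backward passes; the reduction then scales linearly with the number of GPUs. This is what makes it possible to train models with hundreds of billions of parameters on hundreds of GPUs.

For operators, ZeRO is typically used via DeepSpeed; PyTorch FSDP (Fully Sharded Data Parallel) implements the same sharding idea natively in PyTorch. The trade-off is communication overhead: Stage 3 in particular adds traffic for gathering parameters, which can slow training if interconnect bandwidth is limited. Partitioning needs more than one GPU, so on a single consumer GPU the stages themselves bring no savings; the related offload features (ZeRO-Offload, ZeRO-Infinity) can still move optimizer states or parameters to CPU RAM or NVMe in that case.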
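The stage-by-stage savings can be estimated with a few lines of arithmetic. The sketch below assumes mixed-precision Adam at 16 bytes per parameter (2 for FP16 weights, 2 for FP16 gradients, 12 for FP32 optimizer state) and counts model states only; activations, buffers, and fragmentation are ignored. The function name and its parameters are illustrative and not part of any library.

```python
def zero_model_state_gb(params_billion, num_gpus, stage, optimizer_bytes=12):
    """Estimate per-GPU model-state memory in GB (1 GB = 1e9 bytes).

    Assumes FP16 weights and gradients (2 bytes each per parameter) and
    `optimizer_bytes` of optimizer state per parameter (12 for Adam with
    FP32 master weights, momentum, and variance). Activations are ignored.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0 (no ZeRO), 1, 2, or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    # Billions of parameters times bytes per parameter gives GB directly.
    weights = 2.0 * params_billion
    grads = 2.0 * params_billion
    optim = optimizer_bytes * params_billion

    # Each stage shards one more component across the GPUs.
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        weights /= num_gpus
    return weights + grads + optim


if __name__ == "__main__":
    # Hypothetical cluster: a 7.5B-parameter model on 64 GPUs.
    for stage in range(4):
        gb = zero_model_state_gb(7.5, 64, stage)
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For this hypothetical 7.5B-parameter model on 64 GPUs, the estimate drops from 120 GB per GPU without ZeRO to about 31.4 GB (Stage 1), 16.6 GB (Stage 2), and 1.9 GB (Stage 3), which is where the "up to 4x" and "up to 8x" figures come from as the GPU count grows.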

Practical example

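Llama 2 70B occupies about 140 GB in FP16, so no single RTX 4090 (24 GB) can hold even the weights, let alone train them. With ZeRO Stage 3 across 8× RTX 4090, each GPU stores ~17.5 GB of FP16 parameters (140/8). That covers the weights only, though. With mixed-precision Adam, model states total roughly 16 bytes per parameter (2 for FP16 weights, 2 for FP16 gradients, 12 for FP32 master weights, momentum, and variance), which is about 1,120 GB for 70B parameters, or about 140 GB per GPU when sharded eight ways. Full fine-tuning therefore does not fit in 8× 24 GB with Stage 3 alone: it needs more GPUs, offloading of optimizer states and parameters to CPU RAM or NVMe (ZeRO-Offload, ZeRO-Infinity), or a parameter-efficient method such as LoRA.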
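A quick budget for this scenario can be worked out in code. The sketch below assumes mixed-precision Adam (2 bytes per parameter for FP16 weights, 2 for FP16 gradients, 12 for FP32 optimizer state) with everything sharded evenly under Stage 3, and it ignores activations and framework overhead, so real usage would be higher. All names are illustrative.

```python
import math

PARAMS = 70e9   # Llama 2 70B parameter count (approximate)
GPUS = 8        # 8x RTX 4090
GPU_GB = 24     # memory per GPU in GB (1 GB = 1e9 bytes)

# Assumed bytes per parameter for each model-state component.
BYTES_PER_PARAM = {
    "fp16 parameters": 2,
    "fp16 gradients": 2,
    "optimizer states (fp32 master, momentum, variance)": 12,
}


def per_gpu_budget_gb(params, gpus, bytes_per_param):
    """Per-GPU share of each component when sharded evenly (Stage 3)."""
    return {name: params * b / gpus / 1e9 for name, b in bytes_per_param.items()}


budget = per_gpu_budget_gb(PARAMS, GPUS, BYTES_PER_PARAM)
total_gb = sum(budget.values())

# Smallest GPU count at which the sharded model states alone would fit.
min_gpus = math.ceil(PARAMS * sum(BYTES_PER_PARAM.values()) / (GPU_GB * 1e9))

if __name__ == "__main__":
    for name, gb in budget.items():
        print(f"{name}: {gb:.1f} GB per GPU")
    print(f"total model states: {total_gb:.1f} GB per GPU (capacity {GPU_GB} GB)")
    print(f"GPUs needed for model states alone at {GPU_GB} GB each: {min_gpus}")
```

Under these assumptions the per-GPU share is 17.5 GB of parameters, 17.5 GB of gradients, and 105 GB of optimizer states, about 140 GB in total. Model states alone would need roughly 47 GPUs of 24 GB each, which is why offloading or parameter-efficient fine-tuning matters on a small consumer cluster.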

Workflow example

When using Hugging Face Transformers with DeepSpeed, an operator configures ZeRO in a JSON file (e.g., zero_config.json) and passes it to the training script: deepspeed --num_gpus=8 train.py --deepspeed zero_config.json. This assumes train.py accepts a --deepspeed argument, as scripts built on the Transformers Trainer do. The config sets the ZeRO stage (e.g., "zero_optimization": {"stage": 3}). During training, the runtime partitions model states across GPUs automatically, and operators can watch nvidia-smi to confirm that per-GPU memory consumption has dropped.
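A minimal sketch of generating such a config file from Python follows. The keys shown are standard DeepSpeed config options, but the particular selection is an assumption rather than a recommended setup, and the "auto" values are only resolved when the file is loaded through the Hugging Face Trainer integration; with plain DeepSpeed they would need concrete numbers. The helper name write_zero_config is hypothetical.

```python
import json
from pathlib import Path

# Illustrative ZeRO Stage 3 config; tune or extend for a real run.
zero_config = {
    "zero_optimization": {
        "stage": 3,                    # shard optimizer states, gradients, and parameters
        "overlap_comm": True,          # overlap communication with computation
        "contiguous_gradients": True,  # reduce memory fragmentation during backward
    },
    "bf16": {"enabled": "auto"},                 # filled in by the HF Trainer
    "train_micro_batch_size_per_gpu": "auto",    # filled in by the HF Trainer
    "gradient_accumulation_steps": "auto",       # filled in by the HF Trainer
}


def write_zero_config(path="zero_config.json", config=zero_config):
    """Write the config as JSON and return the path it was written to."""
    out = Path(path)
    out.write_text(json.dumps(config, indent=2))
    return out


if __name__ == "__main__":
    print(f"wrote {write_zero_config()}")
```

The resulting file can then be passed with --deepspeed zero_config.json as in the command above.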

Reviewed by Fredoline Eruo. See our editorial policy.
