RUNLOCALAI v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated

DeepSpeed

DeepSpeed is a deep learning optimization library from Microsoft that reduces memory usage and speeds up training of large models. Its central feature is ZeRO (Zero Redundancy Optimizer), which partitions optimizer states, gradients, and parameters across GPUs so that models with billions of parameters can be trained on limited hardware. For operators running local AI, DeepSpeed matters mainly when fine-tuning large models (e.g., Llama 2 70B) on multi-GPU setups: because each GPU holds only a shard of the training state, larger models or batch sizes fit within the same VRAM.

Deeper dive

DeepSpeed's core innovation is ZeRO, which eliminates memory redundancy across data-parallel processes. ZeRO has three cumulative stages: Stage 1 partitions optimizer states (e.g., Adam's momentum and variance), Stage 2 additionally partitions gradients, and Stage 3 additionally partitions the model parameters themselves. With Stage 3, a model whose training state exceeds a single GPU's memory can still be trained, because each GPU stores only its own shard and gathers the remaining parameters on demand. Offloading to CPU memory or NVMe is a separate, optional feature (ZeRO-Offload and ZeRO-Infinity) that can be layered on top when the combined VRAM is still not enough. DeepSpeed also includes optimized kernels (e.g., for attention) and supports mixed-precision training. For local AI operators, DeepSpeed is most useful when fine-tuning large open-source models on multi-GPU rigs (e.g., 4x RTX 3090). However, it requires PyTorch and is not directly compatible with llama.cpp or Ollama; it is typically used with Hugging Face Transformers or custom training scripts.
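
The stage-by-stage partitioning can be turned into a back-of-the-envelope estimate. The sketch below follows the accounting used in the ZeRO paper and assumes mixed-precision Adam: 2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter. The function name and the 7B-parameter, 4-GPU example are illustrative assumptions, and activations, buffers, and fragmentation are ignored, so real usage will be higher.

```python
# Approximate per-GPU memory for model states under each ZeRO stage.
# Assumption: mixed-precision Adam, i.e. per parameter
#   2 bytes fp16 weights + 2 bytes fp16 gradients
#   + 12 bytes fp32 optimizer state (master weights, momentum, variance).
# Activations and temporary buffers are NOT included.

K_ADAM = 12  # bytes of optimizer state per parameter (assumed)


def zero_model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Return estimated model-state memory per GPU in GB (1 GB = 1e9 bytes)."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    params = 2.0 * num_params
    grads = 2.0 * num_params
    optim = K_ADAM * num_params
    if stage >= 1:  # Stage 1: shard optimizer states
        optim /= num_gpus
    if stage >= 2:  # Stage 2: also shard gradients
        grads /= num_gpus
    if stage >= 3:  # Stage 3: also shard parameters
        params /= num_gpus
    return (params + grads + optim) / 1e9


if __name__ == "__main__":
    # Hypothetical example: a 7B-parameter model on 4 GPUs.
    for stage in range(4):
        gb = zero_model_state_gb(7e9, num_gpus=4, stage=stage)
        print(f"ZeRO stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions a 7B-parameter model on four GPUs drops from 112 GB per GPU without ZeRO to 28 GB per GPU at Stage 3, which is still above a 24 GB card and illustrates why offloading is often enabled on consumer hardware.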

Practical example

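Llama 2 70B needs about 140 GB just to hold its weights in 16-bit precision. Full fine-tuning with AdamW in mixed precision needs far more: roughly 16 bytes per parameter for weights, gradients, and optimizer states, or about 1.1 TB before activations. Four RTX 3090s (24 GB each, 96 GB in total) cannot hold that even when ZeRO Stage 3 shards it evenly, so on such a rig Stage 3 has to be combined with offloading parameters and optimizer states to CPU RAM or NVMe. That in turn requires a large amount of host memory or fast NVMe storage, and it slows training. With the Hugging Face Trainer, the command includes --deepspeed ds_config.json, and the config sets zero_optimization.stage to 3, zero_optimization.offload_optimizer.device to cpu, and zero_optimization.offload_param.device to cpu.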
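
As a rough illustration of such a config, the sketch below builds a Stage 3 configuration with CPU offload and writes it to ds_config.json. The specific choices (bf16, pinned memory, offloading parameters as well as optimizer states) are assumptions for a Hugging Face Trainer run, not settings taken from this entry; the "auto" values are filled in by the Trainer integration and would need concrete numbers in a custom training loop.

```python
import json

# Hypothetical ds_config.json for ZeRO Stage 3 with CPU offload.
# "auto" values are resolved by the Hugging Face Trainer integration;
# replace them with concrete numbers when using DeepSpeed directly.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        # Keep optimizer states in host RAM instead of VRAM.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        # Keep parameter shards in host RAM when they are not in use.
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}


def write_config(path: str = "ds_config.json") -> str:
    """Serialize the config dict to a JSON file and return its path."""
    with open(path, "w") as f:
        json.dump(ds_config, f, indent=2)
    return path


if __name__ == "__main__":
    print("wrote", write_config())
```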

Workflow example

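In a typical fine-tuning workflow, you install DeepSpeed (pip install deepspeed), then launch training with deepspeed --num_gpus=4 train.py --deepspeed ds_config.json. The launcher starts one process per GPU, and the --deepspeed ds_config.json argument is read by the training script (as it is in scripts built on the Hugging Face Trainer). The config file defines the ZeRO stage and offload settings. During training, DeepSpeed prints periodic progress logs at an interval set in the config (steps_per_print). Operators monitoring VRAM usage via nvidia-smi will see each GPU using less memory than in the same run without ZeRO, at the cost of extra inter-GPU communication and, if offloading is enabled, additional PCIe traffic.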
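
To track per-GPU VRAM during a run without reading the full nvidia-smi table, one option is to poll its CSV query mode. The helper below is a minimal sketch; the function names are made up for illustration, and the sample numbers in the usage note are invented rather than measured.

```python
import shutil
import subprocess

# nvidia-smi query that prints one CSV row per GPU: index, used MiB, total MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list:
    """Parse rows like '0, 18342, 24576' into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append(
            {"index": int(index), "used_mib": int(used), "total_mib": int(total)}
        )
    return rows


def read_gpu_memory() -> list:
    """Run nvidia-smi once and return the parsed per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
    else:
        for gpu in read_gpu_memory():
            pct = 100.0 * gpu["used_mib"] / gpu["total_mib"]
            print(f"GPU {gpu['index']}: {gpu['used_mib']} MiB used ({pct:.0f}%)")
```

For example, parse_gpu_memory("0, 18342, 24576\n1, 18290, 24576") returns two entries, one per GPU. Comparing these readings between a run with and without ZeRO is a simple way to confirm the memory reduction on your own hardware.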

Reviewed by Fredoline Eruo. See our editorial policy.
