RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
Glossary / Hardware & infrastructure / FSDP (Fully Sharded Data Parallel)
Hardware & infrastructure

FSDP (Fully Sharded Data Parallel)

FSDP (Fully Sharded Data Parallel) is a distributed training technique that shards model parameters, gradients, and optimizer states across multiple GPUs, reducing per-GPU memory usage. Unlike standard data parallelism, where each GPU holds a full copy of the model, FSDP partitions the model so each GPU stores only a fraction. During forward/backward passes, FSDP gathers the necessary shards on-the-fly, then discards them. This allows training models that would otherwise exceed a single GPU's VRAM. Operators encounter FSDP when fine-tuning large models (e.g., Llama 70B) across multiple consumer GPUs, where VRAM per card is limited (e.g., 16-24 GB).

Deeper dive

FSDP, introduced by PyTorch, is inspired by ZeRO (Zero Redundancy Optimizer) Stage 3 from DeepSpeed. It operates by sharding all three model states: parameters, gradients, and optimizer states. During training, each GPU holds only its assigned shard. Before a layer's forward pass, FSDP all-gathers the full parameters for that layer from all GPUs, computes the forward pass, then discards non-sharded parameters. The backward pass similarly gathers gradients. This reduces per-GPU memory from O(model_size) to O(model_size / num_gpus), enabling training of 70B+ parameter models on 4-8 consumer GPUs. The trade-off is increased communication overhead (all-gather and reduce-scatter operations) and slightly higher computation due to recomputation. FSDP can be configured with different sharding strategies: full sharding (ZeRO-3), sharding only optimizer states (ZeRO-1), or hybrid sharding. Operators using Hugging Face Transformers or PyTorch can enable FSDP with a simple config flag, but must ensure high-bandwidth interconnects (NVLink, InfiniBand) to avoid communication bottlenecks.

Practical example

Fine-tuning Llama 3.1 70B (140 GB in FP16) on four RTX 4090s (24 GB each) would be impossible with standard data parallelism (each GPU needs 140 GB). With FSDP, each GPU stores 35 GB of parameters, plus gradients and optimizer states (70 GB total per GPU). Using mixed precision (BF16) and activation checkpointing, per-GPU memory drops to ~20 GB, fitting within 24 GB. The training throughput might be ~500 tokens/sec with NVLink, but without it, inter-GPU communication could halve that.

Workflow example

In Hugging Face Transformers, enable FSDP by adding --fsdp 'full_shard auto_wrap' to the training script. For example: python run_clm.py --model_name meta-llama/Llama-2-70b --fsdp 'full_shard auto_wrap' --per_device_train_batch_size 1 --gradient_accumulation_steps 8. The runtime automatically wraps layers with FSDP and handles sharding. In PyTorch, wrap the model with FullyShardedDataParallel(model) and use sharding_strategy=ShardingStrategy.FULL_SHARD. Operators must set torch.distributed.init_process_group with NCCL backend for GPU communication.

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides
  • Best GPU for local AI →
  • Best laptop for local AI →
  • Best Mac for local AI →
When it doesn't work
  • CUDA out of memory →
  • Ollama running slowly →
  • ROCm not detected →