Hardware & infrastructure

FSDP (Fully Sharded Data Parallel)

FSDP (Fully Sharded Data Parallel) is a distributed training technique that shards model parameters, gradients, and optimizer states across multiple GPUs, reducing per-GPU memory usage. Unlike standard data parallelism, where each GPU holds a full copy of the model, FSDP partitions the model so each GPU stores only a fraction. During forward/backward passes, FSDP gathers the necessary shards on-the-fly, then discards them. This allows training models that would otherwise exceed a single GPU's VRAM. Operators encounter FSDP when fine-tuning large models (e.g., Llama 70B) across multiple consumer GPUs, where VRAM per card is limited (e.g., 16-24 GB).

Deeper dive

FSDP, introduced by PyTorch, is inspired by ZeRO (Zero Redundancy Optimizer) Stage 3 from DeepSpeed. It operates by sharding all three model states: parameters, gradients, and optimizer states. During training, each GPU holds only its assigned shard. Before a layer's forward pass, FSDP all-gathers the full parameters for that layer from all GPUs, computes the forward pass, then discards non-sharded parameters. The backward pass similarly gathers gradients. This reduces per-GPU memory from O(model_size) to O(model_size / num_gpus), enabling training of 70B+ parameter models on 4-8 consumer GPUs. The trade-off is increased communication overhead (all-gather and reduce-scatter operations) and slightly higher computation due to recomputation. FSDP can be configured with different sharding strategies: full sharding (ZeRO-3), sharding only optimizer states (ZeRO-1), or hybrid sharding. Operators using Hugging Face Transformers or PyTorch can enable FSDP with a simple config flag, but must ensure high-bandwidth interconnects (NVLink, InfiniBand) to avoid communication bottlenecks.

Practical example

Fine-tuning Llama 3.1 70B (140 GB in FP16) on four RTX 4090s (24 GB each) would be impossible with standard data parallelism (each GPU needs 140 GB). With FSDP, each GPU stores 35 GB of parameters, plus gradients and optimizer states (70 GB total per GPU). Using mixed precision (BF16) and activation checkpointing, per-GPU memory drops to ~20 GB, fitting within 24 GB. The training throughput might be ~500 tokens/sec with NVLink, but without it, inter-GPU communication could halve that.

Workflow example

In Hugging Face Transformers, enable FSDP by adding --fsdp 'full_shard auto_wrap' to the training script. For example: python run_clm.py --model_name meta-llama/Llama-2-70b --fsdp 'full_shard auto_wrap' --per_device_train_batch_size 1 --gradient_accumulation_steps 8. The runtime automatically wraps layers with FSDP and handles sharding. In PyTorch, wrap the model with FullyShardedDataParallel(model) and use sharding_strategy=ShardingStrategy.FULL_SHARD. Operators must set torch.distributed.init_process_group with NCCL backend for GPU communication.

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides

When it doesn't work