03. Network Requirements
Distributed LLM workloads demand network specifications that differ markedly from general-purpose clusters. Understanding these requirements prevents costly topology mistakes that degrade serving performance below theoretical compute capacity.
The critical metric is all-reduce bandwidth during gradient synchronization in training, or tensor gather operations during inference. A tensor-parallel layer with parameters distributed across 8 GPUs requires frequent all-reduce calls where each participating GPU sends and receives parameter shards. Network congestion here manifests as training stall or serving timeout.
Recommended specifications for production clusters: InfiniBand HDR (100 Gbps bidirectional) or 200GbE for smaller deployments. These provide sufficient bandwidth for medium-scale clusters up to 64 GPUs before requiring specialized topology optimizations. Beyond this scale, multiple network links per GPU and optimized collective libraries become necessary.
Common failure mode: specifying network hardware based on compute capacity alone. A server with 8 A100s has 6400 GB/s of FP32 compute throughput. Achievable collective bandwidth on 100GbE is 12.5 GB/s per GPU in each direction—a mismatch that bottlenecks tensor parallelism immediately.
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Calculate the network bandwidth requirement for your serving configuration. For tensor parallelism of size N with model size M bytes, each collective requires each of N GPUs to exchange M/N bytes. Multiply by your throughput target to derive required network bandwidth. Check that your planned network hardware satisfies this requirement.