Network Requirements — Local AI Clusters (Chapter 3)

Distributed LLM workloads demand network specifications that differ markedly from general-purpose clusters. Understanding these requirements prevents costly topology mistakes that degrade serving performance below theoretical compute capacity.

The critical metric is collective bandwidth: all-reduce during gradient synchronization in training, and all-reduce or all-gather on activations during tensor-parallel inference. A tensor-parallel layer with its weights sharded across 8 GPUs typically requires an all-reduce after each sharded block, in which every participating GPU sends and receives partial activations rather than the weights themselves. Network congestion here shows up as training stalls or serving timeouts.
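
To make the traffic volume concrete, the sketch below estimates how many bytes each GPU sends in one all-reduce. It assumes a bandwidth-optimal ring algorithm, which is one common choice and not necessarily what a given collective library will select. The function name and the example tensor shape are illustrative assumptions, not values from this chapter.

```python
def ring_allreduce_bytes_per_gpu(message_bytes: int, num_gpus: int) -> float:
    """Bytes each GPU sends (and also receives) in one ring all-reduce."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # The reduce-scatter phase and the all-gather phase each move
    # (N - 1) / N of the message through every GPU.
    return 2 * (num_gpus - 1) / num_gpus * message_bytes


# Hypothetical activation tensor: 4 sequences x 2048 tokens x 8192 hidden, FP16.
activation_bytes = 4 * 2048 * 8192 * 2
sent = ring_allreduce_bytes_per_gpu(activation_bytes, num_gpus=8)
print(f"{sent / 1e6:.1f} MB sent per GPU per all-reduce")
```

The factor 2(N - 1)/N approaches 2 as the GPU count grows, so per-GPU traffic is set mostly by message size rather than by cluster size.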

Recommended specifications for production clusters: InfiniBand HDR (200 Gb/s per port; the HDR100 variant runs at 100 Gb/s) or 200GbE for smaller deployments. These provide sufficient bandwidth for medium-scale clusters of up to 64 GPUs before specialized topology optimizations are required. Beyond this scale, multiple network links per GPU and optimized collective libraries become necessary.

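Common failure mode: specifying network hardware based on compute capacity alone. A server with 8 A100s delivers roughly 156 TFLOPS of FP32 compute (8 × 19.5 TFLOPS), and in NVLink-equipped (SXM) systems each GPU can exchange data with its peers at up to 600 GB/s inside the node. A 100GbE link, by contrast, carries at most 12.5 GB/s in each direction at line rate, and achievable collective bandwidth is lower once protocol overhead is counted. That mismatch bottlenecks any tensor parallelism that crosses the node boundary.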
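
The size of the mismatch can be estimated with a few lines of arithmetic. The sketch below converts link speed from Gb/s to GB/s and estimates the time for one all-reduce under a simple bandwidth-only model: ring algorithm, no latency, no protocol overhead. The 1 GB message size and the 300 GB/s per-direction NVLink figure (half of the A100's 600 GB/s bidirectional total) are assumptions chosen for illustration, so treat the results as lower bounds on real transfer time.

```python
def gbit_to_gbyte_per_s(gbit_per_s: float) -> float:
    """Convert a link speed in Gb/s to GB/s (8 bits per byte)."""
    return gbit_per_s / 8


def allreduce_seconds(message_bytes: float, num_gpus: int,
                      link_gbyte_per_s: float) -> float:
    """Bandwidth-only time estimate for one ring all-reduce."""
    traffic = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return traffic / (link_gbyte_per_s * 1e9)


# Per-direction bandwidth at line rate; real throughput will be lower.
links = {
    "100GbE": gbit_to_gbyte_per_s(100),
    "200 Gb/s link": gbit_to_gbyte_per_s(200),
    "NVLink (assumed 300 GB/s per direction)": 300.0,
}

message_bytes = 1e9  # hypothetical 1 GB all-reduce
for name, bandwidth in links.items():
    ms = allreduce_seconds(message_bytes, 8, bandwidth) * 1e3
    print(f"{name}: {bandwidth:.1f} GB/s, about {ms:.1f} ms per all-reduce")
```

Under this model the Ethernet links are more than an order of magnitude slower than the in-node interconnect, which is why tensor parallelism is usually kept inside a single node.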

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
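
One way to capture that record is a small script that writes the environment details next to the observed output. This is a minimal sketch using only the Python standard library; the package name, data path, and notes in the usage example are placeholders to be replaced with whatever the exercise actually used.

```python
import json
import platform
import sys
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a package, or a marker if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def build_run_record(packages, data_path, observed_output, notes=""):
    """Collect the details the checkpoint asks for into one dictionary."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
        "data_path": data_path,
        "observed_output": observed_output,
        "notes": notes,
    }


record = build_run_record(
    packages=["torch"],            # placeholder package name
    data_path="./data/example",    # placeholder path
    observed_output="paste the observed output here",
    notes="backend, model size, and available memory go here",
)
print(json.dumps(record, indent=2))
```

Saving the printed JSON alongside the exercise keeps the constraints attached to the result they qualify.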