RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Local AI Clusters

03. Network Requirements

Chapter 3 of 18 · 10 min
KEY INSIGHT

Collective operation bandwidth is often the limiting factor in multi-node LLM serving, not raw GPU compute. Network specifications must match the collective communication demands of your parallelism strategy.

Distributed LLM workloads demand network specifications that differ markedly from general-purpose clusters. Understanding these requirements prevents costly topology mistakes that degrade serving performance below theoretical compute capacity.

The critical metric is collective bandwidth: all-reduce during gradient synchronization in training, and all-reduce or all-gather of activations during inference. A tensor-parallel layer sharded across 8 GPUs requires frequent all-reduce calls, typically at least one per transformer block on every forward pass, in which each participating GPU sends and receives partial results for that layer while the weight shards themselves stay in place. Network congestion here shows up as training stalls or serving timeouts.
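To put a rough number on the cost of one all-reduce, the sketch below assumes a ring algorithm, which the chapter does not name; other algorithms move different volumes. In a ring all-reduce over N GPUs, each GPU sends and receives 2(N-1)/N times the buffer size. The function names and the 64 MiB buffer are hypothetical, and the time estimate counts bandwidth only, ignoring latency and protocol overhead.

```python
def ring_allreduce_bytes_per_gpu(payload_bytes: float, num_gpus: int) -> float:
    """Bytes each GPU sends (and also receives) in one ring all-reduce.

    Assumes the classic ring algorithm: a reduce-scatter phase followed by
    an all-gather phase, each moving (N - 1) / N of the buffer per GPU.
    """
    if num_gpus < 2:
        return 0.0  # a single GPU has nothing to exchange
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes


def allreduce_seconds(payload_bytes: float, num_gpus: int,
                      link_gbytes_per_s: float) -> float:
    """Bandwidth-only lower bound on all-reduce time for one link per GPU."""
    sent = ring_allreduce_bytes_per_gpu(payload_bytes, num_gpus)
    return sent / (link_gbytes_per_s * 1e9)


if __name__ == "__main__":
    payload = 64 * 1024 * 1024  # hypothetical 64 MiB buffer
    sent = ring_allreduce_bytes_per_gpu(payload, 8)
    print(f"{sent / 2**20:.1f} MiB sent per GPU")
    # 12.5 GB/s is the line rate of a 100 Gb/s link in one direction
    print(f"{allreduce_seconds(payload, 8, 12.5) * 1e3:.2f} ms at 12.5 GB/s")
```

Because the per-GPU volume approaches twice the buffer size as N grows, adding GPUs does not shrink the traffic each link has to carry.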

Recommended specifications for production clusters: InfiniBand HDR (200 Gb/s per port, or 100 Gb/s in its HDR100 variant) or 200GbE for smaller deployments. These provide sufficient bandwidth for medium-scale clusters of up to about 64 GPUs before specialized topology optimizations are required. Beyond this scale, multiple network links per GPU and optimized collective libraries become necessary.

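Common failure mode: specifying network hardware based on compute capacity alone. A server with 8 A100s delivers roughly 156 TFLOPS of peak FP32 compute (19.5 TFLOPS per GPU), and each GPU has up to 600 GB/s of aggregate NVLink bandwidth inside the node. A 100GbE link carries at most 12.5 GB/s in each direction at line rate, and that figure is shared by every GPU behind the NIC. The mismatch bottlenecks tensor parallelism as soon as it crosses a node boundary.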
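The gap between intra-node and inter-node bandwidth can be estimated with a few lines of arithmetic. The sketch below converts a link rate from Gb/s to GB/s and divides it across the GPUs that share the node's NICs. The function names are hypothetical, and the 600 GB/s intra-node figure is an assumption for illustration (it matches the published aggregate NVLink bandwidth of an A100); substitute the values for your own hardware.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert a link rate from gigabits to gigabytes per second."""
    return gbit_per_s / 8


def per_gpu_network_gbytes(link_gbit_per_s: float, nics_per_node: int,
                           gpus_per_node: int) -> float:
    """Line-rate network bandwidth available to each GPU, one direction.

    Assumes the node's NICs are shared evenly among its GPUs.
    """
    return gbit_to_gbyte(link_gbit_per_s) * nics_per_node / gpus_per_node


if __name__ == "__main__":
    intra_node_gbytes = 600.0  # assumed intra-node figure, see lead-in
    for nics in (1, 8):
        share = per_gpu_network_gbytes(100, nics, 8)
        ratio = intra_node_gbytes / share
        print(f"{nics} x 100GbE, 8 GPUs: {share:.4g} GB/s per GPU, "
              f"{ratio:.0f}x below the intra-node figure")
```

With a single shared NIC the per-GPU share drops well below the 12.5 GB/s line rate, which is why the number of links per node matters as much as the link speed.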

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

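Calculate the network bandwidth requirement for your serving configuration. As a simplified model, for tensor parallelism of size N and a model of M bytes, each collective requires each of the N GPUs to exchange M/N bytes. Multiply by your throughput target, expressed as collectives per second, to derive the required per-GPU bandwidth in bytes per second. Check that your planned network hardware satisfies this requirement, remembering that link speeds are quoted in bits per second and must be divided by 8.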
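One way to work through the exercise is sketched below. It applies the formula exactly as stated (M/N bytes per GPU per collective, times collectives per second) and compares the result with a link's usable rate. The function names, the 14 GB model, the rate of 2 collectives per second, and the 0.8 efficiency factor are all hypothetical values chosen for illustration, not measurements.

```python
def required_gbytes_per_s(model_bytes: float, tp_size: int,
                          collectives_per_s: float) -> float:
    """Per-GPU bandwidth in GB/s under the exercise's M/N model."""
    return model_bytes / tp_size * collectives_per_s / 1e9


def link_is_sufficient(required_gbytes: float, link_gbit_per_s: float,
                       efficiency: float = 0.8) -> bool:
    """Compare the requirement with a link's usable rate.

    `efficiency` is an assumed fraction of line rate that collectives
    actually achieve; measure it on your own fabric.
    """
    usable_gbytes = link_gbit_per_s / 8 * efficiency
    return required_gbytes <= usable_gbytes


if __name__ == "__main__":
    model_bytes = 14e9       # hypothetical: about 7B parameters in FP16
    tp_size = 8              # N in the exercise
    collectives_per_s = 2.0  # hypothetical throughput target
    need = required_gbytes_per_s(model_bytes, tp_size, collectives_per_s)
    print(f"required: {need:.2f} GB/s per GPU")
    for gbit in (25, 100, 200):
        verdict = "ok" if link_is_sufficient(need, gbit) else "too slow"
        print(f"{gbit} Gb/s link: {verdict}")
```

Keeping the efficiency factor as an explicit parameter makes it easy to rerun the check once you have a measured collective bandwidth for your network.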
