RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo

© 2026 runlocalai.co · Independently operated
Local AI Clusters

05. Tensor Parallelism

Chapter 5 of 18 · 15 min
KEY INSIGHT

Tensor parallelism reduces the per-GPU memory footprint roughly in proportion to the number of GPUs, but every partitioned operation incurs collective communication overhead. The useful degree of tensor parallelism depends on your network bandwidth; going beyond 8 GPUs typically requires combining it with pipeline parallelism.

Tensor parallelism splits individual layer computations across GPUs, so that models too large for a single GPU's memory can run across a cluster. The technique parallelizes matrix multiplications across devices and requires frequent collective communication to synchronize partial results.

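The core mechanism operates at the operation level. For a matrix multiplication Y = XW, the weight W can be partitioned by columns or by rows. With a column split, each GPU holds a block of W's columns, sees the full input X, and computes the matching slice of Y's columns; the slices are combined by concatenation (an all-gather). With a row split, X is split along its feature dimension to match W's row blocks, each GPU computes a full-shape partial sum of Y, and an all-reduce adds the partial sums. This pattern repeats for each linear layer in transformer architectures.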
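The two ways of splitting Y = XW can be checked without any GPUs. The following is a minimal plain-Python sketch, not a framework implementation: the two "ranks" are simulated as list entries, and the helper names (`matmul`, `split_columns`, `split_rows`) and the matrix values are made up for illustration. It shows that a column split is recombined by concatenation, while a row split is recombined by an elementwise sum.

```python
# Plain-Python simulation of two tensor-parallel "ranks"; no real devices involved.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def split_columns(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]

def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]

X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]            # shape (batch=2, in=4)
W = [[1, 0, 2, 1],
     [0, 1, 1, 2],
     [3, 1, 0, 0],
     [1, 2, 1, 3]]            # shape (in=4, out=4)
reference = matmul(X, W)      # single-device result

# Column split: every rank sees all of X and produces a slice of Y's columns.
col_parts = [matmul(X, w_shard) for w_shard in split_columns(W, 2)]
# Combining is a concatenation, which is what an all-gather does across ranks.
y_col = [left + right for left, right in zip(*col_parts)]

# Row split: X is split along its feature axis to match W's row blocks.
row_parts = [matmul(x_shard, w_shard)
             for x_shard, w_shard in zip(split_columns(X, 2), split_rows(W, 2))]
# Each partial has Y's full shape; combining is an elementwise sum (all-reduce).
y_row = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(*row_parts)]

print(y_col == reference, y_row == reference)
```

Integer inputs keep the comparison exact; with floating-point weights the row-split sum may differ from the single-device result by rounding error, so a tolerance-based comparison would be needed.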

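Megatron-LM-style frameworks apply this split to both the attention and feedforward layers. Attention splits the Q, K, and V projections by columns so that each tensor rank owns a subset of the attention heads; the output projection is split by rows, and a single all-reduce restores the full attention output on every rank. Feedforward layers follow the same pairing: the first linear layer is split by columns along the intermediate dimension, the second by rows, and one all-reduce combines the result.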
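The feedforward split can be illustrated with a small simulation. This is a sketch under stated assumptions rather than Megatron-LM code: the two ranks are simulated in a loop, the matrices `A` and `B` and their values are hypothetical, and ReLU stands in for the activation function. Because the activation is elementwise, a column block of the first layer feeds directly into the matching row block of the second layer, and only one sum across ranks is needed at the end.

```python
# Simulated two-rank feedforward block: column-split first layer, row-split second.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def relu(m):
    """Elementwise ReLU, used here as a stand-in activation."""
    return [[max(0, v) for v in row] for row in m]

X = [[1, -2],
     [3, 1]]                  # shape (batch=2, hidden=2)
A = [[1, -1, 2, 0],
     [0, 1, -1, 2]]           # first layer, shape (hidden=2, intermediate=4)
B = [[1, 0],
     [2, 1],
     [0, -1],
     [1, 1]]                  # second layer, shape (intermediate=4, hidden=2)

reference = matmul(relu(matmul(X, A)), B)   # single-device result

WORLD_SIZE = 2
block = len(B) // WORLD_SIZE                # intermediate columns per rank
partials = []
for rank in range(WORLD_SIZE):
    cols = slice(rank * block, (rank + 1) * block)
    a_shard = [row[cols] for row in A]      # column block of A
    b_shard = B[cols]                       # matching row block of B
    hidden = relu(matmul(X, a_shard))       # local only, no communication
    partials.append(matmul(hidden, b_shard))

# The single all-reduce of the block: sum the full-shape partial outputs.
output = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(*partials)]
print(output == reference)
```

Splitting the first layer by rows instead would require a sum before the activation, which is why the column-then-row ordering keeps communication to one collective per block in the forward pass.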

Scaling behavior: tensor parallelism shrinks per-GPU weight memory in proportion to the number of GPUs, at the cost of communication overhead. An 8-GPU tensor-parallel scheme reduces per-device weight memory by roughly 8×, which allows models roughly 8× larger to fit; anything replicated on every rank does not shrink, so the total saving is somewhat smaller. The frequent all-reduce calls limit the useful degree of tensor parallelism. Typical sweet spots are 2 to 8 GPUs, after which pipeline parallelism becomes necessary.
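A back-of-the-envelope calculation makes the trade-off concrete. The sketch below is illustrative only: the 70-billion-parameter model, the 2 bytes per parameter, and the activation size are hypothetical inputs, and the communication estimate assumes a ring all-reduce, in which each rank sends 2 × (p − 1) / p times the message size. Function names are made up for this example.

```python
def weight_gb_per_gpu(num_params, bytes_per_param, tp_degree):
    """Weight memory per GPU in GB when weights are split evenly across ranks."""
    return num_params * bytes_per_param / tp_degree / 1e9

def ring_all_reduce_gb_per_rank(message_bytes, tp_degree):
    """GB each rank sends in one ring all-reduce: 2 * (p - 1) / p * message."""
    return 2 * (tp_degree - 1) / tp_degree * message_bytes / 1e9

NUM_PARAMS = 70e9                    # hypothetical model size
BYTES_PER_PARAM = 2                  # assumes 16-bit weights
ACTIVATION_BYTES = 4096 * 8192 * 2   # hypothetical: 4096 tokens x 8192 hidden x 2 bytes

for tp in (1, 2, 4, 8):
    weights = weight_gb_per_gpu(NUM_PARAMS, BYTES_PER_PARAM, tp)
    traffic = ring_all_reduce_gb_per_rank(ACTIVATION_BYTES, tp)
    print(f"TP={tp}: {weights:.1f} GB weights per GPU, "
          f"{traffic:.3f} GB sent per rank per all-reduce")
```

Weight memory keeps halving as the degree doubles, while per-rank traffic per all-reduce approaches twice the message size and is paid on every partitioned layer. Whether that traffic is affordable depends on the link between the GPUs, which this estimate does not model.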

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
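One way to capture the runtime and package versions for that record is a short standard-library script. This is a sketch, assuming `torch` is the package of interest; the function name `environment_record` and the output layout are hypothetical, and the data path and observed output still need to be noted by hand.

```python
import json
import platform
import sys
from importlib import metadata

def environment_record(packages=("torch",)):
    """Collect Python, platform, and installed package versions for a run log."""
    record = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            record["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            record["packages"][name] = None  # package is not installed
    return record

print(json.dumps(environment_record(), indent=2))
```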

EXERCISE

Implement a tensor-parallel linear layer in PyTorch to understand the communication pattern. Distribute a matrix across 2 GPUs, compute partial results independently, then all-reduce the output. Measure communication time versus computation time to understand overhead scaling.

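import torch
import torch.distributed as dist

def tensor_parallel_linear(x_shard, local_weight):
    """Row-parallel linear layer: each rank holds one row block of the weight.

    x_shard:      (batch, in_features // world_size), this rank's slice of the input
    local_weight: (in_features // world_size, out_features), this rank's row block
    Assumes dist.init_process_group() has already been called on every rank.
    """
    # Each rank computes a full-shape partial sum of the output
    partial_output = torch.matmul(x_shard, local_weight)
    # All-reduce sums the partial outputs in place, so every rank ends with Y
    dist.all_reduce(partial_output, op=dist.ReduceOp.SUM)
    return partial_output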