RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
Hardware & infrastructure

Model Parallelism

Model parallelism is a technique that splits a single neural network across multiple GPUs or other accelerators, with each device holding a subset of the model's layers or parameters. Unlike data parallelism, where each GPU has a full copy of the model and processes different data batches, model parallelism partitions the model itself. This is necessary when the model is too large to fit into the VRAM of a single GPU—for example, a 70B-parameter model at 4-bit quantization requires roughly 40 GB, exceeding the 24 GB of a consumer RTX 4090. Operators encounter model parallelism when running models larger than their GPU's VRAM, often via frameworks like vLLM or Hugging Face Transformers with device_map="auto".
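The memory arithmetic behind that example can be sketched in a few lines. This is a rough estimate only: the function names and the 4 GB overhead allowance (for quantization metadata, KV cache, and runtime buffers) are assumptions for illustration, not measured values.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 4.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in vram_gb."""
    needed = weight_memory_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= vram_gb


if __name__ == "__main__":
    print(f"70B at 4-bit, weights only: {weight_memory_gb(70, 4):.1f} GB")
    print("fits on one 24 GB card:", fits_in_vram(70, 4, 24))
    print("fits across two 24 GB cards:", fits_in_vram(70, 4, 48))
```

The raw 4-bit weights come to 35 GB; the overhead allowance is what pushes the practical figure toward 40 GB.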

Deeper dive

Model parallelism can be implemented in two main ways: layer-wise (pipeline parallelism) or tensor-wise (tensor parallelism). Pipeline parallelism assigns consecutive layers to different GPUs (for example, layers 1-10 on GPU 0 and layers 11-20 on GPU 1), and activations flow through the devices in sequence. Tensor parallelism splits individual operations, such as matrix multiplications, across GPUs; because partial results must be exchanged inside every layer, it benefits from high-bandwidth interconnects such as NVLink. Frameworks can combine both: vLLM, for instance, exposes --tensor-parallel-size and --pipeline-parallel-size, and a common pattern is tensor parallelism across the GPUs within one node and pipeline parallelism across nodes.

On consumer hardware, model parallelism is less common because multi-GPU setups with fast interconnects are rare; operators more often rely on CPU offloading, a related technique in which some layers reside in system RAM. Still, with dual RTX 3090s (24 GB each), one can run a 70B model at Q4 by splitting layers across both cards; an NVLink bridge helps, but a layer-wise split also works over PCIe. The key trade-off is communication overhead: each GPU must wait for results from the others, which increases latency per token.
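The difference between the two approaches can be illustrated with a toy, pure-Python sketch. No real GPUs are involved: each "device" is just a slice of a list, and every function name here is hypothetical. Tensor parallelism is shown as a column-wise split of one weight matrix; pipeline parallelism as contiguous groups of layers with a hand-off between groups.

```python
def matmul(x, w):
    """Multiply row vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]


def split_columns(w, parts):
    """Split matrix w into `parts` equal, contiguous column shards."""
    cols = len(w[0])
    assert cols % parts == 0, "columns must divide evenly across devices"
    step = cols // parts
    return [[row[k * step:(k + 1) * step] for row in w] for k in range(parts)]


def tensor_parallel_matmul(x, w, parts):
    """Each simulated device multiplies by its own shard; results are gathered."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    gathered = []
    for partial in partials:  # stands in for the all-gather communication step
        gathered.extend(partial)
    return gathered


def pipeline_forward(x, layers, stages):
    """Run layers in order, grouped into contiguous stages (one per device)."""
    per_stage = -(-len(layers) // stages)  # ceiling division
    handoffs = 0
    for start in range(0, len(layers), per_stage):
        if start > 0:
            handoffs += 1  # activations cross to the next device here
        for w in layers[start:start + per_stage]:
            x = matmul(x, w)
    return x, handoffs


if __name__ == "__main__":
    x = [1, 2]
    w = [[1, 2, 3, 4], [5, 6, 7, 8]]
    print(tensor_parallel_matmul(x, w, parts=2) == matmul(x, w))
    double = [[2, 0], [0, 2]]
    print(pipeline_forward(x, [double] * 4, stages=2))
```

In the tensor-parallel case communication happens inside every layer, whereas the pipeline case only communicates at stage boundaries, which is why the former is more sensitive to interconnect bandwidth.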

Practical example

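An operator with two RTX 3090s (24 GB each) wants to run Llama 3.1 70B at Q4_K_M (~40 GB total). In llama.cpp, -ngl sets how many layers are offloaded to GPU, and the split across cards is controlled by --split-mode and --tensor-split rather than by repeating -ngl. Running with -ngl 99 --split-mode layer --tensor-split 1,1 offloads every layer and divides the model's 80 layers evenly: the first 40 on GPU 0, the next 40 on GPU 1. Each token requires an activation transfer between GPUs over PCIe. Throughput is on the order of 10 tok/s, compared to roughly 30 tok/s on a single A100 (80 GB); treat both figures as rough, since they vary with context length, quantization, and build.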
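A layer-wise split like the one in this example can be sketched as a proportional assignment of layer indices to GPUs. The helper below is hypothetical and is not llama.cpp's actual allocation logic; it only shows how per-GPU proportions (similar in spirit to the values passed to llama.cpp's --tensor-split option) map to contiguous layer ranges, assuming 80 layers and ~40 GB of weights spread evenly across them.

```python
def split_layers(n_layers, proportions):
    """Assign contiguous layer ranges to GPUs according to `proportions`."""
    total = sum(proportions)
    ranges = []
    start = 0
    cumulative = 0.0
    for i, share in enumerate(proportions):
        cumulative += share
        if i == len(proportions) - 1:
            end = n_layers  # last GPU takes whatever remains
        else:
            end = round(n_layers * cumulative / total)
        ranges.append(range(start, end))
        start = end
    return ranges


def per_gpu_weight_gb(total_gb, n_layers, ranges):
    """Estimate weight memory per GPU, assuming equally sized layers."""
    return [total_gb * len(r) / n_layers for r in ranges]


if __name__ == "__main__":
    ranges = split_layers(80, [1, 1])
    for gpu, r in enumerate(ranges):
        print(f"GPU {gpu}: layers {r.start}-{r.stop - 1}")
    print(per_gpu_weight_gb(40.0, 80, ranges))
```

With an even split, each card holds about 20 GB of weights, leaving a few gigabytes per card for the KV cache and buffers.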

Workflow example

In vLLM, tensor parallelism is configured with the --tensor-parallel-size flag. For example, vllm serve meta-llama/Llama-3.1-70B --tensor-parallel-size 2 shards the model across two GPUs; the runtime automatically partitions attention heads and feed-forward weights. Note that this checkpoint is unquantized: at 16-bit precision its weights alone are about 140 GB, so two 24 GB cards are not enough, and the command needs 80 GB-class GPUs or a quantized variant of the model. Operators monitor GPU memory with nvidia-smi to confirm each card stays under its VRAM limit. In Hugging Face Transformers, model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B", device_map="auto") uses the accelerate library to distribute layers across the available GPUs, spilling to CPU RAM if they do not fit.
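The memory check mentioned above can be scripted. The sketch below parses the CSV output of nvidia-smi's --query-gpu mode and flags cards close to their limit; the function names and the 95% threshold are assumptions for illustration, and the demo uses a made-up sample string rather than real readings.

```python
import subprocess

# nvidia-smi query that prints one "index, used, total" row per GPU, in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' rows into a list of dicts (values in MiB)."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append({"gpu": int(index),
                     "used_mib": int(used),
                     "total_mib": int(total)})
    return rows


def gpus_over_budget(rows, max_fraction=0.95):
    """Return indices of GPUs whose usage exceeds max_fraction of their VRAM."""
    return [r["gpu"] for r in rows
            if r["used_mib"] / r["total_mib"] > max_fraction]


def read_gpu_memory():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    # Hypothetical sample output for two 24 GB cards, not a real measurement.
    sample = "0, 22000, 24576\n1, 23900, 24576\n"
    print(gpus_over_budget(parse_gpu_memory(sample)))
```

On a machine with NVIDIA drivers installed, read_gpu_memory() can replace the sample string to check live usage while a model is loaded.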

Reviewed by Fredoline Eruo. See our editorial policy.
