Model Parallelism — AI glossary

Model parallelism is a technique that splits a single neural network across multiple GPUs or other accelerators, with each device holding a subset of the model's layers or parameters. Unlike data parallelism, where each GPU holds a full copy of the model and processes a different batch of data, model parallelism partitions the model itself. This is necessary when the model is too large to fit into the VRAM of a single GPU. For example, a 70B-parameter model at 4-bit quantization requires roughly 40 GB (about 35 GB for the raw 4-bit weights, plus quantization scales and runtime buffers), which exceeds the 24 GB of a consumer RTX 4090. Operators encounter model parallelism when running models larger than a single GPU's VRAM, often via frameworks like vLLM or Hugging Face Transformers with device_map="auto".
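The sizing arithmetic above can be sketched in a few lines of Python. This is a rough estimate, not a measurement: the function names are hypothetical, and the 5 GB allowance for quantization scales, KV cache, and buffers is an assumption chosen to match the "roughly 40 GB" figure.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


def fits_on_one_gpu(n_params: float, bits_per_weight: float,
                    vram_gb: float, overhead_gb: float = 5.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in one GPU's VRAM.

    overhead_gb is an illustrative allowance, not a measured value.
    """
    return weight_memory_gb(n_params, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    print(weight_memory_gb(70e9, 4))      # 35.0 (GB of raw 4-bit weights)
    print(fits_on_one_gpu(70e9, 4, 24))   # False: needs more than one 24 GB card
    print(fits_on_one_gpu(8e9, 4, 24))    # True: a small model fits on one card
```

When the check returns False, the model has to be split across devices or partly offloaded to system RAM.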

Deeper dive

Model parallelism can be implemented in two main ways: layer-wise (pipeline parallelism) or tensor-wise (tensor parallelism). Pipeline parallelism assigns consecutive layers to different GPUs (e.g., layers 1-10 on GPU 0, 11-20 on GPU 1), and data flows sequentially through the pipeline. Tensor parallelism splits individual operations (like matrix multiplications) across GPUs, so the GPUs must exchange partial results inside every layer; this requires high-bandwidth interconnects (e.g., NVLink) for efficient communication. In practice, many frameworks can combine both: for instance, vLLM exposes a tensor-parallel size and a pipeline-parallel size as separate settings, and a common pattern is tensor parallelism within a node and pipeline parallelism across nodes.

On consumer hardware, model parallelism is less common because multi-GPU setups with fast interconnects are rare; instead, operators often rely on CPU offloading (a form of model parallelism where some layers reside in system RAM). However, with dual RTX 3090s (24 GB each) connected via NVLink, one can run a 70B model at Q4 by splitting layers across both cards. The key trade-off is communication overhead: each GPU must wait for results from others, increasing latency per token.
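To make the two splitting strategies concrete, here is a toy sketch in pure Python. It uses plain lists instead of GPU tensors, treats each "layer" as a single weight matrix, and runs the "devices" one after another in a loop; all function names are hypothetical and the even-division assumptions are noted in the comments.

```python
def matmul(x, w):
    """Multiply a row vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def tensor_parallel_matmul(x, w, n_devices):
    """Tensor parallelism: each device owns a block of w's output columns."""
    per_device = len(w[0]) // n_devices  # assumes columns divide evenly
    shards = [[row[d * per_device:(d + 1) * per_device] for row in w]
              for d in range(n_devices)]
    # On real hardware these partial products run at the same time.
    partials = [matmul(x, shard) for shard in shards]
    # Gathering the pieces is the inter-GPU communication step.
    return [value for part in partials for value in part]


def pipeline_forward(x, layers, n_stages):
    """Pipeline parallelism: each stage owns a run of consecutive layers."""
    per_stage = len(layers) // n_stages  # assumes layers divide evenly
    stages = [layers[s * per_stage:(s + 1) * per_stage] for s in range(n_stages)]
    for stage in stages:
        for w in stage:
            x = matmul(x, w)
        # Here the activation x would be sent on to the next device.
    return x


if __name__ == "__main__":
    x = [1, 2]
    w = [[1, 2, 3, 4], [5, 6, 7, 8]]
    print(matmul(x, w))                     # [11, 14, 17, 20]
    print(tensor_parallel_matmul(x, w, 2))  # [11, 14, 17, 20]
    layers = [[[1, 1], [0, 1]]] * 4
    print(pipeline_forward(x, layers, 2))   # [1, 6]
```

Both strategies produce the same numbers as the unsplit computation; what differs is where the weights live and how often the devices have to communicate (inside every operation for tensor parallelism, once per stage boundary for pipeline parallelism).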

Practical example

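An operator with two RTX 3090s (24 GB each) wants to run Llama 3.1 70B at Q4_K_M (~40 GB total). Using llama.cpp, the operator offloads all layers to the GPUs with -ngl 99 (any value at or above the model's layer count offloads everything) and divides them between the cards with --split-mode layer --tensor-split 1,1. The -ngl flag is given once and sets the total number of offloaded layers; it is not repeated per GPU. The model has 80 transformer layers, so the runtime places roughly the first 40 on GPU 0 and the next 40 on GPU 1. For each token, activations cross from GPU 0 to GPU 1 over PCIe at the layer boundary, and the two cards work one after the other rather than simultaneously. The result is on the order of ~10 tok/s, compared to ~30 tok/s on a single A100 (80 GB); actual throughput varies with context length, drivers, and build options.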
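The layer split in this example can be approximated with a short Python sketch. The helper names are hypothetical, and this is not llama.cpp's actual placement logic: it assumes all layers are the same size and ignores embeddings, the output layer, and the KV cache.

```python
def assign_layers(n_layers, split):
    """Divide layers 0..n_layers-1 into consecutive ranges, one per GPU.

    split holds relative proportions, e.g. [1, 1] for an even two-way split.
    """
    total = sum(split)
    ranges = []
    start = 0
    cumulative = 0
    for i, share in enumerate(split):
        cumulative += share
        # The last GPU takes whatever remains so that no layer is dropped.
        end = n_layers if i == len(split) - 1 else round(n_layers * cumulative / total)
        ranges.append(range(start, end))
        start = end
    return ranges


def per_gpu_weights_gb(total_gb, n_layers, ranges):
    """Estimate each GPU's share of the weights, assuming equal-sized layers."""
    return [total_gb * len(r) / n_layers for r in ranges]


if __name__ == "__main__":
    ranges = assign_layers(80, [1, 1])
    print([(r.start, r.stop - 1) for r in ranges])  # [(0, 39), (40, 79)]
    print(per_gpu_weights_gb(40, 80, ranges))       # [20.0, 20.0]
```

Under these assumptions each 24 GB card holds about 20 GB of weights, leaving roughly 4 GB per card for the KV cache and buffers, which is why long contexts can still run out of memory on this setup.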

Workflow example

In vLLM, tensor parallelism is configured via the --tensor-parallel-size flag. For example, vllm serve meta-llama/Llama-3.1-70B --tensor-parallel-size 2 splits the model across two GPUs. The runtime automatically partitions attention heads and feed-forward networks. Note that the unquantized 16-bit weights of a 70B model take about 140 GB, so this command needs GPUs with enough combined memory (or a quantized checkpoint); two 24 GB cards are not sufficient. Operators monitor GPU memory with nvidia-smi to ensure each card stays below its VRAM limit. In Hugging Face Transformers, model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B", device_map="auto") uses the accelerate library to distribute layers across the available GPUs and, if they do not all fit, CPU memory.
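The monitoring step can be scripted. The sketch below assumes nvidia-smi is on the PATH and uses its CSV query mode; the helper names, the 95% threshold, and the sample numbers are illustrative assumptions, not recommended values.

```python
import subprocess

# Query mode prints one line per GPU, e.g. "0, 21504, 24576" (values in MiB).
QUERY = ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Turn nvidia-smi CSV output into a list of per-GPU dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append({"index": int(index),
                     "used_mib": int(used),
                     "total_mib": int(total)})
    return rows


def gpus_over_limit(rows, max_fraction=0.95):
    """Return the indices of GPUs using more than max_fraction of their VRAM."""
    return [r["index"] for r in rows
            if r["used_mib"] / r["total_mib"] > max_fraction]


def read_gpu_memory():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)


if __name__ == "__main__":
    sample = "0, 21504, 24576\n1, 24000, 24576\n"  # made-up readings
    print(gpus_over_limit(parse_gpu_memory(sample)))  # [1]
```

On a machine with GPUs, calling gpus_over_limit(read_gpu_memory()) after the model has loaded gives a quick check that no card is close to running out of memory.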