Build a distributed inference homelab stack (May 2026)
Run 70B-405B class models across 2-4 GPU machines on a controlled LAN. Real interconnect requirements; real monitoring; real failure modes. The path beyond 'just buy a bigger card.'
- 01. Inference engine (TP within node, PP across nodes): vLLM
vLLM over SGLang for a distributed homelab: a better-tested multi-node TP+PP path, broader kernel coverage on Hopper / Blackwell, and first-class Ray integration. SGLang's RadixAttention advantage still applies, but its multi-node story is younger; pick vLLM unless your traffic is heavily prefix-shared agent loops.
- 02. Cluster orchestrator (head node + worker placement): Ray Serve
Ray Serve is the canonical orchestration layer above vLLM in distributed deployments. Handles worker placement, autoscaling, traffic splitting, canary deploys. Same Ray cluster scales to add SGLang or other engines later — pick the orchestrator first; pick the engine inside it.
- 03. Frontend (monitoring + chat surface): Open WebUI
Open WebUI provides the user-facing chat surface AND a built-in usage dashboard — most homelab operators end up wanting both anyway. Talks to Ray Serve's OpenAI-compatible endpoint with no adapter.
- 04. Optional RAG layer for the household / team: AnythingLLM
AnythingLLM is optional but pairs naturally — point it at the cluster's serving endpoint and you get RAG-over-private-docs on top of distributed inference. Add when the cluster is stable.
When a distributed inference homelab actually makes sense
Read this section before anything else. The honest answer for most readers asking about distributed inference is: buy a bigger GPU instead. Multi-node, multi-GPU inference adds real hardware and ops complexity, and unless one of the three following conditions holds, it's usually a worse choice than buying or renting a single H100 / MI300X / 5090:
- The model literally won't fit on a single buyable card. 405B+ class models. 671B reasoning models. Even with 4-bit quants, you're past the single-card ceiling (back-of-envelope arithmetic after this list).
- You already own multiple GPUs that you're otherwise underusing. 2x 4090 sitting idle in two different machines is a different cost calculus than buying a fresh 5090.
- You have a hard data-residency requirement that rules out cloud. A single bigger card costs $5K+; a multi-node cluster doubles that, but for some industries the cluster is the only path forward at all.
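A quick check on the first condition, using the standard bytes-per-parameter figures for each precision (weights only; KV cache and activations push the real numbers higher):
# Rough weight footprint: params (billions) × bytes per parameter ≈ GB of weights
python3 - <<'EOF'
models = [
    ("70B  @ FP16", 70,  2.0),
    ("70B  @ INT4", 70,  0.5),
    ("405B @ INT4", 405, 0.5),
]
for name, params_b, bytes_per_param in models:
    print(f"{name}: ~{params_b * bytes_per_param:g} GB of weights")
EOF
# ~35 GB fits a single 48-80 GB card with room for KV cache; ~202.5 GB is past any single card, which is condition one.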
See /systems/distributed-inference for the architectural argument behind these conditions, with the latency math that makes consumer Ethernet the bottleneck, and the five reference stacks the distributed-inference ecosystem actually offers.
Networking assumptions
The single most-underestimated requirement for this stack: interconnect bandwidth between nodes. From the system guide:
- NVLink (within a node, datacenter SKUs): ~600 GB/s — perfect; tensor parallelism scales linearly.
- InfiniBand (between nodes): ~25-100 GB/s practical depending on tier. Acceptable for multi-node TP.
- 100 Gbps Ethernet: ~12.5 GB/s practical. Borderline acceptable for pipeline parallel; loses 30-50% throughput vs InfiniBand on tensor parallel.
- 10 Gbps Ethernet: ~1.25 GB/s. Pipeline parallel only; tensor parallel becomes worse than running on a single card.
- 1 Gbps Ethernet: Don't. Just don't.
For a homelab cluster, 100 Gbps Ethernet is the minimum credible interconnect. Switches are expensive but used datacenter gear (Mellanox SN2700, Arista 7050X3) is available second-hand at ~$1000-2000.
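Whatever link you end up with, measure it before trusting it. iperf3 between two nodes gives you the number that actually matters; the address below is an example matching the cluster used in the steps that follow:
# On one node, run the server
iperf3 -s
# On the other, push several parallel streams across the interconnect
iperf3 -c 192.168.10.1 -P 8 -t 30
# Expect ~90+ Gbps on well-configured 100 GbE; far less usually means an MTU, NIC, or switch problem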
Step-by-step setup
1. Set up the Ray cluster (head node + workers)
# On the head node:
pip install ray[default] vllm
# Start the head with a known port + dashboard
ray start --head --port=6379 --dashboard-host=0.0.0.0
# On every worker node (assuming head is at 192.168.10.1):
pip install ray[default] vllm
ray start --address=192.168.10.1:6379
# Verify cluster health from the head node
ray status
# Expected: head node + N worker nodes, each reporting its GPU count
NCCL configuration is non-optional for inter-node tensor parallel. Set NCCL_DEBUG=INFO, NCCL_IB_HCA, and NCCL_SOCKET_IFNAME on every node. Plan to spend a half-day debugging the first time; this is normal.
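A minimal sketch of that NCCL configuration, assuming the cluster interface is named ens1f0 and the HCA is mlx5_0; substitute the names that ip addr and ibstat report on your hardware. Export these in whatever shell (or systemd unit) launches ray start, on the head and every worker:
# Same values on every node, exported before `ray start`
export NCCL_DEBUG=INFO            # verbose init logging; drop to WARN once the cluster is stable
export NCCL_SOCKET_IFNAME=ens1f0  # interface carrying inter-node traffic (assumed name)
export NCCL_IB_HCA=mlx5_0         # InfiniBand / RoCE device to use (assumed name)
# No RDMA-capable NIC at all? Force plain TCP rather than letting NCCL probe and hang:
# export NCCL_IB_DISABLE=1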
2. Launch vLLM with multi-node TP+PP
# On the head node, with 4 nodes × 2 GPUs each = TP=2, PP=4
vllm serve meta-llama/Llama-3.1-405B-Instruct-AWQ \
--tensor-parallel-size 2 \
--pipeline-parallel-size 4 \
--distributed-executor-backend ray \
--gpu-memory-utilization 0.9 \
--max-model-len 32768 \
--enable-chunked-prefill \
--port 8000 \
--host 0.0.0.0
The --distributed-executor-backend ray flag is what tells vLLM to use the Ray cluster for worker placement. Multi-node deployment requires Ray; the multiprocessing backend is single-node only. The first cluster start takes 2-5 minutes to initialize; subsequent restarts are faster.
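Once loading finishes, a short request against the OpenAI-compatible endpoint exercises every TP and PP rank end to end; ray-head.local stands in for your head node's address:
# List the served model(s)
curl -s http://ray-head.local:8000/v1/models
# One short completion touches every pipeline stage
curl -s http://ray-head.local:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-405B-Instruct-AWQ",
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 32
      }'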
3. Wire Ray Serve in front for autoscaling + traffic mgmt
# ray_serve_config.py
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app
llm_config = LLMConfig(
model_loading_config={"model_id": "meta-llama/Llama-3.1-405B-Instruct-AWQ"},
accelerator_type="A100-80G",
deployment_config={
"autoscaling_config": {
"min_replicas": 1,
"max_replicas": 1 # 405B uses the entire cluster; 1 replica
}
},
runtime_env={"env_vars": {"VLLM_USE_V1": "1"}},
engine_kwargs={
"tensor_parallel_size": 2,
"pipeline_parallel_size": 4,
"max_model_len": 32768,
"enable_chunked_prefill": True,
},
)
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, route_prefix="/")
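Deploying is just running the script against the live Ray cluster; ray_serve_config.py is the filename assumed in the comment above, and the serve CLI comes from the Serve extras (pip install "ray[serve]"):
# From the head node, with the Ray cluster from step 1 already up
python ray_serve_config.py
# Confirm the application and its deployments report RUNNING / HEALTHY
serve status
# Ray Serve's HTTP proxy listens on port 8000 by default, so the same /v1 smoke test from step 2 applies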
4. Install Open WebUI as the frontend on a separate box
# Open WebUI runs on the household / office network
docker run -d --name open-webui \
-p 3000:8080 \
--restart unless-stopped \
-v open-webui:/app/backend/data \
-e OPENAI_API_BASE_URLS="http://ray-head.local:8000/v1" \
-e OPENAI_API_KEYS="any-string" \
ghcr.io/open-webui/open-webui:latest
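If the UI comes up but chats fail, check connectivity from the Open WebUI box to the serving endpoint before blaming the cluster; ray-head.local is the same assumed hostname as in the docker run line:
# From the machine running Open WebUI
curl -s http://ray-head.local:8000/v1/models   # should return the model list, not a timeout
docker logs --tail 50 open-webui               # backend connection errors surface here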
Monitoring
A distributed cluster without monitoring is a black box that fails silently. The minimum you should run:
- Ray Dashboard at port 8265 on the head node. Real-time GPU utilization, request queue depth, worker health.
- Prometheus + Grafana for vLLM metrics. vLLM exposes a /metrics endpoint with all the production-relevant counters: requests/s, TTFT, KV cache occupancy, prefix-cache hit rate (a quick scrape-target check follows this list).
- Network monitoring on the interconnect. iftop or smokeping between nodes; distributed inference fails most often via interconnect degradation, not GPU failure. Catch it early.
- nvidia-smi metrics on every node. Run a DCGM exporter; alert on temperature > 85°C, ECC errors, or clock throttling.
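The scrape-target check referenced above is one line; exact counter names vary by vLLM version, so treat the grep pattern as an example rather than a contract:
curl -s http://ray-head.local:8000/metrics | grep -E 'vllm:(num_requests_running|num_requests_waiting|gpu_cache_usage_perc)'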
Failure modes you'll hit
- NCCL hang on cluster startup. The most common single failure. Inter-node TP with non-uniform NIC settings deadlocks at init. Verify with NCCL_DEBUG=INFO output; pin NCCL_IB_HCA and NCCL_SOCKET_IFNAME on every node.
- Pipeline bubble starvation at low concurrency. PP needs many requests in flight to keep stages busy. Single-user homelab traffic leaves stages idle most of the time. Either keep more requests in flight (Open WebUI multi-tab is enough; concurrent users is better) or accept that PP is a memory-fitting strategy at low QPS, not a throughput strategy.
- Ray head node single-point-of-failure. Lose the head, lose the cluster. Plan for HA via Ray's managed services tier or accept the failure mode.
- Network MTU mismatch. Default is 1500; high-throughput RDMA wants 9000 (jumbo frames). Mismatched MTUs across switches cause silent packet drops and throughput regression (quick checks after this list).
- PCIe topology asymmetry. Within each node, two GPUs on the same root complex have different bandwidth than two on different root complexes. Verify with nvidia-smi topo -m; mismatched ranks halve throughput.
- Power tripping at full cluster load. A 4-node cluster with 8 GPUs total can pull 3 kW under inference. Most homelab circuits handle 1.5-2 kW. UPS + careful circuit planning are not optional.
- Thermal throttling. Datacenter SKUs assume datacenter cooling. A homelab rack with insufficient airflow throttles GPUs at sustained load. Monitor temps; add chassis fans or a dedicated rack AC if temps climb past 80°C.
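Quick checks for the MTU and PCIe-topology items above; the interface name and peer address are the same assumed examples used earlier:
# MTU must agree on every node's interconnect interface and every switch port (typically 9000)
ip link show ens1f0 | grep -o 'mtu [0-9]*'
# Path-MTU probe: 8972-byte payload + 28 bytes of ICMP/IP headers = 9000; -M do forbids fragmentation
ping -M do -s 8972 -c 3 192.168.10.2
# Intra-node GPU pairing: PIX/NV# between a GPU pair is good, PHB/SYS means different root complexes
nvidia-smi topo -m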
Variations and alternatives
Apple Silicon variation. Replace the entire stack with Exo on 4-8 M4 Pro Mac Minis with Thunderbolt 5 RDMA. Substantially less power, substantially less complexity, but also lower aggregate throughput. See the Apple Silicon AI stack for that path.
SGLang variation. Replace vLLM with SGLang if your cluster serves heavily-shared-prefix workloads (agent loops with stable system prompts). RadixAttention's cross-replica prefix cache is more valuable at cluster scale than per-replica throughput.
Petals variation (if you really cannot afford a cluster). Petals shards a model across volunteer hosts on the public internet — a single client with one GPU runs the input/output layers. Slower, but works on a laptop. Privacy unsuitable for any sensitive workload.
Who should avoid this stack
- Anyone whose model fits on a single 80GB+ card. A 70B-class model at 8-bit, or a ~120B-class model at 4-bit, fits comfortably; the single-card ceiling only genuinely bites at 405B+ class. Buy the bigger card. The cluster complexity is a tax with no benefit at single-card-fits scale.
- Anyone without 100 Gbps interconnect. On 10 Gbps Ethernet the cluster is worse than running on one node with offload. The interconnect IS the architecture.
- Anyone running inference on a residential power circuit. A single 15A 120V circuit tops out at 1.8 kW, and continuous loads should stay under roughly 80% of that (~1.4 kW); a 3 kW cluster doesn't fit even before the rest of your house. A dedicated 240V circuit is realistic; a shared homelab circuit is not.
- Anyone who reads “ops complexity” as a feature. This stack rewards platform-engineering expertise; it punishes “we'll figure it out as we go.” If you don't have NCCL + InfiniBand + Ray-cluster operational experience or a strong willingness to acquire it, the cloud is faster, cheaper, and friendlier.
Going deeper
- /systems/distributed-inference — the architectural depth this stack assumes you've read. Especially the latency-math section.
- vLLM operational review — the runtime-specific operator detail (the gpu_memory_utilization knob, prefix cache invalidation, multi-LoRA crosstalk).
- SGLang operational review — when its architectural advantage compounds at multi-node scale.
- Inference runtime ecosystem map — distributed serving zone with the alternatives.