Build a distributed inference homelab stack (May 2026)
Run 70B-405B class models across 2-4 GPU machines on a controlled LAN. Real interconnect requirements; real monitoring; real failure modes. The path beyond 'just buy a bigger card.'
- 01. Inference engine (TP within node, PP across nodes): vLLM
vLLM over SGLang for a distributed homelab: a better-tested multi-node TP+PP path, broader kernel coverage on Hopper / Blackwell, and first-class Ray integration. SGLang's RadixAttention advantage still applies, but its multi-node story is younger; pick vLLM unless your traffic is heavily prefix-shared agent loops.
- 02. Cluster orchestrator (head node + worker placement): Ray Serve
Ray Serve is the canonical orchestration layer above vLLM in distributed deployments. Handles worker placement, autoscaling, traffic splitting, canary deploys. Same Ray cluster scales to add SGLang or other engines later — pick the orchestrator first; pick the engine inside it.
- 03. Frontend (monitoring + chat surface): Open WebUI
Open WebUI provides the user-facing chat surface AND a built-in usage dashboard — most homelab operators end up wanting both anyway. Talks to Ray Serve's OpenAI-compatible endpoint with no adapter.
- 04. Optional RAG layer for the household / team: AnythingLLM
AnythingLLM is optional but pairs naturally — point it at the cluster's serving endpoint and you get RAG-over-private-docs on top of distributed inference. Add when the cluster is stable.
When a distributed inference homelab actually makes sense
Read this section before anything else. The honest answer for most readers asking about distributed inference is: buy a bigger GPU instead. Multi-node, multi-GPU inference adds real hardware and ops complexity, and unless one of the three following conditions holds, it's usually a worse choice than buying or renting a single H100 / MI300X / 5090:
- The model literally won't fit on a single buyable card. 405B+ class models. 671B reasoning models. Even with 4-bit quants, you're past the single-card ceiling (back-of-envelope arithmetic after this list).
- You already own multiple GPUs that you're otherwise underusing. 2x 4090 sitting idle in two different machines is a different cost calculus than buying a fresh 5090.
- You have a hard data-residency requirement that rules out cloud. A single bigger card costs $5K+; a multi-node cluster doubles that, but for some industries the cluster is the only path forward at all.
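A quick check on the first condition, using the standard bytes-per-parameter figures for each precision (weights only; KV cache and activations push the real numbers higher):
# Rough weight footprint: params (billions) × bytes per parameter ≈ GB of weights
python3 - <<'EOF'
models = [
    ("70B  @ FP16", 70,  2.0),
    ("70B  @ INT4", 70,  0.5),
    ("405B @ INT4", 405, 0.5),
]
for name, params_b, bytes_per_param in models:
    print(f"{name}: ~{params_b * bytes_per_param:g} GB of weights")
EOF
# ~35 GB fits a single 48-80 GB card with room for KV cache; ~202.5 GB is past any single card, which is condition one.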
See /systems/distributed-inference for the architectural argument behind these conditions, with the latency math that makes consumer Ethernet the bottleneck, and the five reference stacks the distributed-inference ecosystem actually offers.
Networking assumptions
The single most-underestimated requirement for this stack: interconnect bandwidth between nodes. From the system guide:
- NVLink (within a node, datacenter SKUs): ~600 GB/s — perfect; tensor parallelism scales linearly.
- InfiniBand (between nodes): ~25-100 GB/s practical depending on tier. Acceptable for multi-node TP.
- 100 Gbps Ethernet: ~12.5 GB/s practical. Borderline acceptable for pipeline parallel; loses 30-50% throughput vs InfiniBand on tensor parallel.
- 10 Gbps Ethernet: ~1.25 GB/s. Pipeline parallel only; tensor parallel becomes worse than running on a single card.
- 1 Gbps Ethernet: Don't. Just don't.
For a homelab cluster, 100 Gbps Ethernet is the minimum credible interconnect. Switches are expensive but used datacenter gear (Mellanox SN2700, Arista 7050X3) is available second-hand at ~$1000-2000.
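Whatever link you end up with, measure it before trusting it. iperf3 between two nodes gives you the number that actually matters; the address below is an example matching the cluster used in the steps that follow:
# On one node, run the server
iperf3 -s
# On the other, push several parallel streams across the interconnect
iperf3 -c 192.168.10.1 -P 8 -t 30
# Expect ~90+ Gbps on well-configured 100 GbE; far less usually means an MTU, NIC, or switch problem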
Step-by-step setup
1. Set up the Ray cluster (head node + workers)
# On the head node:
pip install ray[default] vllm
# Start the head with a known port + dashboard
ray start --head --port=6379 --dashboard-host=0.0.0.0
# On every worker node (assuming head is at 192.168.10.1):
pip install ray[default] vllm
ray start --address=192.168.10.1:6379
# Verify cluster health from the head node
ray status
# Expected: head node + N worker nodes, each reporting its GPU count
NCCL configuration is non-optional for inter-node tensor parallel. Set NCCL_DEBUG=INFO, NCCL_IB_HCA, and NCCL_SOCKET_IFNAME on every node. Plan to spend a half-day debugging the first time; this is normal.
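A minimal sketch of that NCCL configuration, assuming the cluster interface is named ens1f0 and the HCA is mlx5_0; substitute the names that ip addr and ibstat report on your hardware. Export these in whatever shell (or systemd unit) launches ray start, on the head and every worker:
# Same values on every node, exported before `ray start`
export NCCL_DEBUG=INFO            # verbose init logging; drop to WARN once the cluster is stable
export NCCL_SOCKET_IFNAME=ens1f0  # interface carrying inter-node traffic (assumed name)
export NCCL_IB_HCA=mlx5_0         # InfiniBand / RoCE device to use (assumed name)
# No RDMA-capable NIC at all? Force plain TCP rather than letting NCCL probe and hang:
# export NCCL_IB_DISABLE=1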
2. Launch vLLM with multi-node TP+PP
# On the head node, with 4 nodes × 2 GPUs each = TP=2, PP=4
vllm serve meta-llama/Llama-3.1-405B-Instruct-AWQ \
--tensor-parallel-size 2 \
--pipeline-parallel-size 4 \
--distributed-executor-backend ray \
--gpu-memory-utilization 0.9 \
--max-model-len 32768 \
--enable-chunked-prefill \
--port 8000 \
--host 0.0.0.0
The --distributed-executor-backend ray flag is what tells vLLM to use the Ray cluster for worker placement. Multi-node deployment requires Ray; the multiprocessing backend is single-node only. The first cluster start takes 2-5 minutes to initialize; subsequent restarts are faster.
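Once loading finishes, a short request against the OpenAI-compatible endpoint exercises every TP and PP rank end to end; ray-head.local stands in for your head node's address:
# List the served model(s)
curl -s http://ray-head.local:8000/v1/models
# One short completion touches every pipeline stage
curl -s http://ray-head.local:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-405B-Instruct-AWQ",
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 32
      }'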
3. Wire Ray Serve in front for autoscaling + traffic mgmt
# ray_serve_config.py
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app
llm_config = LLMConfig(
model_loading_config={"model_id": "meta-llama/Llama-3.1-405B-Instruct-AWQ"},
accelerator_type="A100-80G",
deployment_config={
"autoscaling_config": {
"min_replicas": 1,
"max_replicas": 1 # 405B uses the entire cluster; 1 replica
}
},
runtime_env={"env_vars": {"VLLM_USE_V1": "1"}},
engine_kwargs={
"tensor_parallel_size": 2,
"pipeline_parallel_size": 4,
"max_model_len": 32768,
"enable_chunked_prefill": True,
},
)
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, route_prefix="/")
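Deploying is just running the script against the live Ray cluster; ray_serve_config.py is the filename assumed in the comment above, and the serve CLI comes from the Serve extras (pip install "ray[serve]"):
# From the head node, with the Ray cluster from step 1 already up
python ray_serve_config.py
# Confirm the application and its deployments report RUNNING / HEALTHY
serve status
# Ray Serve's HTTP proxy listens on port 8000 by default, so the same /v1 smoke test from step 2 applies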
4. Install Open WebUI as the frontend on a separate box
# Open WebUI runs on the household / office network
docker run -d --name open-webui \
-p 3000:8080 \
--restart unless-stopped \
-v open-webui:/app/backend/data \
-e OPENAI_API_BASE_URLS="http://ray-head.local:8000/v1" \
-e OPENAI_API_KEYS="any-string" \
ghcr.io/open-webui/open-webui:latest
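If the UI comes up but chats fail, check connectivity from the Open WebUI box to the serving endpoint before blaming the cluster; ray-head.local is the same assumed hostname as in the docker run line:
# From the machine running Open WebUI
curl -s http://ray-head.local:8000/v1/models   # should return the model list, not a timeout
docker logs --tail 50 open-webui               # backend connection errors surface here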
Monitoring
A distributed cluster without monitoring is a black box that fails silently. The minimum you should run:
- Ray Dashboard at port 8265 on the head node. Real-time GPU utilization, request queue depth, worker health.
- Prometheus + Grafana for vLLM metrics. vLLM exposes a /metrics endpoint with all the production-relevant counters: requests/s, TTFT, KV cache occupancy, prefix-cache hit rate (a quick scrape-target check follows this list).
- Network monitoring on the interconnect. iftop or smokeping between nodes; distributed inference fails most often via interconnect degradation, not GPU failure. Catch it early.
- nvidia-smi metrics on every node. Run a DCGM exporter; alert on temperature > 85°C, ECC errors, or clock throttling.
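The scrape-target check referenced above is one line; exact counter names vary by vLLM version, so treat the grep pattern as an example rather than a contract:
curl -s http://ray-head.local:8000/metrics | grep -E 'vllm:(num_requests_running|num_requests_waiting|gpu_cache_usage_perc)'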
Failure modes you'll hit
- NCCL hang on cluster startup. The most common single failure. Inter-node TP with non-uniform NIC settings deadlocks at init. Verify with NCCL_DEBUG=INFO output; pin NCCL_IB_HCA and NCCL_SOCKET_IFNAME on every node.
- Pipeline bubble starvation at low concurrency. PP needs many requests in flight to keep stages busy. Single-user homelab traffic leaves stages idle most of the time. Either keep more requests in flight (Open WebUI multi-tab is enough; concurrent users is better) or accept that PP is a memory-fitting strategy at low QPS, not a throughput strategy.
- Ray head node single-point-of-failure. Lose the head, lose the cluster. Plan for HA via Ray's managed services tier or accept the failure mode.
- Network MTU mismatch. Default is 1500; high-throughput RDMA wants 9000 (jumbo frames). Mismatched MTUs across switches cause silent packet drops and throughput regression (quick checks after this list).
- PCIe topology asymmetry. Within each node, two GPUs on the same root complex have different bandwidth than two on different root complexes. Verify with nvidia-smi topo -m; mismatched ranks halve throughput.
- Power tripping at full cluster load. A 4-node cluster with 8 GPUs total can pull 3 kW under inference. Most homelab circuits handle 1.5-2 kW. UPS + careful circuit planning are not optional.
- Thermal throttling. Datacenter SKUs assume datacenter cooling. A homelab rack with insufficient airflow throttles GPUs at sustained load. Monitor temps; add chassis fans or a dedicated rack AC if temps climb past 80°C.
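Quick checks for the MTU and PCIe-topology items above; the interface name and peer address are the same assumed examples used earlier:
# MTU must agree on every node's interconnect interface and every switch port (typically 9000)
ip link show ens1f0 | grep -o 'mtu [0-9]*'
# Path-MTU probe: 8972-byte payload + 28 bytes of ICMP/IP headers = 9000; -M do forbids fragmentation
ping -M do -s 8972 -c 3 192.168.10.2
# Intra-node GPU pairing: PIX/NV# between a GPU pair is good, PHB/SYS means different root complexes
nvidia-smi topo -m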
Variations and alternatives
Apple Silicon variation. Replace the entire stack with Exo on 4-8 M4 Pro Mac Minis with Thunderbolt 5 RDMA. Substantially less power, substantially less complexity, but also lower aggregate throughput. See the Apple Silicon AI stack for that path.
SGLang variation. Replace vLLM with SGLang if your cluster serves heavily-shared-prefix workloads (agent loops with stable system prompts). RadixAttention's cross-replica prefix cache is more valuable at cluster scale than per-replica throughput.
Petals variation (if you really cannot afford a cluster). Petals shards a model across volunteer hosts on the public internet — a single client with one GPU runs the input/output layers. Slower, but works on a laptop. Privacy unsuitable for any sensitive workload.
Who should avoid this stack
- Anyone whose model fits on a single 80GB+ card. A 70B-class model at 8-bit, or a ~120B-class model at 4-bit, fits comfortably; the single-card ceiling only genuinely bites at 405B+ class. Buy the bigger card. The cluster complexity is a tax with no benefit at single-card-fits scale.
- Anyone without 100 Gbps interconnect. On 10 Gbps Ethernet the cluster is worse than running on one node with offload. The interconnect IS the architecture.
- Anyone running inference on a residential power circuit. A single 15A 120V circuit tops out at 1.8 kW, and continuous loads should stay under roughly 80% of that (~1.4 kW); a 3 kW cluster doesn't fit even before the rest of your house. A dedicated 240V circuit is realistic; a shared homelab circuit is not.
- Anyone who reads “ops complexity” as a feature. This stack rewards platform-engineering expertise; it punishes “we'll figure it out as we go.” If you don't have NCCL + InfiniBand + Ray-cluster operational experience or a strong willingness to acquire it, the cloud is faster, cheaper, and friendlier.
Going deeper
- /systems/distributed-inference — the architectural depth this stack assumes you've read. Especially the latency-math section.
- vLLM operational review — the runtime-specific operator detail (the gpu_memory_utilization knob, prefix cache invalidation, multi-LoRA crosstalk).
- SGLang operational review — when its architectural advantage compounds at multi-node scale.
- Inference runtime ecosystem map — distributed serving zone with the alternatives.