Ray Serve multi-node distributed inference (4 nodes × 2× RTX 4090)
Distributed serving across 4 machines, each with 2× RTX 4090. Ray Serve orchestrates replicas. 192 GB VRAM total, ~48 GB per replica (2× 24 GB). Built for high-concurrency request routing, not single-large-model deployment.
Multi-node Ray Serve clusters do NOT pool VRAM across machines for a single model. Each node hosts its own replica (or tensor-parallel rank within a tensor-parallel-2 group on dual-4090 nodes). Effective VRAM 'for a single model' is the per-replica capacity (~45 GB), not the cluster total. The 192 GB total is meaningful only for **aggregate throughput** — 4 replicas serving 4× the requests, not 4× the model size. This is the pattern that prosumer multi-machine deployments most often misunderstand. If your goal is 'run a 200B model that doesn't fit on one machine,' Ray Serve is the wrong tool — you want SGLang distributed or Exo-style layer split. Ray Serve's value is replica orchestration, autoscaling, and request routing.
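A minimal sketch of the pattern, assuming vLLM inside each replica; the model name, quantization choice, and resource numbers are illustrative rather than tested. The point is that `num_replicas=4` gives four independent copies of the same model, each bounded by its node's ~45 GB, never one pooled 192 GB.

```python
# Hedged sketch: 4 independent replicas, each wrapping a vLLM engine that
# tensor-parallels across the 2 local 4090s. Nothing here pools VRAM across nodes.
from ray import serve
from vllm import LLM, SamplingParams


@serve.deployment(
    num_replicas=4,                        # one replica per dual-4090 node
    ray_actor_options={"num_gpus": 2},     # each replica claims both local GPUs
)
class Llama70BReplica:
    def __init__(self):
        # Model + quantization are illustrative; whatever you load must fit in
        # ~45 GB of per-replica VRAM (2 x 24 GB minus runtime overhead).
        self.llm = LLM(
            model="meta-llama/Llama-3.1-70B-Instruct",  # assumption, not prescribed
            tensor_parallel_size=2,
            quantization="awq",
        )
        self.params = SamplingParams(max_tokens=256)

    async def __call__(self, request) -> str:
        prompt = (await request.json())["prompt"]
        # Production deployments would use vLLM's async engine or its
        # OpenAI-compatible server; the offline LLM class keeps the sketch short.
        outputs = self.llm.generate([prompt], self.params)
        return outputs[0].outputs[0].text


app = Llama70BReplica.bind()
# serve.run(app, route_prefix="/generate")  # deploy onto the running 4-node cluster
```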
Topology
- 8× RTX 4090 (4 nodes × 2 cards)
Recommended runtimes
Runtimes that are operationally viable for this combo. Each links to the runtime’s operational review.
Supported split strategies
How the model is partitioned across the components. The right strategy depends on model architecture, runtime, and interconnect bandwidth. For this combo: replica-parallel across nodes, with tensor-parallel-2 inside each dual-4090 node.
Why this combo
Ray Serve multi-node distributed is the production orchestration pattern for high-concurrency LLM serving. The use cases:
- 4-32 concurrent agent loops requiring high aggregate throughput
- Multi-tenant inference with autoscaling
- Resilient serving with graceful node-failure handling
- Multi-model deployment with request routing
What it's NOT for:
- Fitting a single model larger than per-replica capacity
- Single-user / low-concurrency workloads
- Anything where operational simplicity matters more than scale
Runtime compatibility
- Ray Serve ✓ the reference orchestration layer.
- vLLM ✓ excellent inside Ray Serve replicas.
- SGLang ✓ excellent — particularly when prefix-cache hit rate is high.
- TGI ✓ HuggingFace's serving alternative; works inside Ray Serve.
Distinguishing from SGLang distributed
This combo and /hardware-combos/sglang-distributed-multi-node (when added) solve different problems:
- Ray Serve: replica orchestration. Each replica is independent. Scales throughput.
- SGLang distributed: model sharding. The model spans nodes. Scales model size.
If the problem is 'serve 100 users on a 70B,' use Ray Serve. If it's 'serve 5 users on a 235B,' use SGLang distributed.
Setup notes
- 10 GbE minimum between nodes; 25 GbE recommended
- Static IPs on a dedicated subnet for inference traffic
- Object storage (S3 / MinIO) for shared model weights
- Prometheus + Grafana for replica monitoring
- Plan ~1-2 weeks for production setup
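Once the nodes are wired up, a quick sanity check is worth running before any deployment work. The sketch below assumes the cluster was bootstrapped with Ray's standard `ray start` commands (head on one node, workers joining by address); run it from any node and it should report 4 nodes and 8 GPUs.

```python
# Cluster sanity-check sketch. Assumed bootstrap (adjust IPs/ports to your subnet):
#   head node:    ray start --head --port=6379
#   worker nodes: ray start --address='<HEAD_IP>:6379'
import ray

ray.init(address="auto")                        # attach to the running cluster
resources = ray.cluster_resources()

print(f"nodes: {len([n for n in ray.nodes() if n['Alive']])}")  # expect 4
print(f"GPUs : {resources.get('GPU', 0)}")                      # expect 8.0
print(f"CPUs : {resources.get('CPU', 0)}")
```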
Related
- /stacks/distributed-inference-homelab — full deployment recipe
- /systems/distributed-inference — architectural depth
- /tools/ray-serve — runtime operational review
- /guides/running-local-ai-on-multiple-gpus-2026 — buying guide
Best model classes
- 70B serving at high concurrency — 4 replicas × 2× tensor-parallel = 4 concurrent serving paths, each at full 70B Q4 capacity.
- 30-50B with autoscaling — Ray Serve scales replicas up/down based on request queue.
- Multi-model serving — different replicas can host different models, routing by request metadata (see the routing sketch below).
The combo is built for throughput aggregation and operational flexibility, not for fitting a single very-large model.
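A hedged sketch of that multi-model pattern: an ingress deployment holds handles to two model deployments and routes each request by a metadata field. The deployment names, the `task` field, and the placeholder generate calls are all illustrative, not a fixed API.

```python
# Hedged sketch of metadata-based routing across two model deployments.
# Names and the "task" routing field are illustrative assumptions.
from ray import serve


@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 2})
class ChatModel:
    async def __call__(self, prompt: str) -> str:
        return f"[chat model] {prompt}"      # stand-in for a vLLM/SGLang call


@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 2})
class CoderModel:
    async def __call__(self, prompt: str) -> str:
        return f"[coder model] {prompt}"     # stand-in for a vLLM/SGLang call


@serve.deployment
class Router:
    def __init__(self, chat_handle, coder_handle):
        self.chat = chat_handle
        self.coder = coder_handle

    async def __call__(self, request) -> str:
        body = await request.json()
        handle = self.coder if body.get("task") == "code" else self.chat
        return await handle.remote(body["prompt"])


# Two replicas per model: the 4 nodes split 2/2 between the two models.
app = Router.bind(ChatModel.bind(), CoderModel.bind())
# serve.run(app)
```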
What this combo is bad at
- >70B single-model serving — each replica is capped at ~45 GB. For 100B+ models, use SGLang distributed or single-node multi-GPU.
- Cost-efficient single-user inference — 8 cards is overkill for solo use; single dual-4090 covers it.
- Tight-latency single-stream — request routing adds ~5-15ms vs single-machine deployment.
Who should avoid this
- Anyone whose model fits on 1-2 cards — operational complexity isn't worth it.
- Hobbyists — Ray Serve is production tooling; the learning curve is steep.
- Single-user workloads — replica orchestration adds zero value for one user.
4 nodes × 2× 4090 each = 8 cards across 4 chassis. Distributing the thermal load across chassis is easier to manage than concentrating it, but each chassis still produces ~900 W of heat. Plan datacenter or large server-room placement.
Ray Serve is the production-grade orchestration layer; node failure is handled gracefully via replica re-routing. Network reliability is the dominant failure mode; 10 GbE minimum, 25 GbE recommended for production.
Ubuntu 22.04 LTS or 24.04 LTS on all nodes; Docker / Kubernetes deployment recommended.
Failure modes specific to Ray Serve multi-node distributed
- Network partition during request. A partition or dropped connection mid-stream can hang or fail in-flight requests. Ray Serve handles this via retries, but tail latency suffers.
- Replica configuration drift. Different nodes running slightly different vLLM versions / model weights produce inconsistent responses. Strict version pinning + automated deployment is mandatory.
- Autoscaling thrashing. Aggressive scale-up + scale-down policies can produce constant replica churn, killing GPU memory load efficiency. Tune scale-up / scale-down thresholds carefully (a conservative starting config is sketched after this list).
- Head-of-line blocking on single replica. A slow request on one replica can back up its queue while others sit idle. Ray Serve's load balancing helps but doesn't eliminate this.
- Cross-replica KV cache misses. Each replica maintains its own KV cache; sticky-routing improves cache hit rate but adds head-of-line risk. RadixAttention via SGLang helps but requires per-replica configuration.
- Storage bottleneck on cold start. All replicas pulling the same model weights from object storage on cold start can saturate the storage backend. Pre-warm replicas before traffic surges.
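For the autoscaling-thrashing point above, a conservative starting config looks roughly like the sketch below. Field names follow recent Ray Serve releases (check your version), and the thresholds are assumptions to tune against your own queue depth, not recommendations.

```python
# Hedged sketch: slow scale-down to avoid replica churn (each scale-up pays a
# full model load into GPU memory). All numbers are starting-point assumptions.
from ray import serve


@serve.deployment(
    ray_actor_options={"num_gpus": 2},
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 4,
        "target_ongoing_requests": 8,   # queue depth per replica before scaling up
        "upscale_delay_s": 30,          # react to sustained load, not spikes
        "downscale_delay_s": 600,       # hold replicas long enough to avoid churn
    },
)
class AutoscaledModel:
    async def __call__(self, request) -> str:
        ...  # wrap the per-replica engine exactly as in the deployment sketch above
```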
vLLM tensor-parallel H100 workstation →
Different problems: Ray Serve scales request throughput via replicas; H100 TP-4 scales single-stream model size + latency. Pick by whether you need many users or one big model.
Distributed inference homelab →
Ray Serve replica orchestration recipe — multi-node aggregate throughput pattern.
Benchmark opportunities
Pending measurement targets for this combo. These are estimates, not measurements — actual benchmarks land in the catalog when run.
Ray Serve 4-node × 2× 4090 + Qwen 3 32B (concurrency scan)
qwen-3-32b: Ray Serve replica orchestration. Each replica runs vLLM tensor-parallel-2; 4 replicas = 4 parallel serving paths. Measure aggregate throughput across a concurrency scan.
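A sketch of how that scan might be driven, assuming the cluster exposes an OpenAI-compatible endpoint (vLLM's standard serving API); the URL, model id, and concurrency levels are placeholders, and tok/s here counts completion tokens only.

```python
# Hedged concurrency-scan sketch: fire N simultaneous requests at the Ray Serve
# ingress and report aggregate completion-token throughput per level.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://head-node:8000/v1", api_key="none")  # placeholder URL


async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="qwen-3-32b",  # placeholder model id
        messages=[{"role": "user", "content": "Summarize Ray Serve in 100 words."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens


async def scan(levels=(1, 4, 8, 16, 32)):
    for n in levels:
        start = time.perf_counter()
        tokens = await asyncio.gather(*[one_request() for _ in range(n)])
        elapsed = time.perf_counter() - start
        print(f"concurrency={n:3d}  tok/s={sum(tokens) / elapsed:8.1f}")


if __name__ == "__main__":
    asyncio.run(scan())
```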
Going deeper
- All hardware combinations — browse other multi-GPU and multi-machine setups.
- Running local AI on multiple GPUs in 2026 — the flagship buying / deployment guide.
- Distributed inference systems — architectural depth.