Ray Serve multi-node distributed inference (4 nodes × 2× RTX 4090)
Distributed serving across 4 machines, each with 2× RTX 4090. Ray Serve orchestrates replicas. 192 GB VRAM total, ~48 GB per replica (2× 24 GB). Built for high-concurrency request routing, not single-large-model deployment.
Multi-node Ray Serve clusters do NOT pool VRAM across machines for a single model. Each node hosts its own replica (or tensor-parallel rank within a tensor-parallel-2 group on dual-4090 nodes). Effective VRAM 'for a single model' is the per-replica capacity (~45 GB), not the cluster total. The 192 GB total is meaningful only for **aggregate throughput** — 4 replicas serving 4× the requests, not 4× the model size. This is the pattern that prosumer multi-machine deployments most often misunderstand. If your goal is 'run a 200B model that doesn't fit on one machine,' Ray Serve is the wrong tool — you want SGLang distributed or Exo-style layer split. Ray Serve's value is replica orchestration, autoscaling, and request routing.
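A minimal sketch of the pattern, assuming vLLM inside each replica; the model name, quantization choice, and resource numbers are illustrative rather than tested. The point is that `num_replicas=4` gives four independent copies of the same model, each bounded by its node's ~45 GB, never one pooled 192 GB.

```python
# Hedged sketch: 4 independent replicas, each wrapping a vLLM engine that
# tensor-parallels across the 2 local 4090s. Nothing here pools VRAM across nodes.
from ray import serve
from vllm import LLM, SamplingParams


@serve.deployment(
    num_replicas=4,                        # one replica per dual-4090 node
    ray_actor_options={"num_gpus": 2},     # each replica claims both local GPUs
)
class Llama70BReplica:
    def __init__(self):
        # Model + quantization are illustrative; whatever you load must fit in
        # ~45 GB of per-replica VRAM (2 x 24 GB minus runtime overhead).
        self.llm = LLM(
            model="meta-llama/Llama-3.1-70B-Instruct",  # assumption, not prescribed
            tensor_parallel_size=2,
            quantization="awq",
        )
        self.params = SamplingParams(max_tokens=256)

    async def __call__(self, request) -> str:
        prompt = (await request.json())["prompt"]
        # Production deployments would use vLLM's async engine or its
        # OpenAI-compatible server; the offline LLM class keeps the sketch short.
        outputs = self.llm.generate([prompt], self.params)
        return outputs[0].outputs[0].text


app = Llama70BReplica.bind()
# serve.run(app, route_prefix="/generate")  # deploy onto the running 4-node cluster
```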
Topology
- 8× RTX 4090 (4 nodes × 2 cards)
Recommended runtimes
Runtimes that are operationally viable for this combo. Each links to the runtime’s operational review.
Supported split strategies
How the model is partitioned across the components. The right strategy depends on model architecture, runtime, and interconnect bandwidth. For this combo: replica-parallel across nodes, with tensor-parallel-2 inside each dual-4090 node.
Why this combo
Ray Serve multi-node distributed is the production orchestration pattern for high-concurrency LLM serving. The use cases:
- 4-32 concurrent agent loops requiring high aggregate throughput
- Multi-tenant inference with autoscaling
- Resilient serving with graceful node-failure handling
- Multi-model deployment with request routing
What it's NOT for:
- Fitting a single model larger than per-replica capacity
- Single-user / low-concurrency workloads
- Anything where operational simplicity matters more than scale
Runtime compatibility
- Ray Serve ✓ the reference orchestration layer.
- vLLM ✓ excellent inside Ray Serve replicas.
- SGLang ✓ excellent — particularly when prefix-cache hit rate is high.
- TGI ✓ HuggingFace's serving alternative; works inside Ray Serve.
Distinguishing from SGLang distributed
This combo and /hardware-combos/sglang-distributed-multi-node (when added) solve different problems:
- Ray Serve: replica orchestration. Each replica is independent. Scales throughput.
- SGLang distributed: model sharding. The model spans nodes. Scales model size.
If the problem is 'serve 100 users on a 70B,' use Ray Serve. If it's 'serve 5 users on a 235B,' use SGLang distributed.
Setup notes
- 10 GbE minimum between nodes; 25 GbE recommended
- Static IPs on a dedicated subnet for inference traffic
- Object storage (S3 / MinIO) for shared model weights
- Prometheus + Grafana for replica monitoring
- Plan ~1-2 weeks for production setup
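Once the nodes are wired up, a quick sanity check is worth running before any deployment work. The sketch below assumes the cluster was bootstrapped with Ray's standard `ray start` commands (head on one node, workers joining by address); run it from any node and it should report 4 nodes and 8 GPUs.

```python
# Cluster sanity-check sketch. Assumed bootstrap (adjust IPs/ports to your subnet):
#   head node:    ray start --head --port=6379
#   worker nodes: ray start --address='<HEAD_IP>:6379'
import ray

ray.init(address="auto")                        # attach to the running cluster
resources = ray.cluster_resources()

print(f"nodes: {len([n for n in ray.nodes() if n['Alive']])}")  # expect 4
print(f"GPUs : {resources.get('GPU', 0)}")                      # expect 8.0
print(f"CPUs : {resources.get('CPU', 0)}")
```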
Related
- /stacks/distributed-inference-homelab — full deployment recipe
- /systems/distributed-inference — architectural depth
- /tools/ray-serve — runtime operational review
- /guides/running-local-ai-on-multiple-gpus-2026 — buying guide
Best model classes
- 70B serving at high concurrency — 4 replicas × 2× tensor-parallel = 4 concurrent serving paths, each at full 70B Q4 capacity.
- 30-50B with autoscaling — Ray Serve scales replicas up/down based on request queue.
- Multi-model serving — different replicas can host different models, routing by request metadata (see the routing sketch below).
The combo is built for throughput aggregation and operational flexibility, not for fitting a single very-large model.
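A hedged sketch of that multi-model pattern: an ingress deployment holds handles to two model deployments and routes each request by a metadata field. The deployment names, the `task` field, and the placeholder generate calls are all illustrative, not a fixed API.

```python
# Hedged sketch of metadata-based routing across two model deployments.
# Names and the "task" routing field are illustrative assumptions.
from ray import serve


@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 2})
class ChatModel:
    async def __call__(self, prompt: str) -> str:
        return f"[chat model] {prompt}"      # stand-in for a vLLM/SGLang call


@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 2})
class CoderModel:
    async def __call__(self, prompt: str) -> str:
        return f"[coder model] {prompt}"     # stand-in for a vLLM/SGLang call


@serve.deployment
class Router:
    def __init__(self, chat_handle, coder_handle):
        self.chat = chat_handle
        self.coder = coder_handle

    async def __call__(self, request) -> str:
        body = await request.json()
        handle = self.coder if body.get("task") == "code" else self.chat
        return await handle.remote(body["prompt"])


# Two replicas per model: the 4 nodes split 2/2 between the two models.
app = Router.bind(ChatModel.bind(), CoderModel.bind())
# serve.run(app)
```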
What this combo is bad at
- >70B single-model serving — each replica is capped at ~45 GB. For 100B+ models, use SGLang distributed or single-node multi-GPU.
- Cost-efficient single-user inference — 8 cards is overkill for solo use; single dual-4090 covers it.
- Tight-latency single-stream — request routing adds ~5-15ms vs single-machine deployment.
Who should avoid this
- Anyone whose model fits on 1-2 cards — operational complexity isn't worth it.
- Hobbyists — Ray Serve is production tooling; the learning curve is steep.
- Single-user workloads — replica orchestration adds zero value for one user.
4 nodes × 2× 4090 each = 8 cards across 4 chassis. Distributing the thermal load across chassis is easier to manage than concentrating it, but each chassis still produces ~900 W of heat. Plan datacenter or large server-room placement.
Ray Serve is the production-grade orchestration layer; node failure is handled gracefully via replica re-routing. Network reliability is the dominant failure mode; 10 GbE minimum, 25 GbE recommended for production.
Ubuntu 22.04 LTS or 24.04 LTS on all nodes; Docker / Kubernetes deployment recommended.
Failure modes specific to Ray Serve multi-node distributed
- Network partition during request. A partition or dropped connection mid-stream can hang or fail in-flight requests. Ray Serve handles this via retries, but tail latency suffers.
- Replica configuration drift. Different nodes running slightly different vLLM versions / model weights produce inconsistent responses. Strict version pinning + automated deployment is mandatory.
- Autoscaling thrashing. Aggressive scale-up + scale-down policies can produce constant replica churn, killing GPU memory load efficiency. Tune scale-up / scale-down thresholds carefully (a conservative starting config is sketched after this list).
- Head-of-line blocking on single replica. A slow request on one replica can back up its queue while others sit idle. Ray Serve's load balancing helps but doesn't eliminate this.
- Cross-replica KV cache misses. Each replica maintains its own KV cache; sticky-routing improves cache hit rate but adds head-of-line risk. RadixAttention via SGLang helps but requires per-replica configuration.
- Storage bottleneck on cold start. All replicas pulling the same model weights from object storage on cold start can saturate the storage backend. Pre-warm replicas before traffic surges.
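For the autoscaling-thrashing point above, a conservative starting config looks roughly like the sketch below. Field names follow recent Ray Serve releases (check your version), and the thresholds are assumptions to tune against your own queue depth, not recommendations.

```python
# Hedged sketch: slow scale-down to avoid replica churn (each scale-up pays a
# full model load into GPU memory). All numbers are starting-point assumptions.
from ray import serve


@serve.deployment(
    ray_actor_options={"num_gpus": 2},
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 4,
        "target_ongoing_requests": 8,   # queue depth per replica before scaling up
        "upscale_delay_s": 30,          # react to sustained load, not spikes
        "downscale_delay_s": 600,       # hold replicas long enough to avoid churn
    },
)
class AutoscaledModel:
    async def __call__(self, request) -> str:
        ...  # wrap the per-replica engine exactly as in the deployment sketch above
```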
vLLM tensor-parallel H100 workstation →
Different problems: Ray Serve scales request throughput via replicas; H100 TP-4 scales single-stream model size + latency. Pick by whether you need many users or one big model.
Distributed inference homelab →
Ray Serve replica orchestration recipe — multi-node aggregate throughput pattern.
Benchmark opportunities
Pending measurement targets for this combo. These are estimates, not measurements — actual benchmarks land in the catalog when run.
Ray Serve 4-node × 2× 4090 + Qwen 3 32B (concurrency scan)
qwen-3-32b: Ray Serve replica orchestration. Each replica runs vLLM tensor-parallel-2; 4 replicas = 4 parallel serving paths. Measure aggregate throughput across a concurrency scan.
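A sketch of how that scan might be driven, assuming the cluster exposes an OpenAI-compatible endpoint (vLLM's standard serving API); the URL, model id, and concurrency levels are placeholders, and tok/s here counts completion tokens only.

```python
# Hedged concurrency-scan sketch: fire N simultaneous requests at the Ray Serve
# ingress and report aggregate completion-token throughput per level.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://head-node:8000/v1", api_key="none")  # placeholder URL


async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="qwen-3-32b",  # placeholder model id
        messages=[{"role": "user", "content": "Summarize Ray Serve in 100 words."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens


async def scan(levels=(1, 4, 8, 16, 32)):
    for n in levels:
        start = time.perf_counter()
        tokens = await asyncio.gather(*[one_request() for _ in range(n)])
        elapsed = time.perf_counter() - start
        print(f"concurrency={n:3d}  tok/s={sum(tokens) / elapsed:8.1f}")


if __name__ == "__main__":
    asyncio.run(scan())
```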
Going deeper
- All hardware combinations — browse other multi-GPU and multi-machine setups.
- Running local AI on multiple GPUs in 2026 — the flagship buying / deployment guide.
- Distributed inference systems — architectural depth.