RUNLOCALAI · v38

Operator-grade instrument for local-AI hardware intelligence. Hand-written verdicts. Real benchmarks. Reproducible commands.

Distributed · 10 GbE · expert

Ray Serve multi-node distributed inference (4 nodes × 2× RTX 4090)

Distributed serving across 4 machines, each with 2× RTX 4090. Ray Serve orchestrates replicas. 192 GB total / ~45 GB usable per replica. Built for high-concurrency request routing, not single-large-model deployment.

By Fredoline Eruo · Reviewed 2026-05-06
Try this build in the custom builder

Tweak GPU count, mix in another card, switch OS / runtime — see which models still fit.

Open in builder →
Memory budget
Total VRAM
192 GB
Effective for a single model
~45 GB per replica
~23% of total
Not pooled

Multi-node Ray Serve clusters do NOT pool VRAM across machines for a single model. Each node hosts its own replica (or tensor-parallel rank within a tensor-parallel-2 group on dual-4090 nodes). Effective VRAM 'for a single model' is the per-replica capacity (~45 GB), not the cluster total. The 192 GB total is meaningful only for aggregate throughput — 4 replicas serving 4× the requests, not 4× the model size. This is the pattern that prosumer multi-machine deployments most often misunderstand. If your goal is 'run a 200B model that doesn't fit on one machine,' Ray Serve is the wrong tool — you want SGLang distributed or Exo-style layer split. Ray Serve's value is replica orchestration, autoscaling, and request routing.
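The arithmetic above can be made explicit. A minimal sketch, using the article's figures; the ~22.5 GB usable-per-card number (24 GB minus runtime/CUDA-context overhead) is an assumption, not a measurement:

```python
# Why 8x RTX 4090 across 4 nodes is NOT 192 GB for one model.
# Per-card usable VRAM (~22.5 GB of 24 GB after runtime overhead) is assumed.

CARD_VRAM_GB = 24
USABLE_PER_CARD_GB = 22.5   # assumed headroom after CUDA context / runtime overhead
NODES = 4
CARDS_PER_NODE = 2

total_vram = NODES * CARDS_PER_NODE * CARD_VRAM_GB   # the marketing number: 192
per_replica = CARDS_PER_NODE * USABLE_PER_CARD_GB    # the single-model ceiling: 45.0

def fits_single_model(model_gb: float) -> bool:
    """A single model must fit within ONE replica (a TP-2 pair), not the cluster."""
    return model_gb <= per_replica

print(fits_single_model(40))   # 70B Q4 (~40 GB): fits per replica
print(fits_single_model(60))   # 100B-class: does not fit, no matter the cluster total
```

The 192 GB figure only ever multiplies request capacity, never model size.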

Topology

Topology
distributed
Interconnect
Ethernet · 10 GbE (~1.25 GB/s)
Component count
8 units
Components
  • 8× RTX 4090

Recommended runtimes

Runtimes that are operationally viable for this combo. Each links to the runtime’s operational review.

Ray Serve · vLLM · SGLang

Supported split strategies

How the model is partitioned across the components. The right strategy depends on model architecture, runtime, and interconnect bandwidth.

Request routing · Tensor parallel

Why this combo

Ray Serve multi-node distributed is the production orchestration pattern for high-concurrency LLM serving. The use cases:

  • 4-32 concurrent agent loops requiring high aggregate throughput
  • Multi-tenant inference with autoscaling
  • Resilient serving with graceful node-failure handling
  • Multi-model deployment with request routing

What it's NOT for:

  • Fitting a single model larger than per-replica capacity
  • Single-user / low-concurrency workloads
  • Anything where operational simplicity matters more than scale

Runtime compatibility

  • Ray Serve ✓ the reference orchestration layer.
  • vLLM ✓ excellent inside Ray Serve replicas.
  • SGLang ✓ excellent — particularly when prefix-cache hit rate is high.
  • TGI ✓ HuggingFace's serving alternative; works inside Ray Serve.
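The replica topology this page describes — 4 independent replicas, each a tensor-parallel-2 vLLM engine on one dual-4090 node — can be sketched as a config builder. Key names echo Ray Serve (`num_replicas`, `ray_actor_options`) and vLLM (`tensor_parallel_size`, `gpu_memory_utilization`) options, but this is an illustration of the layout, not a drop-in deployment:

```python
# Sketch of this combo's replica layout. Field names mirror Ray Serve / vLLM
# options but are illustrative only -- no ray import, nothing is deployed.

def build_serving_config(model: str, nodes: int = 4, gpus_per_node: int = 2) -> dict:
    return {
        "deployment": {
            "num_replicas": nodes,                    # one replica per node
            "ray_actor_options": {"num_gpus": gpus_per_node},
        },
        "engine": {
            "model": model,
            "tensor_parallel_size": gpus_per_node,    # TP-2 inside each node
            "gpu_memory_utilization": 0.90,           # leave headroom per card
        },
    }

cfg = build_serving_config("Qwen/Qwen3-32B")
# 4 replicas x TP-2 = 8 GPUs total; each request is served by exactly one replica.
```

Note what is absent: nothing splits the model across nodes. Tensor parallelism stops at the chassis boundary, which is exactly why the 10 GbE link never carries activation traffic.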

Distinguishing from SGLang distributed

This combo and /hardware-combos/sglang-distributed-multi-node (when added) solve different problems:

  • Ray Serve: replica orchestration. Each replica is independent. Scales throughput.
  • SGLang distributed: model sharding. The model spans nodes. Scales model size.

If the problem is 'serve 100 users on a 70B model,' pick Ray Serve. If it's 'serve 5 users on a 235B model,' pick SGLang distributed.

Setup notes

  • 10 GbE minimum between nodes; 25 GbE recommended
  • Static IPs on a dedicated subnet for inference traffic
  • Object storage (S3 / MinIO) for shared model weights
  • Prometheus + Grafana for replica monitoring
  • Plan ~1-2 weeks for production setup
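A pre-flight check for the static-IP subnet above can be sketched with stdlib sockets — confirm every node's cluster port is reachable before starting anything. The IPs and the port (6379, Ray's default head port) are placeholders for your own layout:

```python
# Pre-flight sketch: verify each node's cluster port is reachable on the
# dedicated inference subnet. IPs and port are hypothetical placeholders.
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False

NODE_IPS = ["10.0.42.1", "10.0.42.2", "10.0.42.3", "10.0.42.4"]  # hypothetical

def preflight(ips, port=6379, timeout=1.0):
    """Map each node IP to reachability of its cluster port."""
    return {ip: port_open(ip, port, timeout) for ip in ips}
```

Running this from each node in turn also catches asymmetric-routing mistakes that only surface later as mysterious replica timeouts.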

Related

  • /stacks/distributed-inference-homelab — full deployment recipe
  • /systems/distributed-inference — architectural depth
  • /tools/ray-serve — runtime operational review
  • /guides/running-local-ai-on-multiple-gpus-2026 — buying guide

Best model classes

  • 70B serving at high concurrency — 4 replicas × 2× tensor-parallel = 4 concurrent serving paths, each at full 70B Q4 capacity.
  • 30-50B with autoscaling — Ray Serve scales replicas up/down based on request queue.
  • Multi-model serving — different replicas can host different models, routing by request metadata.

The combo is built for throughput aggregation and operational flexibility, not for fitting a single very-large model.

What this combo is bad at

  • >70B single-model serving — each replica is capped at ~45 GB. For 100B+ models, use SGLang distributed or single-node multi-GPU.
  • Cost-efficient single-user inference — 8 cards is overkill for solo use; single dual-4090 covers it.
  • Tight-latency single-stream — request routing adds ~5-15ms vs single-machine deployment.

Who should avoid this

  • Anyone whose model fits on 1-2 cards — operational complexity isn't worth it.
  • Hobbyists — Ray Serve is production tooling; the learning curve is steep.
  • Single-user workloads — replica orchestration adds zero value for one user.
Power & thermal
~3600W peak

4 nodes × 2× 4090 each = 8 cards across 4 chassis, ~3600W of GPU draw at peak. Distributed thermal load is easier than concentrated, but each chassis still dissipates ~900W of GPU heat (plus CPU and platform overhead). Plan datacenter or large server-room placement.

Reliability

Ray Serve is the production-grade orchestration layer; node failure is handled gracefully via replica re-routing. Network reliability is the dominant failure mode; 10 GbE minimum, 25 GbE recommended for production.

Recommended OS

Ubuntu 22.04 LTS or 24.04 LTS on all nodes; Docker / Kubernetes deployment recommended.

Operator warning — failure modes

Failure modes specific to Ray Serve multi-node distributed

  1. Network partition during request. A partition or flapping link mid-stream can hang or fail in-flight requests (TCP absorbs individual packet loss, but not a dead path). Ray Serve handles this via retries, but tail latency suffers.
  2. Replica configuration drift. Different nodes running slightly different vLLM versions / model weights produce inconsistent responses. Strict version pinning + automated deployment is mandatory.
  3. Autoscaling thrashing. Aggressive scale-up + scale-down policies can produce constant replica churn, killing GPU memory load efficiency. Tune scale-up / scale-down thresholds carefully.
  4. Head-of-line blocking on single replica. A slow request on one replica can back up its queue while others sit idle. Ray Serve's load balancing helps but doesn't eliminate this.
  5. Cross-replica KV cache misses. Each replica maintains its own KV cache; sticky-routing improves cache hit rate but adds head-of-line risk. RadixAttention via SGLang helps but requires per-replica configuration.
  6. Storage bottleneck on cold start. All replicas pulling the same model weights from object storage on cold start can saturate the storage backend. Pre-warm replicas before traffic surges.
Closest alternative

vLLM Tensor Parallel H100 Workstation →

Different problems: Ray Serve scales request throughput via replicas; H100 TP-4 scales single-stream model size + latency. Pick by whether you need many users or one big model.

Featured in stack

Distributed inference homelab →

Ray Serve replica orchestration recipe — multi-node aggregate throughput pattern.

Benchmark opportunities

Pending measurement targets for this combo. These are estimates, not measurements — actual benchmarks land in the catalog when run.

Ray Serve 4-node × 2× 4090 + Qwen 3 32B (concurrency scan)

qwen-3-32b
pending
Estimate: 30-45 tok/s per stream × 4-32 concurrent

Ray Serve replica orchestration. Each replica runs vLLM tensor-parallel-2; 4 replicas = 4 parallel serving paths. Measure aggregate throughput vs concurrency scan.
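The aggregate number that scan would report is simple arithmetic over the article's estimated per-stream range (30-45 tok/s, which are estimates, not measurements):

```python
# Aggregate decode throughput for the pending concurrency scan.
# Per-stream figures (30-45 tok/s) are the article's estimates.

def aggregate_tok_s(per_stream_tok_s: float, concurrency: int) -> float:
    """Aggregate throughput across all concurrent streams. Assumes the
    per-stream figure already reflects batching effects at this concurrency
    (in practice per-stream speed degrades as the batch grows)."""
    return per_stream_tok_s * concurrency

# Scan over the article's concurrency range, pessimistic 30 tok/s per stream:
scan = {c: aggregate_tok_s(30.0, c) for c in (4, 8, 16, 32)}
# At 32 concurrent streams that is ~960 tok/s aggregate across the 4 replicas.
```

The interesting measurement is where the real curve bends away from this linear sketch — that knee is the cluster's effective concurrency ceiling.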

Going deeper

  • All hardware combinations — browse other multi-GPU and multi-machine setups.
  • Running local AI on multiple GPUs in 2026 — the flagship buying / deployment guide.
  • Distributed inference systems — architectural depth.