What runs on Ray Serve multi-node distributed inference (4 nodes × 2× RTX 4090)?
Distributed serving across 4 machines, each with 2× RTX 4090. Ray Serve orchestrates replicas. 192 GB total VRAM, ~45 GB effective per replica. Built for high-concurrency request routing, not single-large-model deployment.
Ray Serve replica orchestration recipe — multi-node aggregate throughput pattern.
Multi-node Ray Serve clusters do NOT pool VRAM across machines for a single model. Each node hosts its own replica; on a dual-4090 node, that replica is a tensor-parallel-2 group with one rank per GPU. Effective VRAM for a single model is therefore the per-replica capacity (~45 GB), not the cluster total. The 192 GB total is meaningful only for **aggregate throughput**: 4 replicas serve 4× the requests, not 4× the model size. This is the point that prosumer multi-machine deployments most often misunderstand. If your goal is to run a 200B model that doesn't fit on one machine, Ray Serve is the wrong tool; you want SGLang distributed or an Exo-style layer split. Ray Serve's value is replica orchestration, autoscaling, and request routing.
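The arithmetic behind that distinction can be sketched in a few lines. This is a minimal illustration, not part of any Ray Serve API: the function names are hypothetical, the 24 GB per-GPU figure is the RTX 4090's nominal VRAM, and the ~45 GB effective capacity is simply the figure quoted above, passed in rather than derived.

```python
# Illustrative sketch: names are hypothetical, figures come from the text above.
NODES = 4
GPUS_PER_NODE = 2
VRAM_PER_GPU_GB = 24            # nominal RTX 4090 VRAM
EFFECTIVE_PER_REPLICA_GB = 45   # effective per-replica capacity quoted above


def cluster_summary(nodes=NODES, gpus_per_node=GPUS_PER_NODE,
                    vram_per_gpu_gb=VRAM_PER_GPU_GB):
    """Return raw VRAM per replica (one replica per node) and the cluster total."""
    raw_per_replica = gpus_per_node * vram_per_gpu_gb
    return {
        "replicas": nodes,
        "raw_per_replica_gb": raw_per_replica,
        "cluster_total_gb": nodes * raw_per_replica,
    }


def fits_single_replica(weights_gb, effective_gb=EFFECTIVE_PER_REPLICA_GB):
    """A model must fit inside ONE replica; the cluster total is irrelevant here."""
    return weights_gb <= effective_gb


summary = cluster_summary()
print(summary)
print(fits_single_replica(40))    # fits within one replica
print(fits_single_replica(120))   # does not fit, even though 120 < cluster total
```

The second check is the one that matters: a hypothetical 120 GB model is smaller than the 192 GB cluster total but still cannot be served, because no single replica can hold it.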
Multi-node deployment. Each replica holds a full copy of the model, so aggregate throughput scales but single-model size is capped by per-replica capacity. Effective single-replica VRAM ~45 GB.
See the multi-GPU guide for topology tradeoffs, and the RunLocalAI Will-It-Run Framework for the citable fit-tier method.
Topology
- 8× RTX 4090 (4 nodes × 2 GPUs)
Models that fit comfortably (24)
Effective VRAM utilization ≤ 85% at the smallest production quant. Comfortable headroom for KV cache.
Borderline (7)
Fits but with little headroom. KV cache for long context may not fit; verify before deployment.
Effective VRAM utilization >110% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.
Effective VRAM utilization >105% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.
Effective VRAM utilization >100% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.
Effective VRAM utilization >93% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.
Combination fits but with little headroom. Verify KV cache budget for your target context window before committing.
Not practical (8)
Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly. Drop to a smaller quant or move to a larger combo.
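The three tiers above can be expressed as a small classifier. This is a sketch under stated assumptions, not the RunLocalAI framework's actual method: it assumes "utilization" means weights plus KV cache divided by effective per-replica VRAM, and the function name and example sizes are hypothetical. Only the 85% comfortable threshold and the "weights exceed effective VRAM" rule come from the text.

```python
def fit_tier(weights_gb, kv_cache_gb, effective_vram_gb):
    """Classify a model against one replica's effective VRAM.

    Assumption: utilization = (weights + KV cache) / effective VRAM.
    """
    # Weights alone do not fit: no context cap can rescue this.
    if weights_gb > effective_vram_gb:
        return "not practical"

    utilization = (weights_gb + kv_cache_gb) / effective_vram_gb

    # 85% or less leaves comfortable headroom for the KV cache.
    if utilization <= 0.85:
        return "comfortable"

    # Weights fit, but the working set is tight or over budget:
    # cap context or verify the KV cache budget before deploying.
    return "borderline"


# Hypothetical model sizes against a 45 GB effective replica.
print(fit_tier(20, 4, 45))    # comfortable
print(fit_tier(40, 10, 45))   # borderline (working set above 100%)
print(fit_tier(60, 4, 45))    # not practical
```

Under this reading, a borderline entry can show utilization above 100%: the weights fit, but the KV cache for a long context does not, which is why capping context is the suggested remedy.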
Benchmark opportunities
Estimates, not measurements. Pending benchmark targets for this combo. Once measured, results land in the catalog as benchmarks.
Ray Serve replica orchestration. Each replica runs vLLM tensor-parallel-2; 4 replicas = 4 parallel serving paths. Measure aggregate throughput across a sweep of concurrency levels.
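A concurrency sweep of the kind described can be sketched with the standard library alone. This is a minimal harness under assumptions: `fake_request` is a stand-in that only sleeps, so the numbers it prints say nothing about real hardware. To benchmark an actual deployment you would replace it with a call to your serving endpoint; all function names here are hypothetical.

```python
import asyncio
import time


async def fake_request(latency_s=0.01):
    """Placeholder for one inference request; swap in a real client call."""
    await asyncio.sleep(latency_s)


async def run_level(concurrency, total_requests, request_fn):
    """Send total_requests with at most `concurrency` in flight; return req/s."""
    sem = asyncio.Semaphore(concurrency)

    async def one():
        async with sem:
            await request_fn()

    start = time.perf_counter()
    await asyncio.gather(*(one() for _ in range(total_requests)))
    elapsed = time.perf_counter() - start
    return total_requests / elapsed


async def sweep(levels, total_requests=32, request_fn=fake_request):
    """Measure throughput at each concurrency level, one level at a time."""
    results = {}
    for level in levels:
        results[level] = await run_level(level, total_requests, request_fn)
    return results


results = asyncio.run(sweep([1, 4, 8]))
for level, rps in results.items():
    print(f"concurrency={level:>2}  throughput={rps:.1f} req/s")
```

Against a real cluster, the level at which throughput stops rising indicates where the replicas saturate.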
Going deeper
- Full combo detail page — operational review with failure modes and runtime matrix.
- Multi-GPU buying guide — when multi-GPU is worth it and when it isn't.
- RunLocalAI Will-It-Run Framework — citable effective-VRAM, working-set, fit-tier, and evidence-tier method.
- Will-it-run home — single-card check + custom builds.