RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts: there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend.

© 2026 runlocalai.co · Independently operated
Distributed · 10 GbE · expert

What runs on Ray Serve multi-node distributed inference (4 nodes × 2× RTX 4090)?

Distributed serving across 4 machines, each with 2× RTX 4090. Ray Serve orchestrates replicas. 192 GB total / ~80 GB per replica. Built for high-concurrency request routing, not single-large-model deployment.

At a glance

Effective VRAM: 80 / 192 GB (not pooled)
Speed penalty: ~25% vs ideal single-card
Recommended runtime: ray-serve (request routing)
Setup difficulty: expert (~3600W peak)
Models fit: 24
Borderline: 7
Not practical: 8
Deployment recipe
Distributed inference homelab →

Ray Serve replica orchestration recipe — multi-node aggregate throughput pattern.

Memory budget

Total VRAM: 192 GB
Effective for inference: 80 GB (42% of total, not pooled)

Multi-node Ray Serve clusters do NOT pool VRAM across machines for a single model. Each node hosts its own replica (or, on dual-4090 nodes, each card hosts one rank of a tensor-parallel-2 group). Effective VRAM for a single model is therefore the per-replica capacity (~45 GB), not the cluster total. The 192 GB total is meaningful only for aggregate throughput: 4 replicas serve 4× the requests, not 4× the model size.

This is the pattern that prosumer multi-machine deployments most often misunderstand. If your goal is to run a 200B model that doesn't fit on one machine, Ray Serve is the wrong tool; you want SGLang distributed or an Exo-style layer split. Ray Serve's value is replica orchestration, autoscaling, and request routing.

Why total VRAM is not the whole story

Multi-node deployment. Each replica holds a full copy of the model — aggregate throughput scales, but single-model size is capped by per-replica capacity. Effective single-replica VRAM ~80 GB.
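The budget arithmetic behind these figures can be sketched in a few lines. This is an illustration only: the function names are made up for this example, the 24 GB per RTX 4090 is the card's published VRAM, and the 80 GB effective figure is simply the number this page reports, not something the code derives.

```python
# Illustrative arithmetic only; names are hypothetical, not part of any tool.
GPU_VRAM_GB = 24          # RTX 4090
NODES = 4
GPUS_PER_NODE = 2
PAGE_EFFECTIVE_GB = 80    # effective figure as reported on this page


def total_vram_gb(nodes: int = NODES, gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Sum of VRAM across every card in the cluster."""
    return nodes * gpus_per_node * GPU_VRAM_GB


def effective_share_pct(effective_gb: float, total_gb: float) -> float:
    """Effective single-model VRAM as a percentage of the cluster total."""
    return 100.0 * effective_gb / total_gb


def aggregate_tok_s(per_replica_tok_s: float, replicas: int) -> float:
    """Throughput scales with replica count (ignoring routing overhead)."""
    return per_replica_tok_s * replicas


def max_model_gb(per_replica_gb: float, replicas: int) -> float:
    """Model size is capped by one replica; adding replicas does not raise it."""
    return per_replica_gb


if __name__ == "__main__":
    total = total_vram_gb()
    print(total, round(effective_share_pct(PAGE_EFFECTIVE_GB, total)))
```

Adding replicas changes the result of `aggregate_tok_s` but never the result of `max_model_gb`, which is the whole point of the "not pooled" label.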

See the multi-GPU guide for topology tradeoffs, and the RunLocalAI Will-It-Run Framework for the citable fit-tier method.

Topology

Topology: distributed
Interconnect: ethernet-10g (~10 Gb/s, roughly 1.25 GB/s)
Component count: 8 units
Components: 8× rtx-4090
Recommended runtime: ray-serve (also: vllm, sglang)
Recommended split strategy: request-routing (also: tensor-parallel)
Setup difficulty: expert (~3600W peak)
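As a rough illustration of the request-routing layout, the sketch below declares four Ray Serve replicas that each reserve two GPUs. It is a minimal sketch under stated assumptions, not this site's deployment recipe: `REPLICA_OPTIONS`, `gpus_reserved`, `build_app`, and `ModelReplica` are hypothetical names, the engine is a placeholder, and the Ray import is kept lazy so the arithmetic can be checked without Ray installed.

```python
# Hypothetical replica layout for this combo: 4 replicas, 2 GPUs each.
REPLICA_OPTIONS = {
    "num_replicas": 4,
    "ray_actor_options": {"num_gpus": 2},
}


def gpus_reserved(options: dict = REPLICA_OPTIONS) -> int:
    """GPUs the cluster must provide for this layout."""
    return options["num_replicas"] * options["ray_actor_options"]["num_gpus"]


def build_app(options: dict = REPLICA_OPTIONS):
    """Build a Ray Serve application. Requires the `ray[serve]` package."""
    from ray import serve  # lazy import: only needed when actually deploying

    @serve.deployment(**options)
    class ModelReplica:
        def __init__(self):
            # Placeholder: load an inference engine here, for example one
            # configured for tensor parallelism across the replica's 2 GPUs.
            self.engine = None

        async def __call__(self, request):
            # Ray Serve passes a Starlette request for HTTP traffic.
            payload = await request.json()
            # Placeholder response; a real replica would call the engine.
            return {"echo": payload.get("prompt", "")}

    return ModelReplica.bind()
```

From a driver connected to the cluster, `serve.run(build_app())` would start the application; Ray Serve then spreads incoming requests across the replicas. Each replica still has to hold the whole model within its own two cards.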

Models that fit comfortably (24)

Effective VRAM utilization ≤ 85% at the smallest production quant. Comfortable headroom for KV cache.

Llama 3.2 90B Vision Instruct
Fits
90B·Q4_K_M → 60 GB·75% of effective VRAM·~25% speed penalty vs ideal
Llama 3.2 90B Vision
Fits
90B·AWQ-INT4 → 64 GB·80% of effective VRAM·~25% speed penalty vs ideal
InternVL 2.5 78B
Fits
78B·Q4_K_M → 52 GB·65% of effective VRAM·~25% speed penalty vs ideal
Molmo 72B
Fits
72B·Q4_K_M → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal
Qwen 2.5 Math 72B
Fits
72B·Q4_K_M → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal
Qwen 2.5 72B Instruct
Fits
72B·Q4_K_M → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal
Qwen 2.5-VL 72B
Fits
72B·AWQ-INT4 → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal
Llama 4 70B
Fits
70B·AWQ-INT4 → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal
Tulu 3 70B
Fits
70B·Q4_K_M → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal
Dolphin 3 Llama 3.3 70B
Fits
70B·AWQ-INT4 → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal
EVA Llama 3.3 70B
Fits
70B·AWQ-INT4 → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal
OpenBioLLM Llama 3 70B
Fits
70B·Q4_K_M → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal
Llama 3.1 70B Instruct
Fits
70B·Q4_K_M → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal
DeepSeek R1 Distill Llama 70B
Fits
70B·Q4_K_M → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal
Hermes 4 70B FP8
Fits
70B·Q4_K_M → 49 GB·61% of effective VRAM·~25% speed penalty vs ideal
Hermes 3 Llama 3.1 70B
Fits
70B·Q4_K_M → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal
Llama 3.1 Nemotron 70B Instruct
Fits
70B·Q4_K_M → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal
Hermes 4 Llama 3.3 70B
Fits
70B·AWQ-INT4 → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal
Llama 3.3 70B Instruct
Fits
70B·Q4_K_M → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal
Jamba 1.5 Mini
Fits
52B·Q4_K_M → 36 GB·45% of effective VRAM·~25% speed penalty vs ideal
Nemotron 3 Super 49B
Fits
49B·AWQ-INT4 → 32 GB·40% of effective VRAM·~25% speed penalty vs ideal
Mixtral 8x7B Instruct
Fits
47B·Q4_K_M → 32 GB·40% of effective VRAM·~25% speed penalty vs ideal
Mixtral 8X7B Instruct v0.1 GPTQ
Fits
46.7B·Q4_K_M → 33 GB·41% of effective VRAM·~25% speed penalty vs ideal
Falcon 40B Instruct
Fits
40B·Q4_K_M → 28 GB·35% of effective VRAM·~25% speed penalty vs ideal

Borderline (7)

Little or no headroom at the listed quant. KV cache for long context may not fit; verify before deployment.

Mistral Large 2 (123B)
Borderline
123B·Q4_K_M → 88 GB·110% of effective VRAM·~25% speed penalty vs ideal

Effective VRAM utilization is ~110%; KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Nemotron 3 Super (120B-A12B)
Borderline
120B·Q4_K_M → 84 GB·105% of effective VRAM·~25% speed penalty vs ideal

Effective VRAM utilization is ~105%; KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Llama 4 Scout
Borderline
109B·Q4_K_M → 80 GB·100% of effective VRAM·~25% speed penalty vs ideal

Effective VRAM utilization is ~100%; KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Sarvam 105B
Borderline
105B·Q4_K_M → 74 GB·93% of effective VRAM·~25% speed penalty vs ideal

Effective VRAM utilization is ~93%; KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Sarvam 105B FP8
Borderline
105B·Q4_K_M → 74 GB·93% of effective VRAM·~25% speed penalty vs ideal

Effective VRAM utilization is ~93%; KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Command R+ (Aug 2024)
Borderline
104B·AWQ-INT4 → 72 GB·90% of effective VRAM·~25% speed penalty vs ideal

Combination fits but with little headroom. Verify KV cache budget for your target context window before committing.

Command R+ 104B
Borderline
104B·Q4_K_M → 70 GB·88% of effective VRAM·~25% speed penalty vs ideal

Combination fits but with little headroom. Verify KV cache budget for your target context window before committing.
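One way to check a KV cache budget before committing is the standard per-token estimate: two tensors (key and value) per layer, each of size KV heads × head dimension. The sketch below assumes an fp16 cache, and the architecture values in the usage note are illustrative only; they are not taken from any model listed on this page, so substitute the values from your model's config.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size for `tokens` cached positions.

    The leading 2 accounts for one key and one value tensor per layer.
    bytes_per_elem=2 assumes an fp16/bf16 cache (an assumption).
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_elem: int = 2) -> float:
    """Same estimate expressed in GiB."""
    return kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem) / 2**30


def headroom_gb(effective_gb: float, weights_gb: float) -> float:
    """VRAM left after weights; KV cache and activations must fit here."""
    return effective_gb - weights_gb


if __name__ == "__main__":
    # Illustrative architecture values, not a specific catalog entry.
    print(kv_cache_gib(8192, layers=80, kv_heads=8, head_dim=128))
    # Headroom for the 70 GB borderline entry against the 80 GB effective figure.
    print(headroom_gb(80, 70))
```

With those illustrative values, one 8K-token sequence needs about 2.5 GiB of cache, so a 10 GB headroom covers only a handful of concurrent long-context requests before runtime overhead is counted.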

Not practical (8)

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly. Drop to a smaller quant or move to a larger combo.

DeepSeek V4 Pro (1.6T MoE)
Not practical
1600B·Q4_K_M → 1024 GB·1280% of effective VRAM·~25% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

Step-3
Not practical
1000B·AWQ-INT4 → 640 GB·800% of effective VRAM·~25% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

Kimi K2.6
Not practical
1000B·Q4_K_M → 700 GB·875% of effective VRAM·~25% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

DeepSeek V4
Not practical
745B·AWQ-INT4 → 480 GB·600% of effective VRAM·~25% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

Mistral Medium 3.5 (675B MoE)
Not practical
675B·Q4_K_M → 448 GB·560% of effective VRAM·~25% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

DeepSeek R1 (671B reasoning)
Not practical
671B·Q4_K_M → 420 GB·525% of effective VRAM·~25% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

DeepSeek V3 (671B MoE)
Not practical
671B·Q4_K_M → 420 GB·525% of effective VRAM·~25% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

Llama 4 405B
Not practical
405B·AWQ-INT4 → 280 GB·350% of effective VRAM·~25% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.
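The three tiers above can be approximated with a small classifier. The 85% ceiling for "Fits" is the threshold this page states. The 110% ceiling for "Borderline" is an assumption inferred from the highest borderline entry listed here, not a documented rule, and the function names are hypothetical.

```python
def utilization_pct(weights_gb: float, effective_gb: float = 80.0) -> float:
    """Model weights as a percentage of effective (per-replica) VRAM."""
    return 100.0 * weights_gb / effective_gb


def fit_tier(weights_gb: float, effective_gb: float = 80.0,
             fits_max_pct: float = 85.0,
             borderline_max_pct: float = 110.0) -> str:
    """Classify a model into a fit tier.

    fits_max_pct comes from the page; borderline_max_pct is an assumed
    cutoff chosen to match the entries shown, not a published threshold.
    """
    pct = utilization_pct(weights_gb, effective_gb)
    if pct <= fits_max_pct:
        return "Fits"
    if pct <= borderline_max_pct:
        return "Borderline"
    return "Not practical"


if __name__ == "__main__":
    for name, gb in [("90B Q4_K_M", 60), ("104B Q4_K_M", 70), ("405B AWQ-INT4", 280)]:
        print(name, round(utilization_pct(gb)), fit_tier(gb))
```

The catalog's published framework may use additional inputs (working set, evidence tier), so treat this as a reading aid for the percentages on this page rather than a reimplementation.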

Benchmark opportunities

estimates, not measurements

Pending benchmark targets for this combo. Once measured, results land in the catalog as benchmarks.

Ray Serve 4-node × 2× 4090 + Qwen 3 32B (concurrency scan)
pending
Estimate: 30-45 tok/s per stream × 4-32 concurrent

Ray Serve replica orchestration. Each replica runs vLLM tensor-parallel-2; 4 replicas = 4 parallel serving paths. Measure aggregate throughput at each level of a concurrency sweep.
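A concurrency scan of this kind could be structured as below. This is a sketch with a stubbed backend: `fake_stream` only sleeps to imitate a streaming request and would need to be replaced by a real client call against the cluster, and the token counts and rates are placeholders rather than measurements.

```python
import asyncio
import time


async def fake_stream(tokens: int, tok_per_s: float) -> int:
    """Stand-in for one streaming request; swap in a real client call."""
    await asyncio.sleep(tokens / tok_per_s)
    return tokens


async def run_level(concurrency: int, tokens: int = 20,
                    tok_per_s: float = 2000.0) -> dict:
    """Fire `concurrency` requests at once and time the whole batch."""
    start = time.perf_counter()
    counts = await asyncio.gather(
        *(fake_stream(tokens, tok_per_s) for _ in range(concurrency))
    )
    elapsed = time.perf_counter() - start
    total = sum(counts)
    return {
        "concurrency": concurrency,
        "total_tokens": total,
        "aggregate_tok_s": total / elapsed,
    }


def sweep(levels=(4, 8, 16, 32)) -> list:
    """Run each concurrency level in turn and collect the results."""
    return [asyncio.run(run_level(c)) for c in levels]


if __name__ == "__main__":
    for row in sweep():
        print(row)
```

Against a real deployment, plotting `aggregate_tok_s` against `concurrency` shows where the four replicas saturate; per-stream speed is the aggregate divided by the concurrency level.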

Going deeper

  • Full combo detail page — operational review with failure modes and runtime matrix.
  • Multi-GPU buying guide — when multi-GPU is worth it and when it isn't.
  • RunLocalAI Will-It-Run Framework — citable effective-VRAM, working-set, fit-tier, and evidence-tier method.
  • Will-it-run home — single-card check + custom builds.