Distributed10 GbEexpert

What runs on Ray Serve multi-node distributed inference (4 nodes × 2× RTX 4090)?

Distributed serving across 4 machines, each with 2× RTX 4090. Ray Serve orchestrates replicas. 192 GB total / ~80 GB per replica. Built for high-concurrency request routing, not single-large-model deployment.

At a glance

Effective VRAM

80 / 192 GB

Not pooled

Speed penalty

~25%

vs ideal single-card

Recommended runtime

ray-serve

request routing

Setup difficulty

expert

~3600W peak

Models fit

Borderline

Not practical

Deployment recipe

Distributed inference homelab →

Ray Serve replica orchestration recipe — multi-node aggregate throughput pattern.

Memory budget

Total VRAM

192 GB

Effective for inference

80 GB

42% of total

Not pooled

Multi-node Ray Serve clusters do NOT pool VRAM across machines for a single model. Each node hosts its own replica (or tensor-parallel rank within a tensor-parallel-2 group on dual-4090 nodes). Effective VRAM 'for a single model' is the per-replica capacity (~45 GB), not the cluster total. The 192 GB total is meaningful only for **aggregate throughput** — 4 replicas serving 4× the requests, not 4× the model size. This is the pattern that prosumer multi-machine deployments most often misunderstand. If your goal is 'run a 200B model that doesn't fit on one machine,' Ray Serve is the wrong tool — you want SGLang distributed or Exo-style layer split. Ray Serve's value is replica orchestration, autoscaling, and request routing.

Why total VRAM is not the whole story

Multi-node deployment. Each replica holds a full copy of the model — aggregate throughput scales, but single-model size is capped by per-replica capacity. Effective single-replica VRAM ~80 GB.

See the multi-GPU guide for topology tradeoffs, and the RunLocalAI Will-It-Run Framework for the citable fit-tier method.

Topology

distributed

Interconnect

ethernet-10g~10 GB/s

Component count

8 units

Components

8×rtx-4090

Recommended runtime

ray-serve

Also: vllm, sglang

Recommended split strategy

request-routing

Also: tensor-parallel

Setup difficulty

expert

~3600W peak

Models that fit comfortably (24)

Effective VRAM utilization ≤ 85% at the smallest production quant. Comfortable headroom for KV cache.

Llama 3.2 90B Vision Instruct

Fits

90B·Q4_K_M → 60 GB·75% of effective VRAM·~25% speed penalty vs ideal

Llama 3.2 90B Vision

Fits

90B·AWQ-INT4 → 64 GB·80% of effective VRAM·~25% speed penalty vs ideal

InternVL 2.5 78B

Fits

78B·Q4_K_M → 52 GB·65% of effective VRAM·~25% speed penalty vs ideal

Molmo 72B

Fits

72B·Q4_K_M → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal

Qwen 2.5 Math 72B

Fits

72B·Q4_K_M → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal

Qwen 2.5 72B Instruct

Fits

72B·Q4_K_M → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal

Qwen 2.5-VL 72B

Fits

72B·AWQ-INT4 → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal

Llama 4 70B

Fits

70B·AWQ-INT4 → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal

Tulu 3 70B

Fits

70B·Q4_K_M → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal

Dolphin 3 Llama 3.3 70B

Fits

70B·AWQ-INT4 → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal

EVA Llama 3.3 70B

Fits

70B·AWQ-INT4 → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal

OpenBioLLM Llama 3 70B

Fits

70B·Q4_K_M → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal

Llama 3.1 70B Instruct

Fits

70B·Q4_K_M → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal

DeepSeek R1 Distill Llama 70B

Fits

70B·Q4_K_M → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal

Hermes 4 70B FP8

Fits

70B·Q4_K_M → 49 GB·61% of effective VRAM·~25% speed penalty vs ideal

Hermes 3 Llama 3.1 70B

Fits

70B·Q4_K_M → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal

Llama 3.1 Nemotron 70B Instruct

Fits

70B·Q4_K_M → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal

Hermes 4 Llama 3.3 70B

Fits

70B·AWQ-INT4 → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal

Llama 3.3 70B Instruct

Fits

70B·Q4_K_M → 48 GB·60% of effective VRAM·~25% speed penalty vs ideal

Jamba 1.5 Mini

Fits

52B·Q4_K_M → 36 GB·45% of effective VRAM·~25% speed penalty vs ideal

Nemotron 3 Super 49B

Fits

49B·AWQ-INT4 → 32 GB·40% of effective VRAM·~25% speed penalty vs ideal

Mixtral 8x7B Instruct

Fits

47B·Q4_K_M → 32 GB·40% of effective VRAM·~25% speed penalty vs ideal

Mixtral 8X7B Instruct v0.1 GPTQ

Fits

46.7B·Q4_K_M → 33 GB·41% of effective VRAM·~25% speed penalty vs ideal

Falcon 40B Instruct

Fits

40B·Q4_K_M → 28 GB·35% of effective VRAM·~25% speed penalty vs ideal

Borderline (7)

Fits but with little headroom. KV cache for long context may not fit; verify before deployment.

Mistral Large 2 (123B)

Borderline

123B·Q4_K_M → 88 GB·110% of effective VRAM·~25% speed penalty vs ideal

Effective VRAM utilization >110% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Nemotron 3 Super (120B-A12B)

Borderline

120B·Q4_K_M → 84 GB·105% of effective VRAM·~25% speed penalty vs ideal

Effective VRAM utilization >105% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Llama 4 Scout

Borderline

109B·Q4_K_M → 80 GB·100% of effective VRAM·~25% speed penalty vs ideal

Effective VRAM utilization >100% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Sarvam 105B

Borderline

105B·Q4_K_M → 74 GB·93% of effective VRAM·~25% speed penalty vs ideal

Effective VRAM utilization >93% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Sarvam 105B FP8

Borderline

105B·Q4_K_M → 74 GB·93% of effective VRAM·~25% speed penalty vs ideal

Effective VRAM utilization >93% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Command R+ (Aug 2024)

Borderline

104B·AWQ-INT4 → 72 GB·90% of effective VRAM·~25% speed penalty vs ideal

Combination fits but with little headroom. Verify KV cache budget for your target context window before committing.

Command R+ 104B

Borderline

104B·Q4_K_M → 70 GB·88% of effective VRAM·~25% speed penalty vs ideal

Combination fits but with little headroom. Verify KV cache budget for your target context window before committing.

Not practical (8)

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly. Drop to a smaller quant or move to a larger combo.

DeepSeek V4 Pro (1.6T MoE)

Not practical

1600B·Q4_K_M → 1024 GB·1280% of effective VRAM·~25% speed penalty vs ideal