
Quad RTX 3090 workstation stack — the prosumer 100B-class ceiling

4 used RTX 3090s in an open-frame chassis. 96 GB total / ~88 GB effective. vLLM tensor-parallel-4 for 100B-class dense or 50-100B MoE models. The prosumer ceiling before paying datacenter prices.

By Fredoline Eruo · Last reviewed 2026-05-07

What this stack accomplishes

Quad RTX 3090 is the prosumer ceiling for local AI. Beyond this, you're paying datacenter prices (H100 / A100 cluster) or distributing across multiple machines. The use case sweet spot:

  • 100B-class dense models that don't fit in 48 GB
  • Large MoE models (Mixtral 8x22B, DeepSeek MoE)
  • 70B serving at high concurrency (16+ users)
  • Long-context (32K-128K) inference where KV cache budget matters

What it's NOT for: single-user latency (dual-3090 is faster per-stream), production at scale (datacenter wins on reliability per dollar), or anyone whose answer to “where will this live?” is “my desk.”
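
A rough sizing check behind the fit claims above, assuming 4-bit (AWQ/GPTQ) weights at ~0.55 bytes per parameter; real checkpoints vary by a few GB:

  • 70B dense ≈ 38-42 GB of weights (fits dual-3090; quad adds KV-cache headroom)
  • 100B-class dense ≈ 55-60 GB (needs quad)
  • Mixtral 8x22B, 141B total ≈ 73-80 GB (needs quad)

All measured against the ~88 GB effective budget across four cards.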

Hardware required

4× RTX 3090 24GB (used) · NVLink 3-slot bridges (paired: cards 0-1 and 2-3) · WRX80 / W790 workstation motherboard with 4× PCIe 4.0 x16 slots · 1600W Platinum PSU OR dual 1000W PSUs synced via Add2PSU · 128GB ECC RAM (DDR4 on WRX80, DDR5 on W790) · 2TB NVMe · Open-frame mining rig OR 4U server chassis · Inter-card cooling fans · Ubuntu 24.04 LTS

Components — what to install and why

The stack
  1. rtx-3090 · Hardware: GPUs (4× 24GB used; the prosumer-ceiling stack)

    Quad 3090 is the largest practical multi-GPU setup before datacenter pricing kicks in. Used 3090 economics: $600-900 each → $2,400-3,600 total for the GPUs vs $20,000+ for a single H100 80GB used.

  2. vllm · Tool: Inference engine (tensor-parallel-4)

    vLLM --tensor-parallel-size 4 is the canonical configuration for quad-GPU. NVLink connects the paired cards (0-1 and 2-3); cross-pair traffic goes over PCIe. For the lowest single-stream latency, run 2× tensor-parallel-2 replicas instead — counter-intuitive but the cross-pair PCIe overhead at TP-4 dominates.

  3. ray-serve · Tool: Orchestration layer for the 2×TP-2 replica pattern

    Ray Serve over quad-3090: split into 2 replicas of tensor-parallel-2 each. Higher single-stream throughput than 1×TP-4; better aggregate concurrency. Use when serving 8+ users.

  4. mixtral-8x22b-instruct · Model: Large MoE model (39B active, 141B total)

    Mixtral 8x22B at AWQ-INT4 fits across 88 GB effective with comfortable headroom. Expert routing across 4 cards is bandwidth-friendlier than dense tensor-parallel — the penalty from the cross-pair links that lack NVLink shrinks for MoE.

  5. deepseek-r1-distill-llama-70b · Model: 70B reasoning model with extreme context

    70B at AWQ-INT4 (~40 GB) fits with massive headroom on quad-3090, freeing ~48 GB for KV cache — supports 64K+ context for long-document reasoning workloads.

  6. qwen-2.5-coder-32b-instruct · Model: Coding-agent serving for 16+ concurrent users

    32B-class coding model on quad-3090 leaves enormous headroom — vLLM continuous batching serves 16+ concurrent coding-agent loops at ~30 tok/s each. The team-tier coding-serve config.

  7. openwebui · Tool: Multi-user team chat frontend

    Open WebUI handles authentication + multi-user chat against the vLLM endpoint. Standard pairing for vLLM-backed serving at team scale.
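
For reference, a minimal way to wire Open WebUI to the vLLM endpoint, assuming Docker on the same host (the port mapping, volume name, and placeholder key are illustrative):

# Open WebUI pointed at the local vLLM OpenAI-compatible endpoint
docker run -d --name open-webui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=unused \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
# The key is a placeholder; vLLM does not check it unless started with --api-key.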

Step-by-step setup

Assumes Ubuntu 24.04 LTS on a WRX80 or W790 workstation motherboard.

1. NVIDIA driver + CUDA toolkit

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-6 nvidia-driver-560
sudo reboot

# Verify all 4 cards detected
nvidia-smi --query-gpu=index,name,memory.total,pci.bus_id --format=csv
# Expected: 4 lines, RTX 3090 24GB each, distinct bus IDs

# Verify PCIe lane allocation (workstation board should be x16/x16/x16/x16)
for dev in $(lspci | awk '/VGA.*NVIDIA/ {print $1}'); do
  sudo lspci -s "$dev" -vv | grep -i "lnksta:"
done

2. Verify NVLink pair status

nvidia-smi nvlink --status
# Expected: each paired card (0-1 and 2-3) lists its NVLink links as active
# (the 3090 bridge provides 4 links at ~14 GB/s each). Cross-pair traffic
# (0↔2, 0↔3, 1↔2, 1↔3) has no NVLink and goes over PCIe.

# Run the pair-bandwidth check (nvbandwidth is not packaged; build it from
# https://github.com/NVIDIA/nvbandwidth)
nvbandwidth -t device_to_device_bidirectional_memcpy_ce
# Expected: paired cards report ~100+ GB/s; cross-pair ~32 GB/s (PCIe)

3. Install vLLM and serve Mixtral 8x22B

python3 -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate
pip install --upgrade pip
pip install vllm autoawq

# Pull model
huggingface-cli download casperhansen/mixtral-8x22b-instruct-v0.1-awq

# Serve with tensor-parallel-4
vllm serve casperhansen/mixtral-8x22b-instruct-v0.1-awq \
  --tensor-parallel-size 4 \
  --max-model-len 16384 \
  --quantization awq \
  --gpu-memory-utilization 0.90 \
  --port 8000

Lower --gpu-memory-utilization to 0.90 (vs 0.92 for dual-card) — quad-card tensor-parallel has more cross-card communication overhead and benefits from the larger headroom for sync buffers.
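
A quick smoke test against the OpenAI-compatible API once the server is up (the model name must match the ID passed to vllm serve):

# Confirm the model is registered, then request a short chat completion
curl -s http://localhost:8000/v1/models

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "casperhansen/mixtral-8x22b-instruct-v0.1-awq",
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 32
      }'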

4. (Optional) Run 2× tensor-parallel-2 replicas via Ray Serve

# For lower single-stream latency and better aggregate throughput, split the four
# cards into two tensor-parallel-2 replicas. Replica count and tensor-parallel
# size are not `serve run` flags; they are set inside the Serve application
# (the vllm_serve module that defines `app`), then launched with:
ray start --head
serve run vllm_serve:app

Verifying NVLink + PCIe topology

Critical: most consumer chipsets cannot deliver x16 to all 4 slots. Verify with:

# Check link width on each GPU
for gpu in 0 1 2 3; do
  echo "GPU $gpu:"
  nvidia-smi -i $gpu --query-gpu=pcie.link.width.current --format=csv,noheader
done
# Expected on workstation board: 16, 16, 16, 16 (all x16)
# On consumer board: typical 8, 8, 8, 8 — costs ~15% throughput
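
The GPU topology matrix is a one-command cross-check of both link type and NUMA affinity:

nvidia-smi topo -m
# Paired cards (0-1, 2-3) should show an NV# entry; cross-pair entries show a
# PCIe path type (PIX/PXB/PHB/SYS). The NUMA Affinity column matters for the
# numactl pinning discussed under failure modes.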

Expected outcome

vLLM tensor-parallel-4 serving Mixtral 8x22B AWQ-INT4 at 25-35 tok/s decode (single stream); 100+ tok/s aggregate at 4 concurrent. Sustained power: 1300-1450W. Memory-junction temps stay below 105°C with active inter-card cooling. Operates 24/7 in a basement or server room only.
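
To keep an eye on those numbers during a sustained run (temperature.memory may read N/A on some driver versions; power.draw and temperature.gpu always report):

# Poll per-card power, core temp, memory-junction temp, and utilization every 5 s
watch -n 5 "nvidia-smi --query-gpu=index,power.draw,temperature.gpu,temperature.memory,utilization.gpu --format=csv"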

Power, thermal, and acoustic notes (read this first)

  • 1600W Platinum PSU minimum for sustained 1400W draw. Transient spikes can hit 1700-1800W; size for headroom or use dual 1000W PSUs synced via Add2PSU.
  • Open-frame chassis or 4U server chassis required. Standard ATX towers cannot dissipate 1400W of GPU heat — thermal throttling on inner cards is the typical result.
  • Inter-card cooling fans mandatory. Cards in slots 2-3 are sandwiched and reach memory-junction temps over 110°C without active cooling. Industrial 120mm fans between cards.
  • Acoustic envelope: 60-70 dBA at 1m. Loud. Plan basement / server-room placement; this is not desk-friendly.
  • Power cost: 1400W × 24h × 365 days = 12,264 kWh/year. At $0.13/kWh that's ~$1,600/year just for the GPUs at sustained 100% utilization.

Failure modes you'll hit

  1. PCIe lane allocation collapse on consumer boards. Most X670/B650 boards drop to x8/x8/x8/x4 with all 4 slots populated. Workstation boards (WRX80/W790) are required for x16/x16/x16/x16. Without it, tensor-parallel performance drops 15-25%.
  2. Power-supply transient overload. Four 3090s spike to 1700-1800W transient. A 1600W PSU sized exactly trips OCP. Use dual PSUs or 1800W+.
  3. NVLink-incompatible runtime configurations. vLLM auto-detects NVLink between paired cards. Some llama.cpp builds require explicit --tensor-split arguments to use it; getting the syntax wrong silently disables NVLink.
  4. NUMA placement on dual-socket builds. GPUs need to be on the same NUMA node as the inference process. Mismatches cost 30%+ throughput silently. Use numactl --cpunodebind=0 --membind=0 vllm serve ...; a concrete check follows this list.
  5. Memory-junction thermal throttling on inner cards. Slot-2 and slot-3 cards heat faster (sandwiched). nvidia-smi --query-gpu=temperature.memory reveals junction temps invisible in the standard output.
  6. Used-card thermal-paste degradation. 5-year-old 3090s benefit from a thermal-paste refresh. Plan for re-paste on all 4 before committing to production workloads.
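
A concrete check for failure mode 4 (the sed/tr step assumes nvidia-smi prints the 8-digit PCI domain that sysfs shortens to 4 digits; on single-socket boards every GPU reports node 0 or -1):

# Map each GPU's PCIe bus ID to its NUMA node
for bus in $(nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader); do
  sysbus=$(echo "$bus" | sed 's/^0000//' | tr 'A-F' 'a-f')
  echo "$bus -> NUMA node $(cat /sys/bus/pci/devices/$sysbus/numa_node)"
done

# Pin the server to the node the GPUs report (node 0 shown)
numactl --cpunodebind=0 --membind=0 \
  vllm serve casperhansen/mixtral-8x22b-instruct-v0.1-awq --tensor-parallel-size 4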

Troubleshooting

Symptom: tensor-parallel-4 is slower than expected. Check the actual PCIe link widths via lspci. If x8 instead of x16, your motherboard is bottlenecking. Workstation boards (WRX80/W790) deliver true x16 to all 4 slots.

Symptom: 2 cards visible, 4 plugged in. Power supply is undersized — the BIOS may be dropping cards that draw too much current at boot. Verify each card individually with a single-card config first.
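
Two quick checks separate a bus-level dropout from a driver-level one; Xid messages in dmesg are the usual tell for power-related dropouts:

# What the PCIe bus enumerates vs what the driver sees
lspci | grep -i nvidia
nvidia-smi -L

# Xid errors (e.g. "GPU has fallen off the bus") often point at power or PCIe signal issues
sudo dmesg | grep -i xid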

Symptom: random crashes after hours of inference. Memory-junction thermal throttling has crossed the failure threshold. Add inter-card fans; consider water-cooling the inner cards.

Variations and alternatives

Frontier-MoE variant: this stack does not fit Qwen 3.5 235B-A17B or DeepSeek V4 Pro at any practical quant. Move to Mac Studio M3 Ultra 192GB or 4× H100 cluster.

Reasoning-first variant: serve DeepSeek R1 Distill Llama 70B with 64K+ context — quad-3090 leaves substantial KV-cache headroom for long reasoning chains.
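
Rough numbers behind that headroom claim, assuming Llama-70B's architecture (80 layers, 8 KV heads under GQA, head dim 128) and an FP16 KV cache:

# Per-token KV cache: 80 layers × 8 KV heads × 128 dim × 2 (K+V) × 2 bytes ≈ 320 KB
# ~48 GB free VRAM ÷ 320 KB/token ≈ 150K tokens of cache
# Enough for one 128K-context stream, or roughly two 64K streams at once.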

Single-machine alternative: a Mac Studio M3 Ultra with 192 GB of unified memory draws ~370W versus the quad-3090's ~1400W. Slower per-stream but operationally far simpler.

Who should avoid this build

  • Living-room or office deployments — thermal + acoustic envelope is unacceptable.
  • First-time multi-GPU builders — start with dual before quad. Operational complexity is real.
  • Anyone considering a single H100 80GB — at $20-30k used, H100 makes sense if budget allows; quad-3090 is the path when budget is hard-capped at $4-5k.
  • Production users needing reliability — datacenter hardware exists for a reason. Quad used consumer cards is a hobby-lab choice.

Going deeper