Single-node multi-GPU · NVLink-Switch · Expert

vLLM tensor-parallel 4× H100 80GB workstation

Datacenter-tier serving rig: 4× H100 80GB SXM with NVLink-Switch fabric. 320 GB total / ~300 GB effective. The reference vLLM tensor-parallel deployment for production.

By Fredoline Eruo · Reviewed 2026-05-06
Try this build in the custom builder

Tweak GPU count, mix in another card, switch OS / runtime — see which models still fit.

Open in builder →
Memory budget
  • Total VRAM: 320 GB
  • Effective for inference: 300 GB (94% of total)
  • Not pooled

4× H100 80GB SXM with NVLink-Switch fabric is the rare configuration where total VRAM ≈ effective VRAM. The NVLink-Switch (DGX-H100 chassis) provides full-mesh 900 GB/s bidirectional bandwidth between all 4 cards, allowing tensor parallelism with negligible cross-card overhead. Effective ceiling for inference is ~300 GB — total minus ~5 GB per card for activations, KV cache, and runtime overhead at 32K context. This is the configuration where Qwen 3.5 235B-A17B at FP8 fits with full headroom, or DeepSeek V4 Pro at AWQ-INT4 fits comfortably.
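
Before trusting the fabric, verify it. A minimal check, assuming a standard NVIDIA driver install with nvidia-smi on PATH; on a healthy DGX-class box the topology matrix should show NVLink entries between all four GPUs:

```bash
# Confirm all four GPUs report 80 GB each.
nvidia-smi --query-gpu=index,name,memory.total --format=csv
# GPU-to-GPU topology matrix: entries between GPUs should read NV# (NVLink),
# not PHB/SYS (PCIe hops), if the NVLink-Switch fabric is cabled and healthy.
nvidia-smi topo -m
```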

Topology

  • Topology: single-node-multi-gpu
  • Interconnect: nvlink-switch (~900 GB/s)
  • Component count: 4 units
  • Components: 4× nvidia-h100-sxm

Recommended runtimes

Runtimes that are operationally viable for this combo. Each links to the runtime’s operational review.

vLLM · SGLang · TensorRT-LLM

Supported split strategies

How the model is partitioned across the components. The right strategy depends on model architecture, runtime, and interconnect bandwidth.

Tensor parallel · Expert routing

Why this combo

4× H100 80GB SXM is the datacenter production reference for local-AI serving. The use cases:

  • Frontier MoE production serving (Qwen 3.5, DeepSeek V4, Llama 4)
  • High-concurrency 70-100B inference for organizations
  • Research / training workloads
  • Multi-tenant agent serving at scale

Honest framing: this is enterprise-tier hardware. For individuals, hosted inference (Together, Fireworks, Anthropic API) is dramatically cheaper. The case for self-hosting at this tier is data sovereignty, custom models, or extreme inference volume.

Runtime compatibility

  • vLLM ✓ excellent. The reference deployment: --tensor-parallel-size 4 with FP8 or AWQ-INT4 quants (launch sketch after this list).
  • SGLang ✓ excellent. Particularly strong for agent serving with stable system prompts.
  • TensorRT-LLM ✓ best-in-class throughput at the cost of recompile-per-config friction.
  • Ray Serve ✓ for multi-replica patterns at scale.
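
A minimal launch sketch for the vLLM path, assuming a recent vLLM release with the vllm serve entrypoint. The model id and context length are placeholders, not a validated production config; substitute whichever FP8 or AWQ-INT4 checkpoint you actually deploy.

```bash
# Tensor-parallel-4 OpenAI-compatible server; flags shown are standard vLLM options.
# Pre-quantized FP8/AWQ checkpoints are picked up from the weights' quant config.
vllm serve <org>/<model-fp8> \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --port 8000
```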

Comparison vs alternatives

| Metric | 4× H100 SXM workstation | 4× RTX 4090 | DGX H200 8-GPU |
| --- | --- | --- | --- |
| Effective VRAM | 300 GB | ~92 GB | 1100 GB+ |
| FP8 throughput | Top-tier | Limited | Top-tier |
| Tokens/sec (Qwen 3.5 235B INT4) | 80-150 | N/A (doesn't fit) | 200-400 |
| Cost | $200,000+ | $5,000-7,500 | $400,000+ |
| Production readiness | Yes | No | Yes |

This is the floor for serious frontier-MoE production serving. Below this tier (4× 4090, quad 3090), the model envelope doesn't reach frontier-tier targets at any practical quant.

Cloud alternative

For most teams, hosted H100 (RunPod, Lambda, CoreWeave) is the right path until inference volume exceeds ~$10k/month sustained; $200,000+ of hardware amortized over two years is roughly $8,000+/month before power, colocation, and staff time.

Related

  • /stacks/distributed-inference-homelab — multi-node alternative
  • /systems/distributed-inference — architectural depth
  • /tools/vllm — runtime operational review
  • /guides/running-local-ai-on-multiple-gpus-2026 — buying guide

Best model classes

  • Frontier MoE serving — Qwen 3.5 235B-A17B, DeepSeek V4 Pro, Llama 4 Maverick all fit at FP8 or INT4 with full production headroom.
  • High-concurrency 70-100B serving — vLLM serves 32+ concurrent agent loops at >50 tok/s each.
  • Long-context 1M-token workloads — KV cache budget is generous at this VRAM tier (sizing sketch after this list).
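
A back-of-envelope sizing sketch for the KV budget behind those concurrency numbers. The layer/head counts and the free-VRAM figure are illustrative assumptions, not any specific model's config; read real values from the model's config.json and from what vLLM reports as free KV blocks at startup.

```bash
# KV bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element.
LAYERS=64; KV_HEADS=8; HEAD_DIM=128; BYTES=1            # BYTES=1 assumes an fp8 KV cache
PER_TOKEN=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES ))
PER_SEQ_GIB=$(( PER_TOKEN * 32768 / 1024 / 1024 / 1024 ))   # one 32K-token sequence
KV_BUDGET_GIB=160                                        # assumption: VRAM left after weights + activations
echo "KV per token: ${PER_TOKEN} B; per 32K sequence: ~${PER_SEQ_GIB} GiB"
echo "Concurrent 32K sequences in ${KV_BUDGET_GIB} GiB: ~$(( KV_BUDGET_GIB / PER_SEQ_GIB ))"
```

The same arithmetic explains why a single 1M-token sequence is feasible at this tier (on the order of 120 GiB of KV with these illustrative numbers) but out of reach on smaller rigs.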

This is the production-default deployment for organizations serving inference at scale.

What this combo is bad at

  • Cost-constrained deployment — $200,000+ all-in for the chassis + 4× H100 SXM. Only justified at significant production scale.
  • Single-stream latency — tensor-parallel-4 doesn't beat tensor-parallel-2 for single-user latency; you only win on aggregate throughput (a two-replica TP-2 layout is sketched after this list).
  • Edge deployment — datacenter rack required.
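
A hypothetical two-replica layout for the latency point above, assuming the checkpoint fits in 2× 80 GB (70-100B class); the model id and ports are placeholders.

```bash
# Two independent TP-2 vLLM servers instead of one TP-4 server; put your usual
# load balancer in front. Only viable when the model fits in two cards.
CUDA_VISIBLE_DEVICES=0,1 vllm serve <model-id> --tensor-parallel-size 2 --port 8000 &
CUDA_VISIBLE_DEVICES=2,3 vllm serve <model-id> --tensor-parallel-size 2 --port 8001 &
```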

Who should avoid this

  • Individual users / small teams — H100 cloud (RunPod, Lambda) is cheaper for sporadic workloads.
  • Hobby projects — quad RTX 3090 covers 90% of hobby use cases at 5% of the cost.
  • CUDA-version-sensitive workloads — H100 requires CUDA 12+ which may break older PyTorch / framework code.
Power & thermal
~2800W peak

DGX-class chassis. Datacenter rack required; not viable for office or home deployment. Liquid cooling is optional in some chassis but is the standard choice for 24/7 deployment.
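
A quick way to watch actual draw against the board limits while load-testing, assuming a standard driver install:

```bash
# Per-GPU draw vs. enforced limit, refreshed every 5 seconds. Four 700 W boards
# plus host, fans, and fabric is what pushes the chassis toward the ~2800 W figure above.
nvidia-smi --query-gpu=index,power.draw,power.limit --format=csv -l 5
```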

Reliability

H100 SXM has the strongest reliability track record in production AI serving. NVIDIA enterprise warranty + datacenter SLAs. Failure modes are dominated by environmental factors (power quality, cooling) rather than card failure.

Recommended OS

Ubuntu 22.04 LTS with NVIDIA enterprise driver stack.

Operator warning — failure modes

Failure modes specific to 4× H100 SXM workstation

  1. Cooling under-spec. SXM modules require chassis-integrated liquid or aggressive air cooling; off-the-shelf chassis with PCIe H100 NVL ≠ SXM in cooling design. Verify thermal envelope before committing.
  2. CUDA / driver / vLLM version mismatch. H100 features (FP8 transformer engine, MIG partitioning) require precise stack alignment. Pin versions in production (pinning sketch after this list).
  3. NVLink-Switch firmware bugs. Rare but real — switch fabric issues produce subtle cross-card corruption that's hard to diagnose. Stay on NVIDIA-validated firmware.
  4. MIG partition complexity. Multi-Instance GPU mode is powerful but complex; misconfiguration produces silent throughput loss.
  5. Power delivery transients. 4× 700W = 2800W sustained; transients can hit 4000W. PDU and UPS sizing is non-trivial.
  6. Tensor-parallel-4 single-stream stall. Counter-intuitively, 4-rank tensor-parallel is slower per-stream than 2-rank because the all-reduce gets less efficient. For latency-critical single-user workloads, run 2× tensor-parallel-2 replicas instead.
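
One way to hold the stack still, sketched under the assumption of a containerized deploy; the tags and versions below are illustrative, not a validated matrix. Pin whatever combination you have actually burned in.

```bash
# Pin the serving image instead of tracking :latest (tag is illustrative).
docker pull vllm/vllm-openai:v0.6.3
# Bare-metal alternative: pin wheel versions in the venv (versions illustrative).
pip install "vllm==0.6.3" "torch==2.4.0"
# Record the exact driver and CUDA runtime you validated alongside the deploy.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvcc --version
```
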
Closest alternative

Quad RTX 3090 →

If you can't justify the $200k+ datacenter spend, quad-3090 covers 100B-class at 5% of the cost. H100 wins on reliability + frontier-MoE; 3090 wins on price-to-capability ratio.

Featured in stack

4× H100 SXM tensor-parallel workstation →

DGX-class deployment recipe with vLLM TP-4, FP8 transformer engine, NVLink-Switch verification, and cost-realism vs cloud rental.

Benchmark opportunities

Pending measurement targets for this combo. These are estimates, not measurements — actual benchmarks land in the catalog when run.

4× H100 SXM + Qwen 3.5 235B-A17B (vLLM TP-4, FP8)

qwen-3.5-235b-a17b
pending
Estimate: 60-90 tok/s decode (single stream)

Frontier MoE on the datacenter reference rig. FP8 fits comfortably in 4× 80GB; expect strong per-stream decode and dramatic concurrency lift via SGLang RadixAttention.

4× H100 SXM + DeepSeek V4 Flash (vLLM TP-4, INT4)

deepseek-v4-flash
pending
Estimate: 100-160 tok/s decode (single stream)

DeepSeek V4 Flash is the throughput-tuned V4 sibling. At 80B total / 12B active parameters on 4× H100, it should produce the strongest open-weight tok/s of 2026.

Going deeper

  • All hardware combinations — browse other multi-GPU and multi-machine setups.
  • Running local AI on multiple GPUs in 2026 — the flagship buying / deployment guide.
  • Distributed inference systems — architectural depth.