NVIDIA H200
No editorial image yet — generic vendor mark shown. Credentials in spec table below.
Hopper refresh — 141GB HBM3e at ~4.8 TB/s. Datacenter-class; rentable on RunPod, Lambda, etc.
Affiliate disclosure: as an Amazon Associate and partner of other retailers, we earn from qualifying purchases. The verdict on this page is our editorial opinion; affiliate links never influence what we recommend.
Sub-scores sum to 966 / 1000. Headline = 966 × 0.70 (Estimated-confidence discount) = 676. This is an algorithmic performance-tier score — distinct from, and often lower than, the editorial “Our verdict” below, which weighs value and real-world fit (especially for hardware we haven’t measured yet). How scoring works →
Extrapolated from 4800 GB/s bandwidth — 576.0 tok/s estimated. No measured benchmarks yet.
Plain-English: Runs 70B comfortably — snappy enough for a coding agent; vision models supported.
Verdicts extrapolated from catalog VRAM + bandwidth + ecosystem flags. Hover any chip for the rationale. Want measured numbers? Submit your own run with runlocalai-bench --submit.
What it does well
The H200 is the H100's mid-life refresh and the right answer for almost every "I need datacenter-grade frontier model inference and I don't already own H100s" decision in 2026. The headline change from H100: 141 GB HBM3e at 4.8 TB/s, vs the H100's 80 GB HBM3 at 3.35 TB/s. That's ~76% more memory and ~43% more bandwidth, on the same architecture, in the same SXM5 socket, with the same software stack. The bandwidth gain shows up directly in long-context decode and large-prompt prefill — the workloads where H100 was already best-in-class. Memory headroom now fits Llama 405B Q4 on a single card or DeepSeek V3 671B at Q1.5 with comfortable context, and 2× H200 NVLinked (282 GB combined at NVLink 900 GB/s) handles 405B FP16 or 671B Q3 with full operational context. NVIDIA's full enterprise stack works: NeMo, Triton, TensorRT-LLM, BlueField DPU integration, MIG partitioning. Cloud rental at ~$3–4.50/hr on Runpod / Lambda makes it accessible without cap-ex.
Where it breaks
- It's no longer the frontier. B200 at 192 GB / 8 TB/s is the 2026 training flagship. H200 is the 2024 flagship; B200 is what NVIDIA wants you to buy now. For training scale and FP4 throughput, B200 wins. For inference, H200's $/throughput is still better than B200 in most realistic 2026 workloads.
- SXM5 only at top tier — PCIe H200 NVL is a different SKU with much lower bandwidth. The 4.8 TB/s spec is SXM5. PCIe H200 NVL is ~3 TB/s effective. Read the SKU carefully when renting or buying.
- Cap-ex is real. $30,000–$32,000 retail for SXM5 H200, plus the DGX-class motherboard and cooling overhead. Most of the world should be renting H200, not buying.
- Power and thermal density. 700 W TDP, dense rack workloads, datacenter cooling assumed. Not for an under-the-desk workstation. The "H200 in your office" build doesn't exist outside DGX Station tier.
- Marginal vs H100 for many workloads. If your model fits 80 GB and your context isn't the bottleneck, H100 at $25,000 vs H200 at $31,000 may not justify the upgrade — the gap is meaningful but not transformational on shorter-context inference.
Ideal model range
- Sweet spot: 405B-class single-card inference at Q4–Q5 with long context. The first datacenter card that does this without multi-card complexity.
- Sweet spot: 70B and 200B-class at FP16 with very long contexts (128K+) where bandwidth dominates. The 4.8 TB/s shows here.
- Sweet spot: Multi-tenant production serving — vLLM continuous batching across 30–80 concurrent users on 70B FP16 with 16–32K context, or 100+ users on 32B FP16.
- Stretch: 671B (DeepSeek V3 / R1) at Q1.5–Q2 single-card. Yes it runs, no it's not the best $/req — pick 2× H200 with NVLink for proper 671B serving.
- Stretch: Frontier-model fine-tuning. 70B QLoRA fits one H200 with comfortable headroom; 70B FP16 fine-tuning fits 2× H200 NVLinked.
Bad use cases
- Single-developer hobby workloads. Rent on Runpod or buy a 4090 / 5090. The H200's value is multi-tenant production serving and frontier inference, not single-user.
- Anything that fits 24–48 GB. L40S at 1/4 the price wins by every metric. Don't overprovision.
- Frontier training where you'd actually use B200's FP4. Rent B200 instead.
- Buying retail at cap-ex without a steady-state workload. Renting H200 at $3–4.50/hr breaks even with cap-ex around 7,000+ hours of utilization (~9 months of 24×7). Most workloads don't justify this.
Verdict
Buy this if you're operating a datacenter or colo with steady-state 70B+ FP16 production inference, frontier-model serving (200B/405B/671B), or multi-tenant inference at scale, and you've calculated cap-ex vs rental over a 2-year horizon and the cap-ex wins. The H200 is the canonical "I need datacenter-grade frontier inference and I'm running it 24×7" GPU. Pair with NVLink for 282 GB tier when single-card isn't enough.
Skip this if you're a hobbyist (rent or buy consumer), your workload fits L40S (much better $/throughput), you can rent H200 at <50% utilization (rental dominates), you're frontier-training (B200 is the right pick), or you're a startup that should be on cloud rental until your inference economics justify cap-ex.
How it compares
- vs H100 SXM (80 GB) → H200 is the same chip with 76% more memory + 43% more bandwidth. Pick H200 over H100 SXM for any new build; pick H100 SXM only if you're matching an existing H100 cluster or finding it at >25% discount. See /compare/nvidia-h200-vs-nvidia-h100-sxm.
- vs B200 (192 GB) → B200 has more memory + bandwidth + native FP4 support, at higher cap-ex (~$40,000) and more demanding cooling. Pick B200 for frontier training and FP4 production; pick H200 for 90% of inference workloads where the cost gap doesn't pay for itself.
- vs H100 NVL (188 GB) → H100 NVL is two H100s NVLinked in a single SKU at ~$60,000. H200 is one card at $31,000 with similar effective memory ceiling. Pick H200 for new builds; H100 NVL only makes sense in specific 188 GB single-SKU deployment slots.
- vs L40S (48 GB) → L40S at $7,500 is roughly 1/4 the price and 1/3 the bandwidth (864 GB/s vs 4.8 TB/s). For 70B Q4 / 32B FP16 inference, L40S wins $/throughput. For frontier models or long context or training, H200 dominates.
- vs renting on Runpod / Lambda / Together → H200 rents at ~$3–4.50/hr SXM, ~$2.50–3.50/hr PCIe. Cap-ex breakeven is ~7,000+ hours = 9 months 24×7. Most readers should rent H200 first and only buy when steady-state utilization > 70% sustains it for >12 months.
Overview
Hopper refresh — 141GB HBM3e at ~4.8 TB/s. Datacenter-class; rentable on RunPod, Lambda, etc.
Some links above are affiliate links. We may earn a commission at no extra cost to you. How we make money.
Specs
| VRAM | 141 GB |
| Power draw (peak) | 700 W |
| Released | 2024 |
| MSRP | $31000 |
| Backends | CUDA |
Models that fit
Open-weight models small enough to run on NVIDIA H200 with usable context.
Hardware worth comparing
The closest alternatives by price, memory bandwidth, and form factor, plus a step up and down — so you can frame the buying decision against real options.
Frequently asked
What models can NVIDIA H200 run?
Does NVIDIA H200 support CUDA?
Where next?
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify hardware specifications.