UNIT · NVIDIA · GPU
141 GB VRAM · workstation · Reviewed June 2026

NVIDIA H200

Hopper refresh — 141GB HBM3e at ~4.8 TB/s. Datacenter-class; rentable on RunPod, Lambda, etc.

Released 2024·4800 GB/s memory bandwidth

Affiliate disclosure: as an Amazon Associate and partner of other retailers, we earn from qualifying purchases. The verdict on this page is our editorial opinion; affiliate links never influence what we recommend.

RUNLOCALAI SCORE
676 / 1000 · BB-tier · Estimated
Throughput: 500 / 500
VRAM-fit: 200 / 200
Ecosystem: 200 / 200
Efficiency: 66 / 100

Sub-scores sum to 966 / 1000. Headline = 966 × 0.70 (Estimated-confidence discount) = 676 after rounding. This is an algorithmic performance-tier score, distinct from, and often lower than, the editorial “Our verdict” below, which weighs value and real-world fit (especially for hardware we haven’t measured yet).
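The headline arithmetic can be reproduced in a few lines. This is a minimal sketch using only the numbers quoted on this page; the function name and the use of `round` are assumptions, since the page does not say how the 676.2 result is rounded.

```python
# Sub-scores as listed on this page: (points earned, maximum points).
SUB_SCORES = {
    "Throughput": (500, 500),
    "VRAM-fit": (200, 200),
    "Ecosystem": (200, 200),
    "Efficiency": (66, 100),
}
ESTIMATED_DISCOUNT = 0.70  # Estimated-confidence discount quoted on this page


def headline_score(sub_scores, discount):
    """Return (raw sum of sub-scores, discounted headline score)."""
    raw = sum(points for points, _maximum in sub_scores.values())
    # Rounding to the nearest integer is an assumption, not a documented rule.
    return raw, round(raw * discount)


raw, headline = headline_score(SUB_SCORES, ESTIMATED_DISCOUNT)
print(f"raw {raw} / 1000, headline {headline} / 1000")
```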

Extrapolated from 4800 GB/s bandwidth — 576.0 tok/s estimated. No measured benchmarks yet.
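The 576.0 tok/s figure is the site's own extrapolation, and its formula is not given here. As a rough cross-check, a common rule of thumb treats single-stream decode as memory-bound: each generated token streams every weight from VRAM once, so bandwidth divided by weight size gives a ceiling. The sketch below assumes a dense model, batch size 1, and illustrative bits-per-weight values; it is not the scoring formula used on this page.

```python
H200_BANDWIDTH_GB_S = 4800  # memory bandwidth quoted on this page


def decode_ceiling_tok_s(bandwidth_gb_s, params_billions, bits_per_weight):
    """Upper bound on batch-1 decode speed for a dense model.

    Assumes every weight is read from VRAM once per generated token and
    ignores KV-cache reads, kernel overhead, and compute limits.
    """
    weight_gb = params_billions * bits_per_weight / 8  # billions of params -> GB
    return bandwidth_gb_s / weight_gb


# Hypothetical examples; 4.5 bits per weight stands in for a typical 4-bit quant.
for label, params_b, bpw in [("7B ~Q4", 7, 4.5), ("70B ~Q4", 70, 4.5), ("70B FP16", 70, 16)]:
    ceiling = decode_ceiling_tok_s(H200_BANDWIDTH_GB_S, params_b, bpw)
    print(f"{label:9s} ceiling ~{ceiling:.0f} tok/s")
```

Real throughput lands below these ceilings, and batched serving changes the picture entirely because one weight read is shared across many requests.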

Plain-English: Runs 70B comfortably — snappy enough for a coding agent; vision models supported.

7B chat: Comfortable
14B chat: Comfortable
32B chat: Comfortable
70B chat: Comfortable
Coding agent: Comfortable
Vision (≤8B VLM): Comfortable
Long context (32K): Comfortable

Legend:
Comfortable: fits with headroom
Tight: works, no slack
Marginal: needs aggressive quant
Doesn't fit usefully

Verdicts extrapolated from catalog VRAM + bandwidth + ecosystem flags. Want measured numbers? Submit your own run with runlocalai-bench --submit.

BLK · VERDICT

Our verdict

OP · Fredoline Eruo|VERIFIED JUN 12, 2026
10.0/10

What it does well

The H200 is the H100's mid-life refresh and the right answer for almost every "I need datacenter-grade frontier model inference and I don't already own H100s" decision in 2026. The headline change from H100: 141 GB HBM3e at 4.8 TB/s, vs the H100's 80 GB HBM3 at 3.35 TB/s. That's ~76% more memory and ~43% more bandwidth, on the same architecture, in the same SXM5 socket, with the same software stack. The bandwidth gain shows up directly in long-context decode and large-prompt prefill, the workloads where H100 was already best-in-class. Memory headroom now fits DeepSeek V3 671B at roughly 1.5 bits per weight (about 126 GB of weights) on a single card, with limited room left for context. Llama 405B at Q4 needs roughly 200 to 230 GB for weights alone, so it does not fit on one card; 2× H200 NVLinked (282 GB combined, NVLink at 900 GB/s) handles 405B Q4 or 671B at about 3 bits per weight. 405B at FP16 (about 810 GB) needs more than two cards. NVIDIA's full enterprise stack works: NeMo, Triton, TensorRT-LLM, BlueField DPU integration, MIG partitioning. Cloud rental at ~$3–4.50/hr on RunPod / Lambda makes it accessible without cap-ex.

Where it breaks

  • It's no longer the frontier. B200 at 192 GB / 8 TB/s is the 2026 training flagship. H200 is the 2024 flagship; B200 is what NVIDIA wants you to buy now. For training scale and FP4 throughput, B200 wins. For inference, H200's $/throughput is still better than B200 in most realistic 2026 workloads.
  • SXM5 only at top tier — PCIe H200 NVL is a different SKU with much lower bandwidth. The 4.8 TB/s spec is SXM5. PCIe H200 NVL is ~3 TB/s effective. Read the SKU carefully when renting or buying.
  • Cap-ex is real. $30,000–$32,000 retail for SXM5 H200, plus the DGX-class motherboard and cooling overhead. Most of the world should be renting H200, not buying.
  • Power and thermal density. 700 W TDP, dense rack workloads, datacenter cooling assumed. Not for an under-the-desk workstation. The "H200 in your office" build doesn't exist outside DGX Station tier.
  • Marginal vs H100 for many workloads. If your model fits 80 GB and your context isn't the bottleneck, H100 at $25,000 vs H200 at $31,000 may not justify the upgrade — the gap is meaningful but not transformational on shorter-context inference.

Ideal model range

  • Sweet spot: 405B-class inference at Q4 on 2× H200 NVLinked with long context. Q4 weights for 405B run roughly 200 to 230 GB, so a single 141 GB card holds 405B only at about 2.5 bits per weight or lower.
  • Sweet spot: 70B and 200B-class at FP16 with very long contexts (128K+) where bandwidth dominates. The 4.8 TB/s shows here.
  • Sweet spot: Multi-tenant production serving — vLLM continuous batching across 30–80 concurrent users on 70B FP16 with 16–32K context, or 100+ users on 32B FP16.
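  • Stretch: 671B (DeepSeek V3 / R1) at about Q1.5 single-card (roughly 126 GB of weights; at a full 2 bits per weight the weights reach about 168 GB and no longer fit). Yes it runs, no it's not the best $/req: pick 2× H200 with NVLink for proper 671B serving.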
  • Stretch: Frontier-model fine-tuning. 70B QLoRA fits one H200 with comfortable headroom; 70B FP16 fine-tuning fits 2× H200 NVLinked.
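Fit claims like the ones in this list can be sanity-checked with weights-only arithmetic: parameters × bits per weight ÷ 8. The sketch below is a rough check under stated assumptions. The bits-per-weight values are illustrative (real quant formats carry extra scale data and vary), and KV cache, activations, and runtime overhead are not counted, so positive headroom is necessary but not sufficient.

```python
H200_VRAM_GB = 141  # per card, from the spec table on this page


def weight_gb(params_billions, bits_per_weight):
    """Weights-only footprint in GB (billions of params * bits / 8)."""
    return params_billions * bits_per_weight / 8


def headroom_gb(params_billions, bits_per_weight, cards=1):
    """VRAM left after weights; negative means the weights alone do not fit."""
    return cards * H200_VRAM_GB - weight_gb(params_billions, bits_per_weight)


# (label, params in billions, assumed bits per weight, number of cards)
CASES = [
    ("70B FP16", 70, 16.0, 1),
    ("405B ~Q4", 405, 4.5, 1),
    ("405B ~Q4", 405, 4.5, 2),
    ("671B ~Q1.5", 671, 1.5, 1),
    ("671B ~Q3", 671, 3.0, 2),
]

for label, params_b, bpw, cards in CASES:
    print(
        f"{label:11s} x{cards}: weights {weight_gb(params_b, bpw):6.1f} GB, "
        f"headroom {headroom_gb(params_b, bpw, cards):+7.1f} GB"
    )
```

Whatever headroom remains is the budget for KV cache, which grows with context length and concurrent users.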

Bad use cases

  • Single-developer hobby workloads. Rent on RunPod or buy a 4090 / 5090. The H200's value is multi-tenant production serving and frontier inference, not single-user.
  • Anything that fits 24–48 GB. L40S at 1/4 the price wins by every metric. Don't overprovision.
  • Frontier training where you'd actually use B200's FP4. Rent B200 instead.
  • Buying retail at cap-ex without a steady-state workload. Against rental at $3–4.50/hr, a ~$31,000 card breaks even after roughly 7,000 to 10,000 hours of utilization (about 9 to 14 months of 24×7), before counting chassis, power, and cooling. Most workloads don't justify this.
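The rent-versus-buy breakeven is simple division, sketched below with the prices quoted on this page. It is a card-only estimate: baseboard, chassis, power, cooling, and hosting are deliberately left out (an assumption that makes buying look better than it is), and the `utilization` parameter is a hypothetical knob for less-than-24×7 use.

```python
HOURS_PER_MONTH = 8760 / 12  # 730 hours in an average month


def breakeven_hours(capex_usd, rental_usd_per_hr):
    """Rented GPU-hours whose cost equals the purchase price."""
    return capex_usd / rental_usd_per_hr


def breakeven_months(capex_usd, rental_usd_per_hr, utilization=1.0):
    """Calendar months to break even at a given utilization (1.0 = 24x7)."""
    hours = breakeven_hours(capex_usd, rental_usd_per_hr)
    return hours / (HOURS_PER_MONTH * utilization)


H200_CAPEX_USD = 31_000  # MSRP from the spec table; card only

for rate in (3.00, 4.50):
    for utilization in (1.0, 0.5):
        print(
            f"${rate:.2f}/hr at {utilization:.0%} utilization: "
            f"{breakeven_hours(H200_CAPEX_USD, rate):,.0f} h, "
            f"{breakeven_months(H200_CAPEX_USD, rate, utilization):.1f} months"
        )
```

Halving utilization doubles the calendar time to breakeven, which is why the rent-first advice holds for bursty workloads.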

Verdict

Buy this if you're operating a datacenter or colo with steady-state 70B+ FP16 production inference, frontier-model serving (200B/405B/671B), or multi-tenant inference at scale, and you've calculated cap-ex vs rental over a 2-year horizon and the cap-ex wins. The H200 is the canonical "I need datacenter-grade frontier inference and I'm running it 24×7" GPU. Pair with NVLink for 282 GB tier when single-card isn't enough.

Skip this if you're a hobbyist (rent or buy consumer), your workload fits L40S (much better $/throughput), your utilization would sit below 50% (rental dominates), you're frontier-training (B200 is the right pick), or you're a startup that should be on cloud rental until your inference economics justify cap-ex.

How it compares

  • vs H100 SXM (80 GB) → H200 is the same chip with 76% more memory + 43% more bandwidth. Pick H200 over H100 SXM for any new build; pick H100 SXM only if you're matching an existing H100 cluster or finding it at >25% discount. See /compare/nvidia-h200-vs-nvidia-h100-sxm.
  • vs B200 (192 GB) → B200 has more memory + bandwidth + native FP4 support, at higher cap-ex (~$40,000) and more demanding cooling. Pick B200 for frontier training and FP4 production; pick H200 for 90% of inference workloads where the cost gap doesn't pay for itself.
  • vs H100 NVL (188 GB) → H100 NVL is two H100s NVLinked in a single SKU at ~$60,000. H200 is one card at $31,000 with similar effective memory ceiling. Pick H200 for new builds; H100 NVL only makes sense in specific 188 GB single-SKU deployment slots.
  • vs L40S (48 GB) → L40S at $7,500 is roughly 1/4 the price and under 1/5 the bandwidth (864 GB/s vs 4.8 TB/s). For 70B Q4 or 32B 8-bit inference (32B at FP16 needs about 64 GB and does not fit 48 GB), L40S wins $/throughput. For frontier models or long context or training, H200 dominates.
  • vs renting on RunPod / Lambda / Together → H200 rents at ~$3–4.50/hr SXM, ~$2.50–3.50/hr PCIe. Cap-ex breakeven on a ~$31,000 card is roughly 7,000 to 10,000 hours, or about 9 to 14 months of 24×7. Most readers should rent H200 first and only buy when steady-state utilization above 70% is sustained for more than 12 months.
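The ratios in these comparisons follow directly from the quoted specs. A minimal sketch, using only memory, bandwidth, and price figures that appear on this page (the dictionary layout and helper names are illustrative):

```python
# (memory GB, bandwidth GB/s, price USD) as quoted on this page
CARDS = {
    "H200": (141, 4800, 31_000),
    "H100 SXM": (80, 3350, 25_000),
    "L40S": (48, 864, 7_500),
}


def pct_more(new, old):
    """Percentage by which `new` exceeds `old`."""
    return (new / old - 1) * 100


def ratios_vs(card, baseline="H200"):
    """Memory, bandwidth, and price of `card` as fractions of `baseline`."""
    mem, bw, price = CARDS[card]
    base_mem, base_bw, base_price = CARDS[baseline]
    return {
        "memory": mem / base_mem,
        "bandwidth": bw / base_bw,
        "price": price / base_price,
    }


h200, h100 = CARDS["H200"], CARDS["H100 SXM"]
print(f"H200 vs H100 SXM: +{pct_more(h200[0], h100[0]):.0f}% memory, "
      f"+{pct_more(h200[1], h100[1]):.0f}% bandwidth")
print("L40S as a fraction of H200:",
      {key: round(value, 2) for key, value in ratios_vs("L40S").items()})
```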
BLK · OVERVIEW

Overview

Hopper refresh — 141GB HBM3e at ~4.8 TB/s. Datacenter-class; rentable on RunPod, Lambda, etc.

Retailers we'd check: Amazon

Some links above are affiliate links. We may earn a commission at no extra cost to you.

BLK · SPECS

Specs

VRAM: 141 GB
Power draw (peak): 700 W
Released: 2024
MSRP: $31,000
Backends: CUDA

Models that fit

Open-weight models small enough to run on NVIDIA H200 with usable context.

Compare alternatives

Hardware worth comparing

The closest alternatives by price, memory bandwidth, and form factor, plus a step up and down — so you can frame the buying decision against real options.

Frequently asked

What models can NVIDIA H200 run?

With 141 GB of VRAM, the NVIDIA H200 runs 70B models at 4-bit or 8-bit quantization with room for long context, plus everything smaller. See the "Models that fit" section above for specific combinations.

Does NVIDIA H200 support CUDA?

Yes — NVIDIA H200 is an NVIDIA card with full CUDA support, the most mature local-AI backend. llama.cpp, Ollama, vLLM, and ExLlamaV2 all run natively.

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify hardware specifications.