RUNLOCALAI · v38
UNIT · NVIDIA · GPU
80 GB VRAM · workstation · Reviewed June 2026

NVIDIA H100 SXM

Figure: NVIDIA H100 SXM spec card (80 GB HBM3 VRAM, 3.35 TB/s bandwidth, 700 W; 70B FP16 and vLLM serving workloads).
Credit: RunLocalAI · License: CC-BY-4.0 (original illustration)

Hopper SXM5: 80 GB HBM3 at 3.35 TB/s. The GPU generation behind much of the 2024–2025 wave of large-model training. Cloud-rentable.

Released 2022 · 3350 GB/s memory bandwidth

Affiliate disclosure: as an Amazon Associate and partner of other retailers, we earn from qualifying purchases. The verdict on this page is our editorial opinion; affiliate links never influence what we recommend.

RUNLOCALAI SCORE
655 / 1000 · B-tier · Estimated

Sub-score | Points
Throughput | 500 / 500
VRAM-fit | 190 / 200
Ecosystem | 200 / 200
Efficiency | 46 / 100

Sub-scores sum to 936 / 1000. Headline = 936 × 0.70 (Estimated-confidence discount) ≈ 655. This is an algorithmic performance-tier score, distinct from (and often lower than) the editorial “Our verdict” below, which weighs value and real-world fit, especially for hardware we haven’t measured yet.
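
The headline arithmetic can be reproduced in a few lines. This is a minimal sketch using only the numbers printed on this page; the rounding to the nearest integer is an assumption, since the page does not say how the product is rounded.

```python
# Sub-scores as listed on this page: (points earned, points available).
SUB_SCORES = {
    "Throughput": (500, 500),
    "VRAM-fit": (190, 200),
    "Ecosystem": (200, 200),
    "Efficiency": (46, 100),
}

# Discount stated above for scores with "Estimated" confidence.
ESTIMATED_DISCOUNT = 0.70


def headline_score(sub_scores, discount):
    """Sum the earned points, then apply the confidence discount."""
    raw = sum(earned for earned, _available in sub_scores.values())
    # Rounding to the nearest integer is an assumption of this sketch.
    return raw, round(raw * discount)


raw_total, headline = headline_score(SUB_SCORES, ESTIMATED_DISCOUNT)
print(f"raw {raw_total} / 1000, headline {headline} / 1000")
```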

Extrapolated from 3350 GB/s bandwidth — 402.0 tok/s estimated. No measured benchmarks yet.
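
For intuition on bandwidth-based estimates, a common rule of thumb is that single-stream decode speed is capped by how fast the weights can be read from VRAM. The sketch below applies that rule; it is not necessarily the formula this catalog uses, and the function name and model sizes are illustrative assumptions.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, params_billion, bytes_per_param):
    """Upper bound on single-stream decode speed for a dense model.

    Assumes each generated token reads every weight from VRAM once and that
    nothing else (KV cache reads, kernel overhead) costs time, so measured
    numbers land below this ceiling.
    """
    weight_gb = params_billion * bytes_per_param  # 1e9 params x bytes = GB
    return bandwidth_gb_s / weight_gb


H100_SXM_BW_GB_S = 3350  # memory bandwidth from the spec line on this page

# Illustrative model sizes and precisions (assumptions, not measurements).
for label, params_b, bpp in [
    ("8B FP16", 8, 2.0),
    ("70B FP8", 70, 1.0),
    ("70B INT4", 70, 0.5),
]:
    ceiling = decode_ceiling_tok_s(H100_SXM_BW_GB_S, params_b, bpp)
    print(f"{label}: at most ~{ceiling:.0f} tok/s per stream")
```

Batched serving amortizes each weight read across many requests, so aggregate throughput under concurrency can sit far above the per-stream ceiling.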

WORKLOAD FIT

Plain-English: Runs 70B comfortably; snappy enough for a coding agent; vision models supported.

Workload | Verdict
7B chat | ✓ Comfortable
14B chat | ✓ Comfortable
32B chat | ✓ Comfortable
70B chat | ✓ Comfortable
Coding agent | ✓ Comfortable
Vision (≤8B VLM) | ✓ Comfortable
Long context (32K) | ✓ Comfortable

Legend:
✓ Comfortable: fits with headroom
~ Tight: works, no slack
△ Marginal: needs aggressive quant
✗ Doesn't fit usefully

Verdicts extrapolated from catalog VRAM + bandwidth + ecosystem flags. Want measured numbers? Submit your own run with runlocalai-bench --submit.


Our verdict

OP · Fredoline Eruo | Verified Jun 12, 2026
Rating: 10.0/10

What it does well

The H100 SXM5 is the GPU that defined production LLM training and inference for the current era: 80 GB HBM3 at 3.35 TB/s, 700 W TDP, and a full NVLink mesh (900 GB/s between cards) at the SXM5 socket level. This is the module an 8-GPU DGX H100 box is built from, and it is still the dominant deployment in 2026 hyperscaler cap-ex despite the Blackwell B200 launch.

Hopper's architecture features are mature: native FP8 with the first-generation Transformer Engine, dynamic FP8 scaling that delivers ~2× FP16 throughput on most modern frameworks, MIG (multi-instance GPU) for safe multi-tenant partitioning, and confidential computing extensions. The full NVIDIA stack is aggressively H100-tuned: TensorRT-LLM ships H100-specific kernels first, vLLM has its most-optimized paths on H100, and much of the 2024–2025 research literature reports training on H100 clusters.

Cap-ex runs around $30,000–$32,000 retail (or $25,000+ used as B200 ramps), and SXM rental is ~$3.50–$5.00/hr. This is the standard datacenter inference and training tier when you need the SXM5 NVLink mesh advantage.

Where it breaks

  • Architecture is no longer current. B200 is the 2026 flagship: 192 GB / 8 TB/s / FP4 native. For new cap-ex on frontier training workloads, B200 is the right tier.
  • No native FP4. Hopper has FP8 but not FP4, so frameworks that now exploit FP4 (TRT-LLM 0.10+, vLLM v0.7+, certain quantization libraries) get meaningful additional throughput on Blackwell that H100 can't match.
  • DGX motherboard requirement. SXM5 doesn't fit standard PCIe servers; you need a DGX-class chassis or an HGX baseboard from Supermicro / Dell / HPE. The motherboard premium is real.
  • Power and thermal density. At 700 W TDP per card, 8-card baseboards pull 5.5+ kW continuous. This is datacenter-only: no office or even small colo deployment.
  • Memory ceiling vs H200 / MI300X. The H100 SXM tops out at 80 GB, while H200 in the same socket gives 141 GB and MI300X gives 192 GB. For memory-bound large-context inference, H100 SXM is the floor of the SXM tier.
  • Resale erosion is starting. Used H100 SXM has dropped from $40,000+ peaks to ~$25,000. As B200 production ramps and H200 absorbs the upper tier, expect continued price softening over 2026.
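
The power bullet above is simple multiplication, sketched here for capacity planning. It counts GPU TDP only; host CPUs, NICs, fans, and power-supply losses are deliberately left out, so a real node draws more than this figure.

```python
GPU_TDP_W = 700          # per-card TDP from this page
GPUS_PER_BASEBOARD = 8   # 8-card HGX / DGX baseboard


def gpu_power_kw(cards, tdp_w=GPU_TDP_W):
    """GPU-only power at full TDP, in kW (host components excluded)."""
    return cards * tdp_w / 1000


def daily_energy_kwh(power_kw, hours=24.0):
    """Energy drawn over a period of continuous load, in kWh."""
    return power_kw * hours


node_kw = gpu_power_kw(GPUS_PER_BASEBOARD)
print(f"GPU-only draw: {node_kw:.1f} kW")
print(f"GPU-only energy per day: {daily_energy_kwh(node_kw):.1f} kWh")
```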

Ideal model range

  • Sweet spot: 70B production multi-tenant serving via vLLM continuous batching at full FP8. ~150 concurrent users on a single 8× H100 DGX node.
  • Sweet spot: 200B-class production at FP8 across 4×–8× H100 SXM with full NVLink mesh.
  • Sweet spot: Frontier-model fine-tuning (70B FP16 full-finetune across 4× H100 SXM, or 200B+ across 8× H100) — the proven training tier.
  • Sweet spot: 405B production inference across 8× H100 SXM with NVLink mesh — the standard 2024–2025 deployment that's still common in 2026.
  • Stretch: 671B (DeepSeek V3 / R1) production serving across 8× H100 SXM with paged offload to system memory.
  • Comfortable: Anything an A100 80GB SXM does, but with FP8 throughput improvements and modern Transformer Engine optimizations.
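
A quick way to sanity-check the multi-card entries above is weights-only arithmetic under an even tensor-parallel split. This is a rough sketch: it ignores KV cache, activations, and runtime overhead, and it assumes FP8 weights at one byte per parameter. For mixture-of-experts models such as DeepSeek V3 it counts total parameters, since all experts must stay resident.

```python
H100_VRAM_GB = 80  # per-card capacity from this page


def shard_gb_per_gpu(params_billion, bytes_per_param, num_gpus):
    """Weights-only footprint per GPU under an even tensor-parallel split."""
    return params_billion * bytes_per_param / num_gpus


def headroom_gb(params_billion, bytes_per_param, num_gpus, vram_gb=H100_VRAM_GB):
    """VRAM left per GPU for KV cache and activations; negative means no fit."""
    return vram_gb - shard_gb_per_gpu(params_billion, bytes_per_param, num_gpus)


# (label, parameters in billions, bytes per parameter, GPU count)
cases = [
    ("70B FP8 on 1 GPU", 70, 1.0, 1),
    ("405B FP8 on 8 GPUs", 405, 1.0, 8),
    ("671B FP8 on 8 GPUs", 671, 1.0, 8),
]
for label, params_b, bpp, gpus in cases:
    left = headroom_gb(params_b, bpp, gpus)
    status = "fits" if left > 0 else "needs offload or a smaller quant"
    print(f"{label}: {left:+.1f} GB headroom per GPU ({status})")
```

The negative headroom for the 671B case is consistent with that entry being listed as a stretch that relies on offload to system memory.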

Bad use cases

  • Single-card non-DGX deployments. Pick H100 PCIe instead — same chip, half the TDP, fits any PCIe server, ~25% cheaper.
  • Hobbyist / single-developer workloads. Wrong tier entirely. Rent for hours; don't buy.
  • Anything that fits 48 GB. L40S at 1/4 the cap-ex wins for production sub-48 GB inference.
  • New cap-ex when H200 exists. H200 is the same socket with 76% more memory + 43% more bandwidth at +25% price. Almost always the better buy in 2026.
  • Frontier training where FP4 / Blackwell-gen TE2 dominate. Pick B200.
  • Cap-ex without a 24×7 high-utilization workload. Rent on Runpod / Lambda at $3.50–$5/hr SXM.
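
The rent-versus-buy point can be made concrete with a break-even estimate. This sketch uses the card price and rental range quoted on this page; the utilization levels are illustrative assumptions. It omits chassis, power, cooling, and hosting costs, so the real break-even for ownership arrives later than these figures suggest.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def break_even_hours(capex_usd, rental_usd_per_hr):
    """Hours of rental that cost as much as buying the card outright."""
    return capex_usd / rental_usd_per_hr


def break_even_months(capex_usd, rental_usd_per_hr, utilization=1.0):
    """Calendar months to break even at a given utilization (1.0 = 24x7)."""
    busy_hours_per_month = HOURS_PER_MONTH * utilization
    return break_even_hours(capex_usd, rental_usd_per_hr) / busy_hours_per_month


CAPEX_USD = 30_000  # card-only price quoted on this page
for rate in (3.50, 5.00):        # SXM rental range quoted on this page
    for util in (1.0, 0.25):     # illustrative utilization levels
        months = break_even_months(CAPEX_USD, rate, util)
        print(f"${rate:.2f}/hr at {util:.0%} utilization: {months:.1f} months")
```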

Verdict

Buy this if you're operating production datacenter training or inference at multi-card scale, you need the full SXM5 NVLink mesh (8-card tensor parallelism with 900 GB/s interconnect), you have or are deploying DGX-class infrastructure, and you've validated cap-ex over an 18+ month horizon vs rental. H100 SXM5 is the canonical "I run an 8× DGX H100 box for serious LLM workloads" decision and remains a sound 2026 choice when the memory ceiling allows.

Skip this if you're standing up new cap-ex (the H200 at the same socket is almost always the better buy), you're deploying a single card or don't need NVLink (H100 PCIe is cheaper and easier), your workload fits in 48 GB (L40S wins), you're doing frontier training where FP4 matters (B200), or you're a hobbyist (rent or buy consumer).

How it compares

  • vs H100 PCIe (80 GB) → Same chip, same 80 GB. SXM5 has full NVLink mesh + 700 W + DGX socket. PCIe has 350 W + standard PCIe form factor. Pick SXM5 for 4×–8× clusters; pick PCIe for 1–2 card deployments. See /compare/nvidia-h100-sxm-vs-nvidia-h100-pcie.
  • vs H200 (141 GB SXM) → Same socket, same architecture. H200 has 76% more memory + 43% more bandwidth at +5% price (DGX H200 vs DGX H100 in 2026 pricing). Pick H200 for any new build; pick H100 SXM only to match an existing H100 cluster or when you find a steep discount. See /compare/nvidia-h100-sxm-vs-nvidia-h200.
  • vs B200 (192 GB SXM) → B200 has 2.4× memory + 2.4× bandwidth + native FP4 + Transformer Engine 2 at +33% price. Pick B200 for frontier training and FP4-aggressive production; H100 SXM for proven Hopper-tier production at lower cap-ex.
  • vs A100 80GB SXM → Same memory tier, but A100 is one architecture generation older. H100 has FP8 + Transformer Engine + ~67% more bandwidth. Pick H100 SXM for FP8-exploiting workloads; pick A100 SXM for cost-conscious builds or to match existing A100 clusters.
  • vs MI300X (192 GB) → MI300X has 2.4× memory + 58% more bandwidth, often at lower enterprise pricing, but the ROCm vs CUDA ecosystem gap is real. Pick MI300X when the memory ceiling unlocks workloads and ROCm fits the stack; H100 SXM when CUDA ecosystem maturity is non-negotiable.
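
The percentage and multiple figures in these comparisons follow directly from the memory and bandwidth numbers quoted on this page. A small sketch that recomputes them for H200 and B200 (the helper name is illustrative):

```python
# (memory in GB, bandwidth in TB/s) as quoted on this page.
SPECS = {
    "H100 SXM": (80, 3.35),
    "H200": (141, 4.8),
    "B200": (192, 8.0),
}


def pct_more(new, base):
    """Percentage by which `new` exceeds `base`."""
    return (new / base - 1) * 100


base_mem, base_bw = SPECS["H100 SXM"]
for name in ("H200", "B200"):
    mem, bw = SPECS[name]
    print(
        f"{name}: {pct_more(mem, base_mem):.0f}% more memory "
        f"({mem / base_mem:.1f}x), "
        f"{pct_more(bw, base_bw):.0f}% more bandwidth ({bw / base_bw:.1f}x)"
    )
```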

Overview

What the H100 SXM actually is, in local-AI terms

The NVIDIA H100 SXM is the production datacenter GPU that defines the upper end of "self-hosted" local AI in 2026. It pairs 80 GB of HBM3 memory at ~3.35 TB/s with the Hopper-generation transformer engine and native FP8 acceleration, fourth-generation NVLink at 900 GB/s for multi-GPU scaling, and full software support across every leading-edge inference engine from TensorRT-LLM to vLLM.

It is also price-prohibitive for most "local AI" operators — a single H100 SXM module trades for roughly an order of magnitude more than an RTX 4090. The reason this page exists is not that most readers will buy one; it's that this card is the reference performance ceiling most other hardware is implicitly compared against, and understanding what it does — and where it doesn't — is essential context for picking anything below it.

Where it fits in the hardware ladder

The 2026 NVIDIA datacenter ladder:

GPU | Mem | BW | Notes
L40S | 48 GB | 864 GB/s | inference-tuned Ada Lovelace
H100 PCIe | 80 GB | 2 TB/s | datacenter, no NVLink at scale
H100 SXM | 80 GB | 3.35 TB/s | datacenter, NVLink scale-out
H200 SXM | 141 GB | 4.8 TB/s | next-gen capacity boost
B100 / B200 | 192 GB | ~8 TB/s | Blackwell successor

vs the consumer ceiling:

GPU | Mem | BW | Notes
RTX 4090 | 24 GB | 1 TB/s | consumer flagship
RTX 5090 | 32 GB | 1.79 TB/s | consumer next-gen
H100 SXM | 80 GB | 3.35 TB/s | 2.5–3.3× consumer memory, 1.9–3.4× consumer bandwidth

Best use cases

  • 70B-class production serving with concurrent users. A single H100 + vLLM + AWQ-INT4 or FP8 is the canonical multi-tenant setup.
  • Multi-tenant agentic platforms. SGLang on H100 with RadixAttention prefix-cache is the textbook high-throughput agentic backend.
  • Cluster scale-out via NVLink + InfiniBand. Where the H100 SXM truly differentiates from PCIe-only cards.
  • FP8 training and inference. Hopper's transformer engine + FP8 is the path with no consumer equivalent.
  • Self-hosted frontier model inference. DeepSeek V3, Llama 3.1 405B, and similar all assume H100-class infrastructure.

See /stacks/h100-tensor-parallel-workstation and /guides/running-local-ai-on-multiple-gpus-2026.

What it can run

Model class | Quant | Context | Concurrency
7B | FP16 | 128K | massive
32B | FP16 | 128K | substantial
70B (2× H100) | FP16 | 32K | moderate
70B | FP8 / AWQ-INT4 | 64-128K | high
405B (8× H100) | FP8 | 32K | moderate

The 80 GB single-card capacity makes 70B at FP8 or AWQ-INT4 comfortable, with 32B FP16 under heavy concurrency as the production sweet spot. 70B at FP16 needs about 140 GB for weights alone, so it takes at least two cards. For 405B-class models you need 4-8× H100s with NVLink and tensor parallelism.
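
As a cross-check on which rows fit a single card, the weights-only footprint is parameters times bytes per parameter. The sketch below is a lower bound: it ignores KV cache, activations, and quantization metadata, and the byte widths are nominal. Real tensor-parallel deployments also tend to use card counts that divide the model evenly (2, 4, 8), which this simple minimum does not account for.

```python
import math

BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}  # nominal widths
H100_VRAM_GB = 80  # per-card capacity from this page


def weights_gb(params_billion, quant):
    """Weights-only size in GB (1e9 parameters x bytes per parameter)."""
    return params_billion * BYTES_PER_PARAM[quant]


def min_cards(params_billion, quant, vram_gb=H100_VRAM_GB):
    """Fewest cards whose combined VRAM holds the weights alone."""
    return math.ceil(weights_gb(params_billion, quant) / vram_gb)


for params_b in (7, 32, 70, 405):
    for quant in ("FP16", "FP8", "INT4"):
        size = weights_gb(params_b, quant)
        cards = min_cards(params_b, quant)
        print(f"{params_b}B {quant}: {size:.1f} GB weights, >= {cards} card(s)")
```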

OS support

OS | Quality
Ubuntu 22.04 / 24.04 LTS | excellent (the production reference)
RHEL / Rocky Linux 8/9 | excellent (common in enterprise datacenters)
Other Linux | partial (distro-dependent driver packaging)
Windows | not relevant (H100 is a datacenter card)

H100 SXM modules ship in HGX baseboards and are not physically compatible with consumer motherboards. The H100 PCIe variant exists for non-HGX systems but lacks the SXM5 NVLink topology.

Software / runtime support

The H100 has the richest software stack of any GPU in this catalog:

  • TensorRT-LLM — NVIDIA's first-party serving engine; H100-tuned; FP8 transformer engine; the throughput-king path
  • vLLM — first-class H100 support with FP8 paths
  • SGLang — first-class H100 support with RadixAttention prefix-caching
  • PyTorch — first-class with cuDNN, Transformer Engine, FP8 paths
  • CUDA — reference platform; new CUDA features land H100-first

Quant formats: FP16, BF16, FP8 (Hopper-native), AWQ-INT4, GPTQ, and GGUF are all supported. EXL2 and MLX formats are off-target: H100 is not the right tool for those workloads.

What breaks first

  1. Cooling. SXM modules are designed for forced-air or liquid HGX baseboards; standalone deployments without proper airflow throttle within seconds.
  2. NVLink topology in multi-GPU configs. 4× and 8× H100 nodes have specific NVLink fabrics; misconfiguration hurts tensor-parallel scaling.
  3. NCCL version drift on cluster setups. Pin everything; mixed NCCL versions across nodes silently kill scaling.
  4. FP8 numerical stability. The transformer engine's FP8 paths work well for roughly 90% of models but occasionally need per-layer precision overrides for stability on unusual architectures.
  5. CUDA toolkit lag. New CUDA features land on H100 first but inference engines lag the toolkit by weeks.
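
For the NCCL drift item, one lightweight guard is to collect the NCCL version each node reports and flag any that disagree with the majority. The sketch below covers only the comparison step. How the versions are gathered is left out; PyTorch's `torch.cuda.nccl.version()`, which should return a version tuple in recent releases, is one possible source. The node names and version numbers are made up for illustration.

```python
from collections import Counter


def find_version_drift(versions_by_node):
    """Return the nodes whose version differs from the most common one.

    `versions_by_node` maps a node name to a version tuple such as (2, 21, 5).
    """
    if not versions_by_node:
        return {}
    majority, _count = Counter(versions_by_node.values()).most_common(1)[0]
    return {
        node: version
        for node, version in versions_by_node.items()
        if version != majority
    }


# Hypothetical inventory: names and versions are illustrative only.
reported = {
    "node-a": (2, 21, 5),
    "node-b": (2, 21, 5),
    "node-c": (2, 19, 3),
}
drifted = find_version_drift(reported)
print(drifted or "all nodes agree")
```

Running a check like this before launching a multi-node job turns a silent scaling regression into an explicit failure.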

Alternatives by intent

If you want… | Reach for
Newer / more memory | H200 (141 GB) or B100 / B200 (Blackwell)
Cheaper datacenter inference | L40S (48 GB, ~1/3 the price)
Self-hosted "consumer" path | RTX 4090 ×2 (much cheaper, much smaller models)
Apple-ecosystem self-host | Apple M3 Ultra, up to 512 GB unified memory (bandwidth-rich, compute-poor)
Cloud rental instead of buying | most operators in 2026 should rent rather than own H100s

Best pairings

  • 8× H100 SXM HGX node + TensorRT-LLM + Llama 3.1 405B FP8 — the frontier-self-host configuration
  • Single H100 SXM + vLLM + 70B AWQ-INT4 — the production serving sweet spot
  • SGLang on H100 cluster — the agentic high-throughput pattern; RadixAttention pays off most when prefix-cache hit rates are high
  • NVLink + InfiniBand fabric + Slurm or Kubernetes orchestration — the datacenter operating model
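
As a starting point for the single-card vLLM pairing, the sketch below assembles a `vllm serve` command line without executing it. The flags shown are standard vLLM engine arguments; the model ID is a hypothetical placeholder, and the context length and memory fraction are illustrative values to tune, not recommendations from this page.

```python
import shlex


def vllm_serve_command(model, tensor_parallel=1, quantization=None,
                       max_model_len=None, gpu_memory_utilization=None):
    """Build an argv list for `vllm serve`; nothing is executed here."""
    argv = ["vllm", "serve", model,
            "--tensor-parallel-size", str(tensor_parallel)]
    if quantization:
        argv += ["--quantization", quantization]
    if max_model_len:
        argv += ["--max-model-len", str(max_model_len)]
    if gpu_memory_utilization:
        argv += ["--gpu-memory-utilization", str(gpu_memory_utilization)]
    return argv


# Hypothetical model ID: substitute an AWQ checkpoint you actually have.
single_card = vllm_serve_command(
    "your-org/your-70b-awq",
    tensor_parallel=1,
    quantization="awq",
    max_model_len=32768,
    gpu_memory_utilization=0.90,
)
print(shlex.join(single_card))
```

For the 8-card pairing, the same helper would be called with `tensor_parallel=8` and an FP8 checkpoint.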

Who should avoid the H100 SXM

  • Solo operators and homelabs. The price and infrastructure overhead don't pay back for single-user workloads. Use RTX 4090 or Apple M3 Ultra.
  • Anyone without HGX-compatible infrastructure. SXM modules are not consumer-installable.
  • Workloads that cap at 32B-class models. Massive overkill; an L40S or 4090 wins on price-perf.
  • Operators who would rent compute instead. Cloud H100 rental in 2026 is a better model than ownership for most workloads.

Related

  • Stacks: /stacks/h100-tensor-parallel-workstation
  • System guides: /guides/running-local-ai-on-multiple-gpus-2026, /systems/quantization-formats
  • Tools: TensorRT-LLM, vLLM, SGLang

Featured in this stack

The L3 execution stacks that pick this hardware as a recommended component, with the one-line note explaining the role it plays in each.

  • Stack · L3 · Production tier · Role: GPUs (4× SXM5 in NVLink-Switch fabric)
    4× H100 SXM tensor-parallel workstation: frontier MoE serving reference

    H100 SXM5 in an NVLink-Switch chassis is the rare multi-GPU configuration where total VRAM ≈ effective VRAM. The 900 GB/s mesh between all 4 cards makes 4-way tensor parallelism nearly free compared with the PCIe penalty that consumer multi-GPU setups pay.


Specs

VRAM: 80 GB
Power draw (peak): 700 W
Released: 2022
MSRP: $30,000
Backends: CUDA

Models that fit

Open-weight models small enough to run on NVIDIA H100 SXM with usable context.

  • all-MiniLM-L6-v2 · 0.022B · other
  • FLUX.1 [dev] · 12B · other
  • Qwen 3 0.6B · 0.6B · qwen
  • BGE Large EN v1.5 · 0.335B · other
  • Nomic Embed Text v1.5 · 0.137B · other
  • Kokoro 82M · 0.082B · other
  • Llama 3.1 8B Instruct · 8B · llama
  • Qwen 3 30B-A3B · 30B · qwen

Frequently asked

What models can NVIDIA H100 SXM run?

With 80 GB VRAM, the NVIDIA H100 SXM runs 70B models in 4-bit quantization, plus everything smaller. See the model list above for models that fit.

Does NVIDIA H100 SXM support CUDA?

Yes — NVIDIA H100 SXM is an NVIDIA card with full CUDA support, the most mature local-AI backend. llama.cpp, Ollama, vLLM, and ExLlamaV2 all run natively.


Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify hardware specifications.

Compare alternatives

Hardware worth comparing

The closest alternatives by price, memory bandwidth, and form factor, plus a step up and down — so you can frame the buying decision against real options.

Closest matches (similar price, bandwidth & form factor)
  • Intel Gaudi 3 · intel · 128 GB VRAM · 8.2/10
  • AMD Instinct MI250X · amd · 128 GB VRAM · 9.7/10
  • NVIDIA H100 PCIe · nvidia · 80 GB VRAM · 10.0/10
  • NVIDIA H200 · nvidia · 141 GB VRAM · 10.0/10
  • Intel Gaudi 2 · intel · 96 GB VRAM · 7.9/10
  • AMD Instinct MI325X · amd · 256 GB VRAM · 10.0/10

Step up (more capable: more memory or a higher tier)
  • Intel Gaudi 3 · intel · 128 GB VRAM · 8.2/10
  • AMD Instinct MI250X · amd · 128 GB VRAM · 9.7/10
  • NVIDIA H200 · nvidia · 141 GB VRAM · 10.0/10

Step down (lighter: cheaper or more constrained)
  • Intel Gaudi 3 · intel · 128 GB VRAM · 8.2/10
  • AMD Instinct MI250X · amd · 128 GB VRAM · 9.7/10
  • NVIDIA A100 80GB SXM · nvidia · 80 GB VRAM · 9.7/10