UNIT · NVIDIA · GPU

80 GB VRAM · workstation · Reviewed June 2026

NVIDIA H100 SXM

[Diagram: NVIDIA H100 SXM. Credit: RunLocalAI · License: CC-BY-4.0 (original illustration)]

Hopper SXM5: 80 GB HBM3 at 3.35 TB/s. The GPU that defined production LLM training and inference for the current era. Cloud-rentable.

Released 2022 · 3350 GB/s memory bandwidth


Affiliate disclosure: as an Amazon Associate and partner of other retailers, we earn from qualifying purchases. The verdict on this page is our editorial opinion; affiliate links never influence what we recommend.

RUNLOCALAI SCORE


655 / 1000 · BB-tier · Estimated

Sub-score	Points
Throughput	500 / 500
VRAM-fit	190 / 200
Ecosystem	200 / 200
Efficiency	46 / 100

Sub-scores sum to 936 / 1000. Headline = 936 × 0.70 (Estimated-confidence discount) = 655. This is an algorithmic performance-tier score. It is distinct from, and often lower than, the editorial “Our verdict” below, which weighs value and real-world fit (especially for hardware we haven’t measured yet).
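The headline arithmetic can be reproduced in a few lines. This is a minimal sketch, not the site's scoring code: the sub-score values and the 0.70 discount come from this page, while the dictionary layout, the function name, and rounding to the nearest integer are assumptions made for illustration.

```python
# Sub-scores as listed on this page: (points, maximum).
SUB_SCORES = {
    "throughput": (500, 500),
    "vram_fit": (190, 200),
    "ecosystem": (200, 200),
    "efficiency": (46, 100),
}

# Multiplier applied at the "Estimated" confidence level, per the text above.
ESTIMATED_DISCOUNT = 0.70


def headline_score(sub_scores, discount):
    """Sum the sub-scores, apply the confidence discount, round to an integer.

    Rounding to nearest is an assumption; the page only shows the result.
    """
    raw = sum(points for points, _maximum in sub_scores.values())
    return raw, round(raw * discount)


raw_total, headline = headline_score(SUB_SCORES, ESTIMATED_DISCOUNT)
print(f"raw {raw_total} / 1000, headline {headline} / 1000")
```

With a discount of 1.0 the same function returns the undiscounted sum, which shows how much of the gap between the sub-scores and the headline comes from the confidence level alone.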

Extrapolated from 3350 GB/s bandwidth — 402.0 tok/s estimated. No measured benchmarks yet.
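The page does not state how 402.0 tok/s is derived from 3350 GB/s. A common first-order model for single-stream decoding assumes every generated token requires reading the model weights once, so the ceiling is bandwidth divided by bytes read per token. The sketch below uses that model, which is an assumption and not necessarily the site's formula; the per-token figure it prints is back-solved from the two published numbers, and the function names are illustrative.

```python
H100_SXM_BANDWIDTH_GB_S = 3350.0  # from this page


def decode_tokens_per_second(bandwidth_gb_s, gb_read_per_token):
    """Bandwidth-bound ceiling for single-stream decode (ignores compute and overhead)."""
    if gb_read_per_token <= 0:
        raise ValueError("gb_read_per_token must be positive")
    return bandwidth_gb_s / gb_read_per_token


def implied_gb_per_token(bandwidth_gb_s, tokens_per_second):
    """Invert the model: how many GB per token a quoted tok/s figure implies."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return bandwidth_gb_s / tokens_per_second


implied = implied_gb_per_token(H100_SXM_BANDWIDTH_GB_S, 402.0)
print(f"402.0 tok/s implies about {implied:.2f} GB read per token")

# Same model applied to a 70B model at 4 bits per weight (about 35 GB of weights).
ceiling_70b_int4 = decode_tokens_per_second(H100_SXM_BANDWIDTH_GB_S, 35.0)
print(f"70B at 4-bit: ceiling of about {ceiling_70b_int4:.0f} tok/s per stream")
```

Under this model the quoted estimate corresponds to a fairly small model; larger models scale the ceiling down in proportion to their weight footprint, and batching raises aggregate throughput well above the single-stream number.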

WORKLOAD FIT


Plain-English: Runs 70B comfortably — snappy enough for a coding agent; vision models supported.

Workload	Fit
7B chat	✓ Comfortable
14B chat	✓ Comfortable
32B chat	✓ Comfortable
70B chat	✓ Comfortable
Coding agent	✓ Comfortable
Vision (≤8B VLM)	✓ Comfortable
Long context (32K)	✓ Comfortable

Legend: ✓ Comfortable (fits with headroom) · ~ Tight (works, no slack) · △ Marginal (needs aggressive quant) · ✗ Doesn't fit usefully

Verdicts extrapolated from catalog VRAM + bandwidth + ecosystem flags. Want measured numbers? Submit your own run with runlocalai-bench --submit.

BLK · VERDICT

Our verdict

OP · Fredoline Eruo | VERIFIED JUN 12, 2026

10.0/10

What it does well

The H100 SXM5 is the GPU that defined production LLM training and inference for the modern era. It pairs 80 GB of HBM3 at 3.35 TB/s with a 700 W TDP and a full NVLink mesh (900 GB/s between cards) at the SXM5 socket level. This is the module an 8-GPU DGX H100 system uses, and it is still the dominant deployment in 2026 hyperscaler cap-ex despite the Blackwell B200 launch.

Hopper architecture features are mature: native FP8 with the first-generation Transformer Engine, dynamic FP8 scaling that delivers ~2× FP16 throughput on most modern frameworks, MIG (multi-instance GPU) for safe multi-tenant partitioning, and confidential computing extensions. The full NVIDIA stack is aggressively H100-tuned: TensorRT-LLM ships H100-specific kernels first, vLLM has its most-optimized paths on H100, and most production research papers from 2024–2025 cite H100 cluster training.

Cap-ex runs around $30,000–$32,000 retail (or $25,000+ used as B200 ramps), and SXM rental runs ~$3.50–$5.00/hr. This is the standard datacenter inference and training tier when you need the SXM5 NVLink mesh advantage.

Where it breaks

Architecture is no longer current. B200 is the 2026 flagship: 192 GB / 8 TB/s / FP4 native. For new cap-ex on frontier training workloads, B200 is the right tier.
No FP4 native. Hopper has FP8 but not FP4 — frameworks now exploiting FP4 (TRT-LLM 0.10+, vLLM v0.7+, certain quantization libraries) get meaningful additional throughput on Blackwell that H100 can't match.
DGX motherboard requirement. SXM5 doesn't fit standard PCIe servers — you need a DGX-class chassis or an HGX baseboard from Supermicro / Dell / HPE. The motherboard premium is real.
Power and thermal density. 700 W TDP per card, 8-card baseboards pull 5.5+ kW continuous. This is datacenter-only — no office or even small colo deployment.
Memory ceiling vs H200 / MI300X. 80 GB. H200 at the same socket gives 141 GB. MI300X gives 192 GB. For memory-bound large-context inference, H100 SXM is the floor of the SXM-tier.
Resale erosion is starting. Used H100 SXM has dropped from $40,000+ peaks to ~$25,000. As B200 production ramps and H200 absorbs the upper tier, expect continued price softening over 2026.

Ideal model range

Sweet spot: 70B production multi-tenant serving via vLLM continuous batching at full FP8. ~150 concurrent users on a single 8× H100 DGX node.
Sweet spot: 200B-class production at FP8 across 4×–8× H100 SXM with full NVLink mesh.
Sweet spot: Frontier-model fine-tuning (70B FP16 full-finetune across 4× H100 SXM, or 200B+ across 8× H100) — the proven training tier.
Sweet spot: 405B production inference across 8× H100 SXM with NVLink mesh — the standard 2024–2025 deployment that's still common in 2026.
Stretch: 671B (DeepSeek V3 / R1) production serving across 8× H100 SXM with paged offload to system memory.
Comfortable: Anything an A100 80GB SXM does, but with FP8 throughput improvements and modern Transformer Engine optimizations.

Bad use cases

Single-card non-DGX deployments. Pick H100 PCIe instead — same chip, half the TDP, fits any PCIe server, ~25% cheaper.
Hobbyist / single-developer workloads. Wrong tier entirely. Rent for hours; don't buy.
Anything that fits 48 GB. L40S at 1/4 the cap-ex wins for production sub-48 GB inference.
New cap-ex when H200 exists. H200 is the same socket with 76% more memory + 43% more bandwidth at +25% price. Almost always the better buy in 2026.
Frontier training where FP4 / Blackwell-gen TE2 dominate. Pick B200.
Cap-ex without a 24×7 high-utilization workload. Rent on Runpod / Lambda at $3.50–$5/hr SXM.
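The rent-versus-buy point in the last item comes down to a break-even calculation. Here is a rough sketch using only figures quoted on this page ($30,000 purchase, $3.50 to $5.00 per hour rental). It deliberately ignores chassis, power, cooling, networking, staffing, and resale value, so it understates the real cost of ownership; the function names and the utilization parameter are illustrative assumptions.

```python
HOURS_PER_DAY = 24


def break_even_hours(purchase_price_usd, rental_usd_per_hour):
    """GPU-hours of rental that cost as much as buying the card outright."""
    if rental_usd_per_hour <= 0:
        raise ValueError("rental_usd_per_hour must be positive")
    return purchase_price_usd / rental_usd_per_hour


def break_even_days(purchase_price_usd, rental_usd_per_hour, utilization):
    """Calendar days to break even when the card is busy `utilization` of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    hours = break_even_hours(purchase_price_usd, rental_usd_per_hour)
    return hours / (HOURS_PER_DAY * utilization)


PURCHASE_PRICE_USD = 30_000  # MSRP quoted on this page

for rate in (3.50, 5.00):
    for utilization in (1.0, 0.25):
        days = break_even_days(PURCHASE_PRICE_USD, rate, utilization)
        print(f"${rate:.2f}/hr at {utilization:.0%} utilization: {days:,.0f} days")
```

Even before infrastructure costs, a card that is busy a quarter of the time takes several years of rental-equivalent use to pay for itself, which is the reasoning behind renting unless the workload runs around the clock.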

Verdict

Buy this if you're operating production datacenter training or inference at multi-card scale, you need the full SXM5 NVLink mesh (8-card tensor parallelism with 900 GB/s interconnect), you have or are deploying DGX-class infrastructure, and you've validated cap-ex over an 18+ month horizon against rental. H100 SXM5 is the canonical "I run an 8× H100 DGX box for serious LLM workloads" decision and remains a sound 2026 choice when the memory ceiling allows.

Skip this if you're standing up new cap-ex (the H200 at the same socket is almost always the better buy), you're deploying a single card or don't need NVLink (H100 PCIe is cheaper and easier), your workload fits in 48 GB (L40S wins), you're doing frontier training where FP4 matters (B200), or you're a hobbyist (rent, or buy consumer hardware).

How it compares

vs H100 PCIe (80 GB) → Same chip, same 80 GB. SXM5 has full NVLink mesh + 700 W + DGX socket. PCIe has 350 W + standard PCIe form. Pick SXM5 for 4×–8× clusters; pick PCIe for 1–2 card deployments. See /compare/nvidia-h100-sxm-vs-nvidia-h100-pcie.
vs H200 (141 GB SXM) → Same socket, same architecture. H200 has 76% more memory + 43% more bandwidth at +5% price (DGX H200 vs DGX H100 in 2026 pricing). Pick H200 for any new build; pick H100 SXM only to match an existing H100 cluster or when you find a steep discount. See /compare/nvidia-h100-sxm-vs-nvidia-h200.
vs B200 (192 GB SXM) → B200 has 2.4× memory + 2.4× bandwidth + native FP4 + Transformer Engine 2 at +33% price. Pick B200 for frontier training and FP4-aggressive production; H100 SXM for proven Hopper-tier production at lower cap-ex.
vs A100 80GB SXM → Same memory tier, but A100 is one architecture generation older. H100 has FP8 + Transformer Engine + ~67% more bandwidth. Pick H100 SXM for FP8-exploiting workloads; pick A100 SXM for cost-conscious builds or to match existing A100 clusters.
vs MI300X (192 GB) → MI300X has 2.4× memory + 58% more bandwidth at often lower enterprise pricing — but ROCm vs CUDA ecosystem gap is real. Pick MI300X when memory ceiling unlocks workloads and ROCm fits the stack; H100 SXM when CUDA ecosystem maturity is non-negotiable.

BLK · OVERVIEW

Overview

What the H100 SXM actually is, in local-AI terms

The NVIDIA H100 SXM is the production datacenter GPU that defines the upper end of "self-hosted" local AI in 2026. 80 GB of HBM3 memory at ~3.35 TB/s memory bandwidth, the Hopper generation transformer engine with native FP8 acceleration, fourth-generation NVLink at 900 GB/s for multi-GPU scaling, and full software support across every leading-edge inference engine from TensorRT-LLM to vLLM.

It is also price-prohibitive for most "local AI" operators — a single H100 SXM module trades for roughly an order of magnitude more than an RTX 4090. The reason this page exists is not that most readers will buy one; it's that this card is the reference performance ceiling most other hardware is implicitly compared against, and understanding what it does — and where it doesn't — is essential context for picking anything below it.

Where it fits in the hardware ladder

The 2026 NVIDIA datacenter ladder:

GPU	Mem	BW	Notes
L40S	48 GB	864 GB/s	inference-tuned Ada-Lovelace
H100 PCIe	80 GB	2 TB/s	datacenter, no NVLink at scale
H100 SXM	80 GB	3.35 TB/s	datacenter, NVLink scale-out
H200 SXM	141 GB	4.8 TB/s	next-gen capacity boost
B100 / B200	192 GB	~8 TB/s	Blackwell — successor

vs the consumer ceiling:

GPU	Mem	BW	Notes
RTX 4090	24 GB	1 TB/s	consumer flagship
RTX 5090	32 GB	1.79 TB/s	consumer next-gen
H100 SXM	80 GB	3.35 TB/s	2.5-3.3× the consumer memory, 1.9-3.4× the bandwidth
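The relative figures quoted around these two tables (for example, H200 having 76% more memory and 43% more bandwidth) follow directly from the listed specs. The sketch below recomputes them; the numbers are copied from the tables above, and the `SPECS` layout and `ratio_vs` helper are illustrative names, not part of any tool.

```python
# (memory in GB, bandwidth in TB/s) as listed in the two tables above.
SPECS = {
    "L40S": (48, 0.864),
    "H100 PCIe": (80, 2.0),
    "H100 SXM": (80, 3.35),
    "H200 SXM": (141, 4.8),
    "B100 / B200": (192, 8.0),
    "RTX 4090": (24, 1.0),
    "RTX 5090": (32, 1.79),
}


def ratio_vs(baseline, other, specs=SPECS):
    """Return (memory ratio, bandwidth ratio) of `other` relative to `baseline`."""
    base_mem, base_bw = specs[baseline]
    mem, bw = specs[other]
    return mem / base_mem, bw / base_bw


for name in SPECS:
    if name == "H100 SXM":
        continue
    mem_ratio, bw_ratio = ratio_vs("H100 SXM", name)
    print(f"{name:12s} memory x{mem_ratio:.2f}  bandwidth x{bw_ratio:.2f}")
```

Swapping the arguments gives the view from the consumer side: `ratio_vs("RTX 5090", "H100 SXM")` shows how far the H100 SXM sits above the newest consumer card on each axis.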

Best use cases

70B-class production serving with concurrent users. A single H100 + vLLM + AWQ-INT4 or FP8 is the canonical multi-tenant setup.
Multi-tenant agentic platforms. SGLang on H100 with RadixAttention prefix-cache is the textbook high-throughput agentic backend.
Cluster scale-out via NVLink + InfiniBand. Where the H100 SXM truly differentiates from PCIe-only cards.
FP8 training and inference. Hopper's transformer engine + FP8 is the path with no consumer equivalent.
Self-hosted frontier model inference. DeepSeek V3, Llama 3.1 405B, and similar all assume H100-class infrastructure.

See /stacks/h100-tensor-parallel-workstation and /guides/running-local-ai-on-multiple-gpus-2026.

What it can run

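Model class	Quant	Context	Concurrency
7B	FP16	128K	massive
32B	FP16	128K	substantial
70B (2× H100)	FP16	32K	moderate
70B	FP8 / AWQ-INT4	64-128K	high
405B (8× H100)	FP8	32K	moderate

The 80 GB single-card capacity holds 32B at FP16 (about 64 GB of weights) and 70B at FP8 or AWQ-INT4 (about 70 GB and 35 GB of weights); the quantized 70B configurations are the single-card production sweet spot. 70B at FP16 needs about 140 GB for weights alone, so it takes at least two cards. For 405B-class at FP8 (about 405 GB of weights) you need 8× H100 with NVLink and tensor parallelism; 4× is only enough with 4-bit quantization.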
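The fit reasoning above is mostly weight-footprint arithmetic: parameters times bytes per parameter, compared against 80 GB per card. The following sketch makes that explicit. It counts weights only, so KV cache, activations, and runtime overhead are ignored, and real 4-bit formats such as AWQ carry some extra scale metadata; the byte-per-parameter table and function names are assumptions for illustration.

```python
import math

# Nominal bytes per parameter; real quantized formats add some metadata.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}
H100_VRAM_GB = 80


def weights_gb(params_billions, quant):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[quant]


def min_gpus_for_weights(params_billions, quant, vram_gb=H100_VRAM_GB):
    """Lower bound on card count: enough VRAM for the weights and nothing else."""
    return math.ceil(weights_gb(params_billions, quant) / vram_gb)


CASES = [(7, "fp16"), (32, "fp16"), (70, "fp16"), (70, "fp8"), (70, "int4"), (405, "fp8")]

for params, quant in CASES:
    size = weights_gb(params, quant)
    cards = min_gpus_for_weights(params, quant)
    print(f"{params}B {quant}: {size:.0f} GB of weights, at least {cards} card(s)")
```

The lower bound for 405B at FP8 comes out below 8 cards, but it leaves almost no room for KV cache, and tensor-parallel degrees generally need to divide the model's attention head count evenly, so practical deployments round up to 8 as the table does.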

OS support

OS	Quality
Ubuntu 22.04 / 24.04 LTS	excellent — the production reference
RHEL / Rocky Linux 8/9	excellent — common in enterprise datacenters
Other Linux	partial — distro-dependent driver packaging
Windows	not relevant — H100 is a datacenter card

H100 SXM modules ship in HGX baseboards and are not physically compatible with consumer motherboards. The H100 PCIe variant exists for non-HGX systems but lacks the SXM5 NVLink topology.

Software / runtime support

The H100 has the richest software stack of any GPU in this catalog:

TensorRT-LLM — NVIDIA's first-party serving engine; H100-tuned; FP8 transformer engine; the throughput-king path
vLLM — first-class H100 support with FP8 paths
SGLang — first-class H100 support with RadixAttention prefix-caching
PyTorch — first-class with cuDNN, Transformer Engine, FP8 paths
CUDA — reference platform; new CUDA features land H100-first

Quant formats: FP16, BF16, FP8 (Hopper-native), AWQ-INT4, GPTQ, GGUF, all supported. EXL2 / MLX-formats are off-target — H100 is not the right tool for those workloads.

What breaks first

Cooling. SXM modules are designed for forced-air or liquid HGX baseboards; standalone deployments without proper airflow throttle within seconds.
NVLink topology in multi-GPU configs. 4× and 8× H100 nodes have specific NVLink fabrics; misconfiguration hurts tensor-parallel scaling.
NCCL version drift on cluster setups. Pin everything; mixed NCCL versions across nodes silently kill scaling.
FP8 numerical stability. The transformer engine's FP8 paths are excellent for 90% of models but occasionally require per-layer precision overrides for stability on edge architectures.
CUDA toolkit lag. New CUDA features land on H100 first but inference engines lag the toolkit by weeks.

Alternatives by intent

If you want…	Reach for
Newer / more memory	H200 (141 GB) or B100 / B200 (Blackwell)
Cheaper datacenter inference	L40S (48 GB, ~1/3 the price)
Self-hosted "consumer" path	RTX 4090 ×2 — much cheaper, much smaller models
Apple-ecosystem self-host	Apple M3 Ultra (up to 512 GB unified memory); bandwidth-rich, compute-poor
Cloud rental instead of buying	most operators in 2026 should rent rather than own H100s

Best pairings

8× H100 SXM HGX node + TensorRT-LLM + Llama 3.1 405B FP8 — the frontier-self-host configuration
Single H100 SXM + vLLM + 70B AWQ-INT4 — the production serving sweet spot
SGLang on H100 cluster — the agentic high-throughput pattern; RadixAttention pays off most when prefix-cache hit rates are high
NVLink + InfiniBand fabric + Slurm or Kubernetes orchestration — the datacenter operating model
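To make the single-card and 8-card pairings concrete, the sketch below assembles `vllm serve` command lines for both shapes. It uses vLLM for both cases purely for illustration, even though the 8× pairing above names TensorRT-LLM. The model identifiers are placeholders rather than real repositories, the context lengths are picked from the ranges in the "What it can run" table, and nothing is executed: the code only builds and prints the argument lists.

```python
import shlex


def vllm_serve_command(model, tensor_parallel_size=1, quantization=None, max_model_len=None):
    """Build an argument list for `vllm serve`; the command is not run here."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    cmd = ["vllm", "serve", model, "--tensor-parallel-size", str(tensor_parallel_size)]
    if quantization is not None:
        cmd += ["--quantization", quantization]
    if max_model_len is not None:
        cmd += ["--max-model-len", str(max_model_len)]
    return cmd


# Placeholder model IDs; substitute the checkpoints you actually serve.
single_card = vllm_serve_command(
    "your-org/your-70b-awq", quantization="awq", max_model_len=65536
)
full_node = vllm_serve_command(
    "your-org/your-405b-fp8", tensor_parallel_size=8, quantization="fp8", max_model_len=32768
)

print(shlex.join(single_card))
print(shlex.join(full_node))
```

Keeping the command as a list until the last moment avoids shell-quoting mistakes and makes it easy to pass to a process launcher or a job template.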

Who should avoid the H100 SXM

Solo operators and homelabs. The price and infrastructure overhead don't pay back for single-user workloads. Use RTX 4090 or Apple M3 Ultra.
Anyone without HGX-compatible infrastructure. SXM modules are not consumer-installable.
Workloads that cap at 32B-class models. Massive overkill; an L40S or 4090 wins on price-perf.
Operators who would rent compute instead. Cloud H100 rental in 2026 is a better model than ownership for most workloads.

Related

Stacks: /stacks/h100-tensor-parallel-workstation
System guides: /guides/running-local-ai-on-multiple-gpus-2026, /systems/quantization-formats
Tools: TensorRT-LLM, vLLM, SGLang

Retailers we'd check: Amazon


Featured in this stack

The L3 execution stacks that pick this hardware as a recommended component, with the one-line note explaining the role it plays in each.

Stack · L3 · Production tier · Role: GPUs (4× SXM5 in NVLink-Switch fabric)
4× H100 SXM tensor-parallel workstation — frontier MoE serving reference
H100 SXM5 with the NVLink-Switch chassis is the only consumer-tier-or-below configuration where total VRAM ≈ effective VRAM. The 900 GB/s mesh between all 4 cards makes tensor-parallel-4 essentially free vs the PCIe penalty consumer multi-GPU pays.

BLK · SPECS

Specs

VRAM	80 GB
Power draw (peak)	700 W
Released	2022
MSRP	$30,000
Backends	CUDA

Models that fit

Open-weight models small enough to run on NVIDIA H100 SXM with usable context.

Model	Params	Family
Nomic Embed Text v1.5	0.137B	other
Kokoro 82M	0.082B	other
Llama 3.1 8B Instruct	8B	llama
Qwen 3 30B-A3B	30B	qwen

Frequently asked

What models can NVIDIA H100 SXM run?

With 80 GB of VRAM, the NVIDIA H100 SXM runs 70B models in 4-bit quantization, plus everything smaller. See the "Models that fit" list above.

Does NVIDIA H100 SXM support CUDA?

Yes — NVIDIA H100 SXM is an NVIDIA card with full CUDA support, the most mature local-AI backend. llama.cpp, Ollama, vLLM, and ExLlamaV2 all run natively.


Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify hardware specifications.

