What models can Intel Gaudi 3 run?

With 128 GB of on-card HBM2e memory, the Intel Gaudi 3 runs 70B models in 4-bit or 8-bit quantization, plus everything smaller. See the model list below for tested combinations.

Does Intel Gaudi 3 support CUDA?

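Intel Gaudi 3 does not support CUDA. Use Intel's own Gaudi software stack instead: PyTorch through the SynapseAI runtime, Hugging Face Optimum-Habana, or vLLM's Gaudi backend. Vulkan-based tools such as the llama.cpp Vulkan backend target GPUs and are not a supported path on Gaudi.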
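In practice, PyTorch code reaches a Gaudi card through a device string rather than through CUDA. The following is a minimal sketch of how a script might choose its target device. It assumes the Intel Gaudi PyTorch bridge is installed as the `habana_frameworks` package and exposes the accelerator as `"hpu"`; the helper name `pick_torch_device` is hypothetical.

```python
import importlib.util


def pick_torch_device() -> str:
    """Return the PyTorch device string to target on this machine.

    Assumption: the Intel Gaudi PyTorch bridge ships as the
    `habana_frameworks` package and registers the device name "hpu".
    """
    # find_spec returns None when a top-level package is not installed,
    # so this check works without importing torch or the Gaudi bridge.
    if importlib.util.find_spec("habana_frameworks") is not None:
        return "hpu"
    # No Gaudi bridge found: fall back to CPU rather than assuming CUDA.
    return "cpu"


device = pick_torch_device()
print(f"Target device: {device}")
```

On a Gaudi host, a script would typically import `habana_frameworks.torch.core` and then move the model with `model.to(device)`. Check the exact setup against Intel's documentation for your installed release.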

Intel Gaudi 3 for local AI

What it does well

The Gaudi 3 is Intel's most credible LLM accelerator to date and the closest the Intel ecosystem has come to competing on production inference economics. It pairs 128 GB of HBM2e at 3.7 TB/s with 24 dedicated 200 Gbps RoCEv2 NICs for cluster scale-out and a sparse-tensor compute architecture that's particularly strong on transformer attention patterns. At ~$18,000 retail (often deeply discounted on enterprise quotes), Gaudi 3 is roughly 60% the price of an H100 SXM at a similar memory tier. Intel's software story has also matured: PyTorch 2.5+ supports Gaudi via the SynapseAI runtime, Hugging Face Optimum-Habana wraps standard transformers code with minimal changes, and vLLM gained Gaudi support in late 2024. The card's real strength is rack-scale production: 8-card Gaudi 3 servers (1 TB combined memory) come in at substantially lower TCO than 8× H100 SXM nodes when the workload tolerates Intel's ecosystem path. Intel's enterprise sales motion (free integration support, generous trial periods) is real for buyers willing to engage.
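The pricing claim can be turned into a rough dollars-per-gigabyte comparison. The sketch below uses only the figures quoted above (about $18,000, 128 GB, roughly 60% of the H100 SXM price) plus the 80 GB H100 SXM capacity. The implied H100 price is derived from that ratio, not taken from a price list, and real enterprise quotes will differ.

```python
GAUDI3_PRICE_USD = 18_000      # approximate retail price quoted above
GAUDI3_MEMORY_GB = 128
PRICE_RATIO_VS_H100 = 0.60     # "roughly 60% the price of an H100 SXM"
H100_MEMORY_GB = 80
CARDS_PER_SERVER = 8


def usd_per_gb(price_usd: float, memory_gb: float) -> float:
    """Price per gigabyte of on-card memory."""
    return price_usd / memory_gb


# H100 price implied by the 60% ratio (an estimate, not a quoted price).
implied_h100_price = GAUDI3_PRICE_USD / PRICE_RATIO_VS_H100

gaudi3_usd_per_gb = usd_per_gb(GAUDI3_PRICE_USD, GAUDI3_MEMORY_GB)
h100_usd_per_gb = usd_per_gb(implied_h100_price, H100_MEMORY_GB)

# Combined memory of an 8-card server, in GB.
server_memory_gb = CARDS_PER_SERVER * GAUDI3_MEMORY_GB

print(f"Implied H100 SXM price: ${implied_h100_price:,.0f}")
print(f"Gaudi 3:  ${gaudi3_usd_per_gb:,.2f} per GB")
print(f"H100 SXM: ${h100_usd_per_gb:,.2f} per GB")
print(f"8-card Gaudi 3 server memory: {server_memory_gb} GB")
```

This covers list-price memory economics only. TCO also depends on power, networking, and the engineering time the ecosystem demands.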

Where it breaks

Software ecosystem is in third place behind NVIDIA and AMD. SynapseAI and Optimum-Habana are functional, but framework coverage, tooling, community, and day-zero new-model support all lag both CUDA and ROCm. Niche frameworks may not run at all; popular ones often need workarounds. If your team needs to deploy something tomorrow, Gaudi 3 is high-friction.
FP8 trails Hopper / Ada in practice. Gaudi 3 treats BF16/FP16 as first-class and offers INT8 quantization paths, but it doesn't deliver the FP8 throughput of H100 / H200 / B200. For workloads that aggressively exploit FP8 (most modern frameworks now do), the architecture-specific gap shows up.
Smaller cloud rental availability. Intel Tiber AI Cloud and select OEMs offer Gaudi rental, but availability is dramatically thinner than NVIDIA on Runpod / Lambda / Together. You can't easily spin up Gaudi for a weekend of experimentation.
Resale and used-market liquidity is very thin. If cap-ex doesn't pay off, exit pricing is hard to predict.
The architecture is essentially Habana's, not a product of Intel's core silicon roadmap. Intel acquired Habana in 2019, and Gaudi roadmap continuity over 5+ years is harder to bet on than NVIDIA's, especially given the shifts in Intel's broader AI strategy.
No real story for fine-tuning at scale. Inference is the focused workload. Training/fine-tuning paths exist but have substantially less framework support.

Ideal model range

Sweet spot: 70B–200B production inference at FP16 / BF16 with multi-tenant serving. A single 128 GB card fits 70B at 8-bit with 32K context, 32B FP16 with 200K context, or multi-model agentic stacks. 70B at FP16 needs about 140 GB for weights alone, so it takes at least two cards.
Sweet spot: 8× Gaudi 3 cluster (1 TB combined) for 405B-class production inference at substantially lower TCO than NVIDIA equivalents, when the ecosystem fits.
Sweet spot: Production deployments where the operator already has Intel-aligned datacenter infrastructure (Optane, Sapphire Rapids, etc.) and is willing to absorb integration cost.
Sweet spot: BF16-friendly workloads. Gaudi 3 is genuinely strong on BF16 throughput.
Stretch: Larger MoE models (DeepSeek V3 at Q3, Qwen 235B at FP8). These exceed a single card (235B at FP8 is about 235 GB of weights) but fit in a multi-card node's memory; the FP8 software paths are less optimized.
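Whether a model fits is mostly arithmetic: parameter count times bytes per parameter, plus KV cache for the context length. The following is a rough sketch of that check against the 128 GB per-card figure. The layer count, KV-head count, and head dimension are assumptions for a Llama-style 70B model with grouped-query attention, and the estimate ignores activations and runtime overhead, so treat the results as lower bounds.

```python
import math

CARD_MEMORY_GB = 128

# Bytes stored per parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int8": 1.0, "q4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Weights-only memory in GB (1e9 params x bytes per param / 1e9)."""
    return params_billion * BYTES_PER_PARAM[precision]


def kv_cache_gb(context_tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """KV cache in GB for one sequence: keys and values for every layer."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 1e9


def cards_needed(total_gb: float, card_gb: float = CARD_MEMORY_GB) -> int:
    """Minimum card count by memory alone (no overhead margin)."""
    return math.ceil(total_gb / card_gb)


# Assumed shape for a Llama-style 70B model: 80 layers, 8 KV heads, head dim 128.
kv_32k = kv_cache_gb(32 * 1024, layers=80, kv_heads=8, head_dim=128)

for precision in ("fp16", "int8", "q4"):
    total = weight_gb(70, precision) + kv_32k
    print(f"70B {precision}: {total:.1f} GB -> {cards_needed(total)} card(s)")

# 405B-class weights at BF16 against an 8-card node.
print(f"405B bf16 weights: {weight_gb(405, 'bf16'):.0f} GB "
      f"of {8 * CARD_MEMORY_GB} GB available")
```

Multi-tenant serving multiplies the KV cache term by the number of concurrent sequences, so a production deployment needs more headroom than this single-sequence estimate suggests.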

Bad use cases

Hobbyist / single-developer workloads. Wrong tier entirely. No reasonable path to a personal Gaudi 3.
CUDA-locked stacks. Don't try to outwit your existing stack. Pick CUDA hardware.
Day-zero new model architectures. Gaudi support arrives later than NVIDIA / AMD for most cutting-edge models.
Frontier training where FP4 throughput dominates. B200 is the right tier.
Anything that fits in 80 GB. H100 PCIe or L40S wins on ecosystem at lower or similar TCO.
Cap-ex without dedicated SynapseAI engineering capacity. Production Gaudi requires Intel-specific in-house engineering. Budget for it.

Verdict

Buy this if you're operating production inference at 70B–200B+ scale and you have specific reason to deploy Intel (alignment with Sapphire Rapids datacenter, Habana SDK familiarity, vendor diversification away from NVIDIA, or substantially better $/throughput on validated workloads), you have SynapseAI engineering capacity, and you've validated Gaudi 3 with your specific serving framework. The Gaudi 3 is the right pick for buyers who can absorb integration cost and whose workloads benefit from the architecture's BF16 + sparse-tensor strengths.

Skip this if your stack is CUDA / ROCm-aligned, you need day-zero new-model support, you're a hobbyist or single-user (wrong tier), your workloads fit 80 GB (H100 PCIe or L40S wins ecosystem), or you can't budget Intel-specific engineering time. For most reader queries about "should I use Intel Gaudi instead of NVIDIA," the honest answer is: only if you have a specific Intel-alignment reason.

How it compares

vs Gaudi 2 (96 GB) → Gaudi 3 has 33% more memory + ~50% more bandwidth + 2× scale-out networking + architectural refinements. Pick Gaudi 3 for new Intel builds; Gaudi 2 only for existing fleet matching at the right price discount.
vs H100 SXM (80 GB) → Gaudi 3 has 60% more memory at ~60% the price + similar bandwidth. H100 SXM has the entire NVIDIA ecosystem advantage + FP8 + NVLink mesh maturity. Pick Gaudi 3 for cost-conscious Intel-aligned inference where ecosystem is acceptable; H100 SXM for production-grade certainty.
vs H200 (141 GB SXM) → H200 has 10% more memory + 30% more bandwidth + full NVIDIA ecosystem at +70% price. Pick H200 for production certainty; Gaudi 3 only when Intel-alignment + cost reduction together justify the ecosystem trade.
vs MI300X (192 GB) → MI300X has 50% more memory + 43% more bandwidth + ROCm ecosystem (more mature than SynapseAI for most workloads). Pick MI300X over Gaudi 3 in nearly all "non-NVIDIA but want cost savings" scenarios — ROCm is in better shape than SynapseAI for production LLM inference in 2026.
vs L40S (48 GB) → L40S at 1/2 the price + Ada-gen ecosystem wins on most production inference under 48 GB. Gaudi 3 only makes sense when 128 GB on one card matters and you accept the SynapseAI integration tax.
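The percentage gaps above can be recomputed from raw specs. The sketch below does so, using the memory capacities quoted in the comparisons and the 3.7 TB/s Gaudi 3 bandwidth from earlier on this page. The other bandwidth figures (2.45, 3.35, 4.8, and 5.3 TB/s) are assumptions taken from public vendor spec sheets and should be checked against current datasheets.

```python
GAUDI3 = {"memory_gb": 128, "bandwidth_tbs": 3.7}

# Memory figures come from the comparisons above; bandwidth figures are
# assumed from public vendor spec sheets.
OTHERS = {
    "Gaudi 2": {"memory_gb": 96, "bandwidth_tbs": 2.45},
    "H100 SXM": {"memory_gb": 80, "bandwidth_tbs": 3.35},
    "H200": {"memory_gb": 141, "bandwidth_tbs": 4.8},
    "MI300X": {"memory_gb": 192, "bandwidth_tbs": 5.3},
}


def pct_more(a: float, b: float) -> float:
    """How much larger a is than b, in percent (negative if smaller)."""
    return (a / b - 1) * 100


for name, spec in OTHERS.items():
    mem = pct_more(GAUDI3["memory_gb"], spec["memory_gb"])
    bw = pct_more(GAUDI3["bandwidth_tbs"], spec["bandwidth_tbs"])
    # Positive numbers mean Gaudi 3 is ahead; negative means it trails.
    print(f"Gaudi 3 vs {name}: memory {mem:+.0f}%, bandwidth {bw:+.0f}%")
```

Capacity and bandwidth ratios say nothing about software maturity, which is the deciding factor in most of these comparisons.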

VRAM	128 GB
Power draw (peak)	900 W
Released	2024
MSRP	$18,000
Backends
