AMD Instinct MI355X
Latest CDNA 4. 288GB HBM3e — currently the highest VRAM per chip on the market.
Affiliate disclosure: as an Amazon Associate and partner of other retailers, we earn from qualifying purchases. The verdict on this page is our editorial opinion; affiliate links never influence what we recommend.
Sub-scores sum to 894 / 1000. Headline = 894 × 0.70 (Estimated-confidence discount) = 625.8, rounded to 626. This is an algorithmic performance-tier score. It is distinct from, and often lower than, the editorial “Our verdict” below, which weighs value and real-world fit (especially for hardware we haven’t measured yet).
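The headline arithmetic can be reproduced in a few lines. This is a minimal sketch: the function name is hypothetical, and round-to-nearest is an assumption about how the site turns 625.8 into 626.

```python
def headline_score(sub_score_total: int, confidence_multiplier: float) -> int:
    """Apply the confidence discount to the summed sub-scores.

    Rounding to the nearest integer is an assumption; the page only
    shows the final value.
    """
    return round(sub_score_total * confidence_multiplier)


if __name__ == "__main__":
    # 894 sub-score points with the 0.70 "Estimated" confidence discount
    print(headline_score(894, 0.70))
```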
Extrapolated from 8000 GB/s bandwidth — 800.0 tok/s estimated. No measured benchmarks yet.
Plain-English: Runs 70B comfortably — snappy enough for a coding agent.
Verdicts extrapolated from catalog VRAM + bandwidth + ecosystem flags. Hover any chip for the rationale. Want measured numbers? Submit your own run with runlocalai-bench --submit.
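A common way to extrapolate decode speed from bandwidth is to assume each generated token requires reading the active weights once, so tokens per second is at most bandwidth divided by bytes read per token. The sketch below uses that rule of thumb; it is an assumption about the general method, not the site's actual formula. Under it, 8000 GB/s and 800 tok/s imply a reference model of about 10 GB read per token.

```python
def decode_tokens_per_second(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on single-stream decode speed for a bandwidth-bound model.

    Ignores KV-cache reads, compute limits, and kernel overhead, so real
    throughput will be lower.
    """
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    bandwidth = 8000.0  # GB/s, the catalog figure quoted above

    # Weight sizes are illustrative assumptions (weights only, no overhead).
    examples = {
        "10 GB reference model": 10.0,
        "70B at 4-bit (~35 GB)": 35.0,
        "70B at FP16 (~140 GB)": 140.0,
    }
    for name, size_gb in examples.items():
        ceiling = decode_tokens_per_second(bandwidth, size_gb)
        print(f"{name}: at most {ceiling:.1f} tok/s")
```

Treat the result as a ceiling for a single request. Batched serving changes the picture because one pass over the weights is shared across many requests.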
What it does well
The MI355X is AMD's 2025 datacenter flagship and the first AMD card to credibly compete with B200 at the frontier tier. It pairs 288 GB of HBM3e with 8 TB/s of bandwidth: 50% more memory than B200 at matching bandwidth, and at substantially lower enterprise pricing ($25,000 list vs ~$40,000 for B200). Architecture-wise, MI355X is built on AMD's CDNA 4 with improved FP8 throughput, hardware FP6/FP4 paths, and meaningfully better Infinity Fabric for 8-card rack deployments. The headline LLM workload: a single MI355X fits a 405B-class model such as Llama 3.1 405B at 4-bit quantization (roughly 203 GB of weights) with room left for context, or a production-tier 200B-class model at FP8, which a 192 GB card cannot hold. 405B at FP8 (about 405 GB) and DeepSeek V3 671B at FP8 (about 671 GB) exceed one card and need a multi-card node. ROCm 6.4+ is close to inference parity for mainstream production workloads that target vLLM / SGLang / Hugging Face. Cloud rental on TensorWave, Hot Aisle, and select RunPod tiers comes in at ~$3.50–$5.00/hr, typically beating B200 rental by 25–35%.
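Whether a model fits on one card comes down to parameter count times bytes per parameter, plus headroom for KV cache and runtime buffers. The sketch below is a weights-only estimate. The 0.5 bytes per parameter for Q4 is idealized (real 4-bit formats carry some scale overhead), and the 90% usable-memory fraction is an assumption, not a measured figure.

```python
import math

# Approximate bytes per parameter for each precision (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "q4": 0.5}


def weights_gb(params_billion: float, precision: str) -> float:
    """Weight footprint in GB: billions of parameters times bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def cards_needed(params_billion: float, precision: str,
                 card_gb: float = 288.0, usable_fraction: float = 0.9) -> int:
    """Minimum card count to hold the weights, leaving headroom on each card.

    usable_fraction is an assumed allowance for KV cache and runtime buffers.
    """
    usable_gb = card_gb * usable_fraction
    return math.ceil(weights_gb(params_billion, precision) / usable_gb)


if __name__ == "__main__":
    for params, precision in [(70, "fp16"), (405, "q4"), (405, "fp8"),
                              (405, "fp16"), (671, "fp8")]:
        size = weights_gb(params, precision)
        count = cards_needed(params, precision)
        print(f"{params}B {precision}: {size:.0f} GB of weights, {count} card(s)")
```

Under these assumptions a 405B model needs one card at Q4, two at FP8, and four at FP16, while a 671B model at FP8 needs three.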
Where it breaks
- Software stack still trails CUDA on the long tail. ROCm has caught up dramatically on inference, but framework-level optimizations, day-zero support for new models, and parts of the serving and quantization toolchain (TensorRT-LLM remains CUDA-only) lag NVIDIA by weeks to months. Production ROCm deployments require in-house engineering capacity that NVIDIA workloads don't.
- FP4 payoff depends on software. CDNA 4 supports FP4 and FP6 in hardware, but realized throughput depends on ROCm kernel and quantization-library coverage, which trails the CUDA stack built around B200's second-gen Transformer Engine. For workloads that aggressively exploit FP4 throughput, benchmark on MI355X before assuming B200-class gains.
- Enterprise availability is constrained. MI355X hardware is harder to procure than NVIDIA equivalents: fewer integrators, longer lead times, less price discovery in secondary markets.
- Driver / kernel module discipline. Production ROCm requires tighter coupling between kernel module + dkms + matching userspace than NVIDIA's mature single-installer story. First-time MI355X-on-Linux is real engineering work.
- Training framework coverage is uneven. Pure inference is at parity. Training-side support (DeepSpeed, Megatron-LM, certain LoRA libraries) varies. Pure-inference deployments are the strongest fit; mixed inference+training is harder.
Ideal model range
- Sweet spot: 405B-class production inference at Q4 on a single card (roughly 203 GB of weights), and 405B FP8 or 671B-class models across two to four cards. The 288 GB memory ceiling unlocks single-card workloads that 192 GB cards cannot fit at all.
- Sweet spot: Long-context production at the 200B–405B tier (64K–256K contexts where bandwidth dominates).
- Sweet spot: Multi-tenant 70B–200B production serving via vLLM continuous batching with 32–64 concurrent users.
- Sweet spot: 8× MI355X cluster for trillion-parameter+ class inference (2,304 GB combined memory).
- Stretch: Frontier-model fine-tuning at 200B-class FP8 / 70B FP16.
- Comfortable: Anything that runs on ROCm — embedding models, classifiers at high concurrency, smaller LMs.
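For the long-context and multi-tenant cases above, KV cache is what consumes the memory left after weights. The sketch below sizes it with the standard formula for grouped-query attention (keys plus values, per layer, per KV head). The Llama 3.1 70B shape (80 layers, 8 KV heads, head dimension 128) comes from the published model config; the FP16 cache, FP8 weights, and 90% usable-memory fraction are assumptions for illustration.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB for one sequence at the given context length."""
    # Leading factor of 2 covers keys and values.
    total_bytes = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1e9


def max_sessions(card_gb: float, weights_gb: float, per_session_gb: float,
                 usable_fraction: float = 0.9) -> int:
    """Sessions that fit beside the weights if each holds a full-length cache."""
    free_gb = card_gb * usable_fraction - weights_gb
    return max(0, int(free_gb // per_session_gb))


if __name__ == "__main__":
    shape = dict(layers=80, kv_heads=8, head_dim=128)  # Llama 3.1 70B

    for context in (8_192, 32_768, 131_072):
        print(f"{context} tokens: {kv_cache_gb(context_tokens=context, **shape):.2f} GB")

    # 70B at FP8 is about 70 GB of weights; each user holds an 8K-token cache.
    per_user = kv_cache_gb(context_tokens=8_192, **shape)
    print("8K-context sessions on one card:", max_sessions(288.0, 70.0, per_user))
```

This is a worst-case count, since every session is charged a full-length cache. Paged KV allocation in servers like vLLM only allocates blocks for tokens actually in use, so practical concurrency is usually higher.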
Bad use cases
- CUDA-locked stacks. Don't pick MI355X if your team's tooling requires CUDA-only frameworks and you can't budget integration time.
- FP4-aggressive frontier training. B200 is the right tier when FP4 throughput materially helps.
- Hobbyist / single-developer workloads. Wrong tier entirely.
- Anything that fits 192 GB. MI300X at lower cap-ex covers most workloads under 192 GB.
- Cap-ex without ROCm engineering capacity. Production AMD requires more in-house engineering than NVIDIA. Budget for it explicitly.
Verdict
Buy this if you're operating production inference at frontier scale (405B–671B+), you have ROCm engineering capacity (in-house or via vendor), the 288 GB single-card memory ceiling genuinely unlocks workloads B200 / MI300X can't fit on one card, and you've validated MI355X with your serving framework. The MI355X is the right pick for the highest-memory single-card datacenter inference at AMD pricing — when ROCm fits, it's a legitimate B200 competitor.
Skip this if your stack is CUDA-only, your workloads fit 192 GB (MI300X is the value pick), you're frontier-training where FP4 / TE2 dominate (B200), you need ecosystem maturity over memory ceiling (H200 or B200), or you're a hobbyist (consumer NVIDIA wins by far).
How it compares
- vs MI325X (256 GB) → MI355X has 12.5% more memory + faster HBM3e + the newer CDNA 4 architecture. Pick MI355X for new builds when available; MI325X as the value pick at the 256 GB tier or when MI355X is supply-constrained. See /compare/amd-mi355x-vs-amd-mi325x.
- vs MI300X (192 GB) → MI355X has 50% more memory + ~51% more bandwidth (8 TB/s vs 5.3) + the newer CDNA 4 architecture. Pick MI300X for cost-conscious 192 GB-or-less workloads; MI355X when 288 GB on one card matters.
- vs B200 (192 GB) → MI355X has 50% more memory at lower cap-ex and matching bandwidth (8 TB/s each). B200 has Transformer Engine 2, a more mature FP4 software stack, and the entire NVIDIA ecosystem advantage, at a substantial price premium. Pick B200 for frontier training and FP4-exploiting production; MI355X for cost-sensitive memory-bound serving where ROCm fits. See /compare/amd-mi355x-vs-nvidia-b200.
- vs H200 (141 GB SXM) → MI355X has 2× the memory + ~67% more bandwidth (8 TB/s vs 4.8) at lower price. H200 has full NVIDIA ecosystem maturity. Pick MI355X when memory ceiling and price matter most; H200 when ecosystem certainty is non-negotiable.
- vs renting on TensorWave / Hot Aisle → Cloud rental at $3.50–$5/hr is typically 25–35% cheaper than B200 rental and ~10–20% more than MI300X. Cap-ex breakeven similar to B200 (9–12 months 24×7). Always rent MI355X first to validate ROCm fit before $25,000 cap-ex commitment.
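The rent-versus-buy comparison can be sanity-checked with a simple breakeven calculation. This sketch counts hardware price only and assumes a 30-day month. Power, cooling, hosting, and staff time are left out, and including them pushes the breakeven later than the numbers printed here.

```python
HOURS_PER_MONTH = 24 * 30  # assumes a 30-day month


def breakeven_months(capex_usd: float, rental_usd_per_hour: float,
                     utilization: float = 1.0) -> float:
    """Months of rental spend needed to equal the purchase price.

    utilization is the fraction of hours the card would actually be
    rented; 1.0 means 24x7.
    """
    monthly_rental = rental_usd_per_hour * HOURS_PER_MONTH * utilization
    return capex_usd / monthly_rental


if __name__ == "__main__":
    for rate in (3.50, 5.00):
        months = breakeven_months(25_000, rate)
        print(f"${rate:.2f}/hr at 24x7: {months:.1f} months")
    # Half-time use doubles the breakeven period.
    print(f"$5.00/hr at 50% use: {breakeven_months(25_000, 5.00, 0.5):.1f} months")
```

On hardware price alone the 24×7 breakeven is about 7 to 10 months. The 9–12 month figure above is consistent with that once operating costs are added, though the page does not say how it was derived.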
Overview
Specs
| Spec | Value |
| --- | --- |
| VRAM | 288 GB |
| Power draw (peak) | 1000 W |
| Released | 2025 |
| MSRP | $25,000 |
| Backends | ROCm |
Models that fit
Open-weight models small enough to run on AMD Instinct MI355X with usable context.
Frequently asked
What models can AMD Instinct MI355X run?
Does AMD Instinct MI355X support CUDA?
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify hardware specifications.