NVIDIA A100 40GB
Original A100. 40GB HBM2 at 1.55 TB/s. Trained the early generation of frontier models.
Affiliate disclosure: as an Amazon Associate and partner of other retailers, we earn from qualifying purchases. The verdict on this page is our editorial opinion; affiliate links never influence what we recommend.
Sub-scores sum to 907 / 1000. Headline = 907 × 0.70 (Estimated-confidence discount) = 635. This is an algorithmic performance-tier score, distinct from (and often lower than) the editorial “Our verdict” below, which weighs value and real-world fit (especially for hardware we haven’t measured yet).
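The headline arithmetic above can be reproduced in a few lines. This is only a sketch: the function name and the round-to-nearest rule are assumptions, and the site's actual scoring code is not shown on this page.

```python
def headline_score(subscore_total: int, confidence_factor: float) -> int:
    """Apply a confidence discount to the summed sub-scores.

    Rounding to the nearest integer is an assumption; the page only
    shows the already-rounded result.
    """
    return round(subscore_total * confidence_factor)


# Figures quoted on this page: 907 sub-score points, 0.70 discount.
print(headline_score(907, 0.70))
```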
Extrapolated from 1555 GB/s bandwidth — 186.6 tok/s estimated. No measured benchmarks yet.
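For context, a common rule of thumb for bandwidth-bound decoding is tokens/s ≈ memory bandwidth ÷ bytes read per generated token. The sketch below applies that rule; it is not the site's documented formula. The function name, the `efficiency` parameter, and the back-calculated footprint are all illustrative assumptions.

```python
def estimate_decode_tok_per_s(bandwidth_gb_s: float,
                              gb_read_per_token: float,
                              efficiency: float = 1.0) -> float:
    """Rough ceiling on decode speed for a memory-bandwidth-bound model.

    Each generated token streams roughly the full set of weights from
    VRAM once, so tokens/s ~= usable bandwidth / GB read per token.
    """
    if gb_read_per_token <= 0:
        raise ValueError("gb_read_per_token must be positive")
    return bandwidth_gb_s * efficiency / gb_read_per_token


A100_40GB_BANDWIDTH_GB_S = 1555  # bandwidth figure quoted on this page

# Back-calculated for illustration only (NOT a documented input): the
# per-token footprint that would reproduce the quoted 186.6 tok/s at
# 100% efficiency.
implied_gb_per_token = A100_40GB_BANDWIDTH_GB_S / 186.6

print(f"implied footprint: {implied_gb_per_token:.2f} GB per token")
print(f"{estimate_decode_tok_per_s(A100_40GB_BANDWIDTH_GB_S, implied_gb_per_token):.1f} tok/s")
```

Real decode throughput usually lands below this ceiling, which is why an `efficiency` value under 1.0 is often the more honest input.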
Plain-English: Runs 70B with care — snappy enough for a coding agent; vision models supported.
Verdicts extrapolated from catalog VRAM + bandwidth + ecosystem flags. Want measured numbers? Submit your own run with runlocalai-bench --submit.
What it does well
The A100 40GB is the cost-floor entry into datacenter-grade NVIDIA hardware in 2026. You get 40 GB HBM2 at 1.55 TB/s, Ampere-generation tensor cores, and the full CUDA datacenter stack, all at ~$11,000 retail (or $7,000–$9,000 well-circulated used). For workloads that fit 40 GB, it's still genuinely competitive: 70B at Q3 with shorter context, 32B quantized (Q4–Q8) with up to 32K context, multi-tenant 13B serving via vLLM. The card defined the LLM training era (every Llama 1, every GPT-3.5-era model, the original Stable Diffusion runs), and CUDA's sm_80 support remains first-class. The standard PCIe SKU is a PCIe Gen 4 card, so it slots into any reasonable datacenter PCIe server (no DGX/SXM4 motherboard required). Its 250 W TDP, vs 400 W for SXM4, is dramatically more practical for non-hyperscaler buyers. An NVLinked pair (via an NVLink bridge) gives 80 GB combined across two cards for ~$15,000–$17,000 used, a viable cheap path to 80 GB CUDA. Resale liquidity is strong: A100 40GB has the highest used-transaction volume of any datacenter GPU.
Where it breaks
- 40 GB ceiling is a real constraint. 70B Q4 doesn't fit 40 GB (needs ~40 GB minimum just for weights, no headroom for KV cache + context). 70B Q3 fits but quality degrades. 405B is impossible. For modern LLM workloads where 70B FP16 / Q4 is common, 40 GB is below the practical floor.
- No FP8 native. Ampere is BF16/FP16/INT8 only — modern frameworks that aggressively exploit FP8 (TRT-LLM, vLLM FP8 paths, certain quantization libraries) lose substantial throughput here vs Hopper / Blackwell.
- Bandwidth gap to newer cards. 1.55 TB/s is below the H100 SXM's 3.35 TB/s and well below H200's 4.8 TB/s. Long-context decode shows the gap clearly.
- Architecture EOL is approaching. NVIDIA still supports sm_80 in CUDA 12.x but feature parity with newer architectures fades each release. New optimizations skip Ampere.
- Resale erosion. Used pricing has dropped from $20,000+ peaks to $7,000–$9,000. As H100 and H200 absorb the upper tier, expect continued price softening.
- Cap-ex retail is hard to justify. $11,000 retail in 2026 vs renting at ~$1.00–$1.50/hr puts cap-ex breakeven at ~7,300 hours (about 10 months 24×7) at $1.50/hr, or 11,000 hours (about 15 months) at $1.00/hr. Most workloads should rent.
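The rent-versus-buy point in the last bullet comes down to simple division. Here is a minimal sketch, assuming a flat hourly rental rate and ignoring power, hosting, and resale value. The function names and the 730 hours/month constant are illustrative choices, not anything defined by this page.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12


def breakeven_hours(purchase_usd: float, rental_usd_per_hr: float) -> float:
    """GPU-hours of rental that cost as much as buying the card outright."""
    if rental_usd_per_hr <= 0:
        raise ValueError("rental rate must be positive")
    return purchase_usd / rental_usd_per_hr


def breakeven_months(purchase_usd: float, rental_usd_per_hr: float,
                     utilization: float = 1.0) -> float:
    """Calendar months to break even at a given utilization (1.0 = 24x7)."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    hours = breakeven_hours(purchase_usd, rental_usd_per_hr)
    return hours / (HOURS_PER_MONTH * utilization)


# Prices and rental rates quoted on this page.
for price in (11_000, 9_000, 7_000):
    for rate in (1.00, 1.50):
        print(f"${price:>6,} vs ${rate:.2f}/hr: "
              f"{breakeven_hours(price, rate):>7,.0f} h, "
              f"{breakeven_months(price, rate):.1f} months at 24x7, "
              f"{breakeven_months(price, rate, utilization=0.5):.1f} months at 50%")
```

Halving utilization doubles the calendar time to break even, which is the reasoning behind renting for intermittent workloads.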
Ideal model range
- Sweet spot: 32B quantized (Q4–Q8) production serving with up to 32K context, 8–16 concurrent users via vLLM. FP16 weights for a 32B model are ~64 GB and do not fit on one card.
- Sweet spot: 13B–20B class high-throughput serving — 100+ concurrent users at sub-100ms TTFT.
- Sweet spot: 70B Q3 single-card with 4–8K context — fits 40 GB tight but functional.
- Sweet spot (NVL pair): 70B Q4 across 2× A100 40GB NVLinked (80 GB combined) — the cheapest CUDA 80 GB path in 2026.
- Sweet spot: BF16 fine-tuning at 7B QLoRA, or 13B QLoRA with paged optimizer.
- Comfortable: Embedding models, classifiers, smaller LMs at very high concurrency.
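The fit claims in this list can be sanity-checked with a back-of-envelope estimate of weight and KV-cache memory. The sketch below rests on assumptions: the bits-per-weight values for quantized formats are rough averages that vary by quantization scheme, the 70B architecture numbers (80 layers, 8 KV heads, head dimension 128) describe a Llama-2-70B-style model, and runtime overhead (activations, CUDA context, fragmentation) is not modeled.

```python
A100_VRAM_GB = 40.0

# Approximate effective bits per weight. Quantized values are assumptions;
# real formats differ slightly depending on block size and scale storage.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8": 8.5, "Q4": 4.5, "Q3": 3.5}


def weight_gb(params_billion: float, fmt: str) -> float:
    """Weight footprint in GB: billions of params x bytes per param."""
    return params_billion * BITS_PER_WEIGHT[fmt] / 8.0


def kv_cache_gb(context_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """KV cache for one sequence: keys + values, every layer, FP16 by default."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9


def fits(params_billion: float, fmt: str, kv_gb: float,
         vram_gb: float = A100_VRAM_GB) -> bool:
    """True if weights plus KV cache fit, ignoring runtime overhead."""
    return weight_gb(params_billion, fmt) + kv_gb <= vram_gb


# Assumed Llama-2-70B-style shape, single sequence at 8K context.
kv_8k = kv_cache_gb(8192, n_layers=80, n_kv_heads=8, head_dim=128)

for params, fmt in ((70, "Q4"), (70, "Q3"), (13, "FP16")):
    w = weight_gb(params, fmt)
    print(f"{params}B {fmt}: weights {w:.1f} GB, "
          f"headroom before KV cache {A100_VRAM_GB - w:.1f} GB")

print(f"70B KV cache at 8K context: {kv_8k:.2f} GB")
print("70B Q3 + 8K fits:", fits(70, "Q3", kv_8k))
print("70B Q4 + 8K fits:", fits(70, "Q4", kv_8k))
```

Under these assumptions, 70B at Q4 leaves well under 1 GB of headroom before any KV cache, while Q3 leaves roughly 9 GB, consistent with the "tight but functional" description above.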
Bad use cases
- 70B+ FP16 / FP8 production inference. 40 GB ceiling kills this. Pick A100 80GB SXM, H100, or H200.
- Frontier-model anything. 200B+ class models won't fit (or won't fit well even with paged offload).
- FP8-aggressive workloads. Ampere doesn't have it. Pick H100/H200/B200.
- Single-developer hobby workloads. RTX 4090 at $1,800 has 24 GB and CUDA at 1/4 the price. A100 40GB only makes sense for production.
- Cap-ex retail in 2026. Pick used at $7,000–$9,000 or rent. Don't pay $11,000 retail.
Verdict
Buy this if you find a used A100 40GB at $7,000–$9,000, you're operating production inference for 13B–32B-class models with multi-tenant serving, your existing fleet is Ampere (matching is sensible), or you need cheap CUDA 80 GB via NVLinked pair. The A100 40GB is the cost-floor pick for datacenter-grade Ampere when 40 GB is enough.
Skip this if you need 70B+ inference (40 GB is below the floor for practical 70B serving), FP8 throughput matters, you're standing up new cap-ex (pick A100 80GB or H100 PCIe), your workload fits in 24 GB (RTX 4090 wins on $/throughput), or your utilization is intermittent (rent on Runpod / Lambda at $1.00–$1.50/hr).
How it compares
- vs A100 80GB SXM → 80GB SXM has 2× the memory + 28% more bandwidth + SXM4 NVLink mesh at ~$14,000–$17,000 used vs 40GB at ~$7,000–$9,000 used. Pick 80GB SXM for 70B-class production; 40GB for 32B-and-below value pick. See /compare/nvidia-a100-40gb-vs-nvidia-a100-80gb-sxm.
- vs H100 PCIe (80 GB) → H100 PCIe has 2× memory + 29% more bandwidth + FP8 + Hopper architecture at ~$25,000 retail. Pick H100 for new builds and FP8-exploiting workloads; A100 40GB for cost-conscious production where FP8 isn't critical.
- vs L40S (48 GB) → L40S has 20% more memory + 44% lower bandwidth (864 vs 1,555 GB/s) + Ada-generation FP8 at $7,500 retail. Pick L40S for 48 GB-floor production with an FP8 pipeline; A100 40GB for used-market value and bandwidth-bound workloads, if you can accept the tighter memory ceiling.
- vs RTX 3090 (24 GB) → A100 40GB has about 1.66× the bandwidth (1,555 vs 936 GB/s) and 67% more VRAM. Used 3090 at $700–$1,000 vs A100 40GB at $7,000–$9,000 = ~10× price ratio. Pick 3090 for hobbyist / homelab; A100 40GB only for ECC + datacenter pedigree + rack deployment.
- vs renting on Runpod / Lambda / Together → A100 40GB rents at $1.00–$1.50/hr. Against $11,000 retail, cap-ex breakeven is ~7,300 hours (about 10 months 24×7) at $1.50/hr, or 11,000 hours (about 15 months) at $1.00/hr. For workloads <50% utilization, rent. For steady-state production, buy used (don't pay retail).
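The relative figures in these comparisons are easy to get backwards, so here is a small sketch that derives them from the raw bandwidth and VRAM numbers quoted above. The helper name `pct_diff` and the dictionary layout are illustrative only.

```python
def pct_diff(value: float, baseline: float) -> float:
    """Percent difference of `value` relative to `baseline` (negative = lower)."""
    return (value - baseline) / baseline * 100.0


A100_40GB = {"bandwidth_gb_s": 1555, "vram_gb": 40}

# Bandwidth and VRAM figures as quoted in the comparisons above.
others = {
    "L40S": {"bandwidth_gb_s": 864, "vram_gb": 48},
    "RTX 3090": {"bandwidth_gb_s": 936, "vram_gb": 24},
}

for name, card in others.items():
    bw = pct_diff(card["bandwidth_gb_s"], A100_40GB["bandwidth_gb_s"])
    vram = pct_diff(card["vram_gb"], A100_40GB["vram_gb"])
    print(f"{name} vs A100 40GB: bandwidth {bw:+.0f}%, VRAM {vram:+.0f}%")

# The same comparison from the A100's side against the RTX 3090.
ratio = A100_40GB["bandwidth_gb_s"] / others["RTX 3090"]["bandwidth_gb_s"]
print(f"A100 40GB has {ratio:.2f}x the bandwidth of an RTX 3090")
```

Note the asymmetry: a card with 40% less VRAM than the A100 means the A100 has 67% more VRAM than that card, so it matters which side is the baseline.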
Specs
| VRAM | 40 GB |
| Power draw (peak) | 400 W (SXM4); 250 W (PCIe) |
| Released | 2020 |
| MSRP | $11,000 |
| Backends | CUDA |
Models that fit
Open-weight models small enough to run on NVIDIA A100 40GB with usable context.
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify hardware specifications.