NVIDIA A40
Ampere workstation/datacenter hybrid. 48GB GDDR6.
Affiliate disclosure: as an Amazon Associate and partner of other retailers, we earn from qualifying purchases. The verdict on this page is our editorial opinion; affiliate links never influence what we recommend.
Sub-scores sum to 654 / 1000. Headline = 654 × 0.70 (Estimated-confidence discount) = 458. This is an algorithmic performance-tier score, distinct from (and often lower than) the editorial “Our verdict” below, which weighs value and real-world fit (especially for hardware we haven’t measured yet).
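The headline arithmetic can be reproduced in a couple of lines. This is a minimal sketch: the function name is hypothetical, and rounding to the nearest integer is an assumption inferred from 654 × 0.70 = 457.8 being shown as 458.

```python
def headline_score(sub_score_total: int, confidence_factor: float) -> int:
    """Apply a confidence discount to the summed sub-scores.

    Assumption: the result is rounded to the nearest integer.
    """
    return round(sub_score_total * confidence_factor)


# Sub-score total and Estimated-confidence factor taken from the page.
print(headline_score(654, 0.70))  # 457.8 rounds to 458
```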
Extrapolated from 696 GB/s bandwidth — 83.5 tok/s estimated. No measured benchmarks yet.
Plain-English: Runs 70B with care — snappy enough for a coding agent; vision models supported.
Verdicts extrapolated from catalog VRAM + bandwidth + ecosystem flags. Want measured numbers? Submit your own run with runlocalai-bench --submit.
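For readers who want to see how a bandwidth-based extrapolation can work, here is a rough sketch of a common rule of thumb: single-stream decode reads the active weights once per token, so tokens per second is at most bandwidth divided by bytes read per token. The page does not state its exact formula, so the function names and the "GB read per token" framing are assumptions, not the site's method.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on single-stream decode speed for a bandwidth-bound model."""
    return bandwidth_gb_s / gb_read_per_token


def implied_gb_per_token(bandwidth_gb_s: float, tok_s: float) -> float:
    """Invert the rule of thumb: how much data per token a tok/s figure implies."""
    return bandwidth_gb_s / tok_s


A40_BANDWIDTH_GB_S = 696.0  # from the page
ESTIMATED_TOK_S = 83.5      # from the page

implied = implied_gb_per_token(A40_BANDWIDTH_GB_S, ESTIMATED_TOK_S)
print(f"{implied:.2f} GB read per token")  # 8.34 GB read per token
```

Under this rule of thumb, the 83.5 tok/s estimate corresponds to a reference workload that reads roughly 8 GB per generated token. Real throughput also depends on kernel efficiency, KV-cache traffic, and batching, so treat the figure as a ceiling rather than a prediction.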
What it does well
The A40 is the Ampere-generation 48 GB datacenter card and the cheapest path to 48 GB CUDA in a rack form factor in 2026. It combines 48 GB of ECC GDDR6 at 696 GB/s, Ampere tensor cores, and the full CUDA datacenter stack at $5,500 retail (or $3,000–$4,500 used, where supply is plentiful). Despite being two architecture generations behind in 2026, the A40 retains genuinely useful properties for production inference: comfortable 70B Q4 single-card hosting (48 GB fits 70B Q4 with 16K context), strong 32B 8-bit production serving (FP16 weights for a 32B model run to roughly 64 GB and do not fit on one card), and rack-grade discipline (vBIOS + ECC + 5-year warranty + SR-IOV vGPU). The 300 W, passively cooled dual-slot card drops into any standard PCIe Gen 4 server with adequate chassis airflow. Hyperscalers deployed A40 widely from 2021–2023, so used-market liquidity is excellent: pricing has settled and you can consistently find clean A40s with documented service history. For buyers who want a 48 GB CUDA datacenter card at a deep discount and accept the architecture gap, A40 is genuinely good value.
Where it breaks
- Two architecture generations behind in 2026. Ada Lovelace (L40S, RTX 6000 Ada) and Blackwell (RTX PRO 6000 Blackwell) deliver dramatically better tensor compute, FP8 native support, and architecture-specific optimizations. New CUDA features land on Ada / Blackwell first.
- No native FP8. Ampere tensor cores cover TF32, BF16, FP16, and INT8 but not FP8, so frameworks that rely on FP8 throughput get no speedup on this card.
- Bandwidth gap. 696 GB/s is below L40S (864 GB/s) and well below H100 PCIe (2 TB/s). Decode is bandwidth-bound, so long-context generation is slower than on current-generation cards.
- Display engine designed for visualization. The A40 began as a virtualization and professional-graphics card before being repurposed for inference. Its display engine does nothing for AI workloads but still takes up die area.
- Resale erosion. Used pricing dropped from $4,500–$5,000 in 2023 to $3,000–$4,000 in 2026 as L40S absorbed the 48 GB inference market. Continued softening expected.
- End-of-feature-support risk. sm_86 Ampere support remains in CUDA 12.x but new optimizations skip Ampere; bug fix horizon is limited.
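The FP8 limitation comes down to compute capability. The sketch below is a pure-Python check under one stated assumption: that sm_89 (Ada) is the lowest compute capability with FP8 tensor cores, with Hopper and Blackwell above it. The function name is hypothetical. At runtime the capability tuple could come from something like `torch.cuda.get_device_capability()`, but the example hardcodes the A40's sm_86 so it runs without a GPU.

```python
from typing import Tuple

# Assumption: Ada (sm_89) is the first compute capability with FP8 tensor cores.
FP8_MIN_CAPABILITY: Tuple[int, int] = (8, 9)


def has_native_fp8(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability has FP8 tensor cores."""
    return capability >= FP8_MIN_CAPABILITY


A40_CAPABILITY = (8, 6)  # sm_86, as stated for the A40 above

print(has_native_fp8(A40_CAPABILITY))  # False
print(has_native_fp8((8, 9)))          # True (Ada-class, e.g. L40S)
```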
Ideal model range
- Sweet spot: 70B Q4 single-card production inference with 8–16K context. 25–35 tok/s decode at single-tenant — fine for SMB-tier production.
- Sweet spot: 32B 8-bit (INT8) production serving with 32K context, 8–16 concurrent users via vLLM continuous batching. FP16 weights for a 32B model are roughly 64 GB and exceed the card's 48 GB.
- Sweet spot: 13B–20B class high-throughput serving — 100+ concurrent users at sub-100ms TTFT.
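- Sweet spot: QLoRA fine-tuning of 7B–13B models with BF16 compute and a paged optimizer.
- Sweet spot (NVLink pair): 70B at 8-bit across 2× A40 NVLinked (96 GB combined). 70B FP16 weights alone are about 140 GB and do not fit in 96 GB, so 8-bit is the viable cheap path to higher-precision 70B on CUDA in 2026.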
- Comfortable: Embedding models, classifiers, smaller LMs at very high concurrency.
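The 70B Q4 with 16K context case can be sanity-checked with back-of-envelope arithmetic: weights plus KV cache must fit under 48 GB. This is a rough sketch, not a measured footprint. It assumes about 4.5 effective bits per weight for a Q4 quantization, a Llama-style 70B layout (80 layers, 8 KV heads, head dimension 128), and an FP16 KV cache; all function and constant names are hypothetical.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache for one sequence: keys and values for every layer and token."""
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens / 1e9


VRAM_GB = 48.0  # A40 capacity from the spec table

# Assumptions: ~4.5 bits/weight for Q4; Llama-style 70B shape; FP16 KV cache.
w = weights_gb(70, 4.5)
kv = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, context_tokens=16 * 1024)
total = w + kv

print(f"weights {w:.1f} GB + KV {kv:.1f} GB = {total:.1f} GB of {VRAM_GB:.0f} GB")
print("fits" if total < VRAM_GB else "does not fit")
```

Under these assumptions the total lands a few GB under the card's capacity, which is why this configuration works but leaves little headroom for activations, runtime overhead, or longer contexts.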
Bad use cases
- FP8-aggressive inference workloads. No native FP8. Pick Hopper / Ada / Blackwell.
- Frontier-scale models. 48 GB doesn't fit 100B+ class models without aggressive partial offload.
- Buying new at retail in 2026. Pick used at $3,000–$4,000 or rent. Don't pay $5,500 retail when L40S at $7,500 is the architecturally current 48 GB datacenter pick at a modest premium.
- Workstation deployment. RTX A6000 (Ampere) at similar prices is the workstation-form 48 GB Ampere SKU — A40 is rack-only.
- Single-developer hobby workloads. RTX 4090 at $1,800 wins for everything that fits 24 GB.
- Production deployments with a 5+ year horizon. Ampere architecture sunset is approaching.
Verdict
Buy this if you find a used A40 at $3,000–$4,000, you're standing up production inference at SMB tier where 48 GB matters and a current architecture isn't critical, you have a 3–4 year operational horizon, and your stack is BF16/FP16-friendly (FP8 throughput isn't the limiting factor). A40 is the right "value 48 GB datacenter Ampere" pick for cost-conscious production buyers.
Skip this if you're standing up new builds (pick L40S at $7,500 for the architecturally current path), you need FP8 (Hopper, Ada, or Blackwell), you're building a workstation (RTX 6000 Ada is the right SKU), you only need 24 GB at the lowest cost (a used 3090 wins at $700), or you have a 5+ year horizon (architecture sunset risk).
How it compares
- vs L40S (48 GB) → L40S has Ada-gen FP8 + ~24% more bandwidth + same memory + datacenter pedigree at $7,500 retail. A40 used at $3,500 is ~half the price for two-gen-older silicon. Pick L40S for new builds and FP8-exploiting workloads; A40 used for cost-conscious 48 GB inference where FP8 isn't critical. See /compare/nvidia-a40-vs-nvidia-l40s.
- vs RTX A6000 Ampere (48 GB) → Same architecture, same memory tier. A6000 is workstation-form (PCIe blower with active display outputs, two-card NVLink pairing, Studio drivers); A40 is rack-form (passively cooled, display outputs disabled by default, vBIOS for VM passthrough). Used pricing is similar at roughly $3,500–$4,500. Pick by deployment context.
- vs A100 40GB → A100 40GB has HBM2 (1.55 TB/s vs 696 GB/s — 2.2× the bandwidth) but 17% less memory. Pick A100 40GB for bandwidth-bound workloads (long-context decode); A40 for memory-ceiling-bound workloads (where 48 GB fits and 40 GB doesn't).
- vs RTX 6000 Ada (48 GB) → 6000 Ada has Ada-gen architecture + FP8 + ~38% more bandwidth + ISV cert + Studio drivers at $6,799 retail. A40 used at half the price. Pick 6000 Ada for serious workstation builds; A40 for value buys + rack deployments.
- vs RTX PRO 6000 Blackwell (96 GB) → PRO 6000 Blackwell has 96 GB, the Blackwell architecture, and 1.79 TB/s at $8,499. Different tier entirely. Choose A40 only when budget forces a value Ampere pick.
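The percentages in these comparisons follow from simple ratios against the A40's 696 GB/s and 48 GB. The sketch below recomputes them; the helper names are hypothetical. The L40S and A100 40GB bandwidth figures are the ones quoted on this page, while 960 GB/s for RTX 6000 Ada is an assumption not stated here.

```python
A40_BANDWIDTH_GB_S = 696
A40_VRAM_GB = 48


def pct_more(value: float, baseline: float) -> float:
    """Percentage by which value exceeds baseline."""
    return (value / baseline - 1) * 100


def pct_less(value: float, baseline: float) -> float:
    """Percentage by which value falls short of baseline."""
    return (1 - value / baseline) * 100


# Bandwidth in GB/s. RTX 6000 Ada's 960 GB/s is an assumption (not on this page).
l40s, rtx6000_ada, a100_40gb = 864, 960, 1550

print(f"L40S: +{pct_more(l40s, A40_BANDWIDTH_GB_S):.0f}% bandwidth")
print(f"RTX 6000 Ada: +{pct_more(rtx6000_ada, A40_BANDWIDTH_GB_S):.0f}% bandwidth")
print(f"A100 40GB: {a100_40gb / A40_BANDWIDTH_GB_S:.1f}x bandwidth, "
      f"{pct_less(40, A40_VRAM_GB):.0f}% less memory")
```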
Overview
Ampere workstation/datacenter hybrid. 48GB GDDR6.
Specs
| Spec | Value |
| --- | --- |
| VRAM | 48 GB |
| Power draw (peak) | 300 W |
| Released | 2020 |
| MSRP | $5,500 |
| Backends | CUDA |
Models that fit
Open-weight models small enough to run on NVIDIA A40 with usable context.
Frequently asked
What models can NVIDIA A40 run?
Does NVIDIA A40 support CUDA?
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify hardware specifications.