NVIDIA A40 for local AI

What it does well

The A40 is the Ampere-generation 48 GB datacenter card and, in 2026, the cheapest path to 48 GB of CUDA memory in a rack form factor. It pairs 48 GB of ECC GDDR6 at 696 GB/s with Ampere tensor cores and the full CUDA datacenter stack, at $5,500 retail or $3,000–$4,500 used. Although it is two architecture generations behind, the A40 retains useful properties for production inference: comfortable single-card hosting of 70B Q4 (48 GB fits the weights plus 16K context), strong 32B serving at 8-bit (FP16 weights for 32B are about 64 GB and do not fit on one card), and rack-grade features (vBIOS, ECC, 5-year warranty, SR-IOV vGPU). The 300 W, passively cooled dual-slot card drops into any standard PCIe Gen 4 server that provides chassis airflow. Hyperscalers deployed the A40 widely from 2021 to 2023, so used-market liquidity is excellent: pricing has settled and clean A40s with documented service history are consistently available. For buyers who want a 48 GB CUDA datacenter card at a deep discount and accept the architecture gap, the A40 is good value.
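The 70B Q4 fit claim can be checked with simple arithmetic. The sketch below is an estimate under stated assumptions, not a measurement: it assumes about 4.5 bits per weight for a typical 4-bit quantization, an FP16 KV cache, and a Llama-style 70B shape (80 layers, 8 KV heads, head dimension 128). None of those figures come from this page, and the function names are illustrative.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache for one sequence: a K and a V vector per layer, KV head and token."""
    values = 2 * layers * kv_heads * head_dim * context_tokens
    return values * bytes_per_value / 1e9


# Assumed model shape and quantization density (hypothetical, see lead-in).
w = weights_gb(params_billion=70, bits_per_weight=4.5)
kv = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, context_tokens=16 * 1024)
total = w + kv

print(f"weights  {w:5.1f} GB")
print(f"KV cache {kv:5.1f} GB")
print(f"total    {total:5.1f} GB of 48 GB")
```

Under these assumptions the total lands a few GB below 48 GB, which leaves limited headroom for activations and runtime overhead and explains why 16K is quoted as the comfortable context ceiling.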

Where it breaks

Two architecture generations behind in 2026. Ada Lovelace (L40S, RTX 6000 Ada) and Blackwell (RTX PRO 6000 Blackwell) deliver dramatically better tensor compute, FP8 native support, and architecture-specific optimizations. New CUDA features land on Ada / Blackwell first.
No native FP8. Ampere tensor cores cover TF32, BF16, FP16, INT8, and INT4, but not FP8. Frameworks that exploit FP8 throughput get no speedup here.
Bandwidth gap. 696 GB/s is below the L40S (864 GB/s) and well below the H100 PCIe (2 TB/s). Decode is memory-bandwidth-bound, most visibly at long context, so the A40 trails current-generation cards there.
Display engine designed for visualization. The A40 was originally a virtualization and professional graphics card before pivoting to inference workloads. The chip carries display-engine resources that contribute nothing to AI workloads but still consume die area.
Resale erosion. Used pricing dropped from $4,500–$5,000 in 2023 to $3,000–$4,000 in 2026 as L40S absorbed the 48 GB inference market. Continued softening expected.
End-of-feature-support risk. sm_86 Ampere support remains in CUDA 12.x but new optimizations skip Ampere; bug fix horizon is limited.
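The bandwidth gap above can be turned into a rough decode ceiling. This is a back-of-envelope sketch, assuming single-stream decode reads every weight byte once per token and ignoring KV-cache reads and kernel overheads. The 20 GB model footprint is a hypothetical figure chosen for illustration, and the function name is not from any library.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode speed for a bandwidth-bound model."""
    return bandwidth_gb_s / weights_gb


# Memory bandwidth figures quoted in the list above, in GB/s.
BANDWIDTH_GB_S = {"A40": 696, "L40S": 864, "H100 PCIe": 2000}

MODEL_GB = 20.0  # hypothetical quantized model footprint

ceilings = {name: decode_ceiling_tok_s(bw, MODEL_GB)
            for name, bw in BANDWIDTH_GB_S.items()}

for name, tok_s in ceilings.items():
    print(f"{name:10s} <= {tok_s:6.1f} tok/s")
```

Real throughput sits below these ceilings, but the ratios between cards track the bandwidth ratios closely for decode-heavy workloads.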

Ideal model range

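Sweet spot: 70B Q4 single-card production inference with 8–16K context. Single-stream decode is bandwidth-bound: reading roughly 35–40 GB of 4-bit weights per token at 696 GB/s caps it near 17–20 tok/s before overheads, which is still workable for SMB-tier production.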
Sweet spot: 32B production serving at 8-bit (about 32 GB of weights; FP16 weights are about 64 GB and exceed one card) with 32K context, 8–16 concurrent users via vLLM continuous batching.
Sweet spot: 13B–20B class high-throughput serving — 100+ concurrent users at sub-100ms TTFT.
Sweet spot: QLoRA fine-tuning of 7B–13B models with BF16 compute and a paged optimizer.
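Sweet spot (NVL pair): 70B at 8-bit (about 70 GB of weights) across 2× A40 NVLinked (96 GB combined). FP16 weights for 70B are about 140 GB and do not fit in 96 GB, so the pair is a cheap path to 70B above 4-bit precision rather than to full FP16.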
Comfortable: Embedding models, classifiers, smaller LMs at very high concurrency.

Bad use cases

FP8-aggressive inference workloads. No native FP8. Pick Hopper / Ada / Blackwell.
Frontier-scale models. 48 GB doesn't fit 100B+ class models without aggressive partial offload.
Buying new at retail in 2026. Pick used at $3,000–$4,000 or rent. Don't pay $5,500 retail when the L40S at $7,500 is the architecturally current 48 GB datacenter pick at a modest premium.
Workstation deployment. RTX A6000 (Ampere) at similar prices is the workstation-form 48 GB Ampere SKU — A40 is rack-only.
Single-developer hobby workloads. RTX 4090 at $1,800 wins for everything that fits 24 GB.
Production deployments with a 5+ year horizon. The Ampere architecture sunset is approaching.

Verdict

Buy this if you find a used A40 at $3,000–$4,000, you're standing up production inference at SMB tier where 48 GB matters and a current architecture isn't critical, you have a 3–4 year operational horizon, and your stack is BF16/FP16-friendly (FP8 throughput isn't the limiting factor). The A40 is the right "value 48 GB datacenter Ampere" pick for cost-conscious production buyers.

Skip this if you're standing up new builds (pick the L40S at $7,500 for the architecturally current path), you need FP8 (Hopper, Ada, or Blackwell), you're building a workstation (RTX 6000 Ada is the right SKU), you only need 24 GB at the lowest cost (a used 3090 wins at $700), or you have a 5+ year horizon (architecture sunset risk).

How it compares

vs L40S (48 GB) → L40S has Ada-gen FP8 + ~24% more bandwidth + same memory + datacenter pedigree at $7,500 retail. A40 used at $3,500 is ~half the price for older Ampere silicon. Pick L40S for new builds and FP8-exploiting workloads; A40 used for cost-conscious 48 GB inference where FP8 isn't critical. See /compare/nvidia-a40-vs-nvidia-l40s.
vs RTX A6000 Ampere (48 GB) → Same architecture, same memory tier. A6000 is workstation-form (actively cooled blower, display outputs enabled, Studio drivers); A40 is rack-form (passive cooling, display outputs disabled by default, vBIOS for VM passthrough). Both support a two-card NVLink bridge. Used pricing similar ~$3,500–$4,500. Pick by deployment context.
vs A100 40GB → A100 40GB has HBM2 (1.55 TB/s vs 696 GB/s — 2.2× the bandwidth) but 17% less memory. Pick A100 40GB for bandwidth-bound workloads (long-context decode); A40 for memory-ceiling-bound workloads (where 48 GB fits and 40 GB doesn't).
vs RTX 6000 Ada (48 GB) → 6000 Ada has Ada-gen architecture + FP8 + ~38% more bandwidth + ISV cert + Studio drivers at $6,799 retail. A40 used at half the price. Pick 6000 Ada for serious workstation builds; A40 for value buys + rack deployments.
vs RTX PRO 6000 Blackwell (96 GB) → PRO 6000 Blackwell has 96 GB / Blackwell architecture / 1.79 TB/s at $8,499. Different tier entirely. Pick A40 only when budget forces the value Ampere choice.
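The price and bandwidth trade-offs in this comparison reduce to two ratios: dollars per GB of VRAM and bandwidth uplift over the A40. The sketch below uses only the prices and bandwidth figures quoted above, taking $3,500 as the used A40 price. The dictionary layout and function names are illustrative.

```python
# Figures quoted in the comparison above: price in USD, bandwidth in GB/s.
CARDS = {
    "A40 (used)":             {"price": 3500, "vram_gb": 48, "bw": 696},
    "A40 (retail)":           {"price": 5500, "vram_gb": 48, "bw": 696},
    "L40S":                   {"price": 7500, "vram_gb": 48, "bw": 864},
    "RTX PRO 6000 Blackwell": {"price": 8499, "vram_gb": 96, "bw": 1790},
}


def usd_per_gb(card: dict) -> float:
    """Purchase price divided by VRAM capacity."""
    return card["price"] / card["vram_gb"]


def bandwidth_uplift_pct(card: dict, baseline: dict) -> float:
    """Percentage by which a card's memory bandwidth exceeds the baseline's."""
    return (card["bw"] / baseline["bw"] - 1) * 100


baseline = CARDS["A40 (used)"]
for name, card in CARDS.items():
    print(f"{name:24s} ${usd_per_gb(card):7.2f}/GB  "
          f"{bandwidth_uplift_pct(card, baseline):+6.1f}% bandwidth vs A40")
```

With these inputs the used A40 comes out near $73 per GB against about $156 per GB for the L40S, which is the core of the value argument.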

Frequently asked

What models can NVIDIA A40 run?

With 48GB VRAM, the NVIDIA A40 runs 70B models in 4-bit quantization, plus everything smaller. See the model list below for tested combinations.

Does NVIDIA A40 support CUDA?

Yes. The A40 has full CUDA support (Ampere, compute capability 8.6, sm_86), and CUDA is the most mature local-AI backend. llama.cpp, Ollama, vLLM, and ExLlamaV2 all run natively.
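CUDA support is not the same as FP8 support, and the distinction follows from the compute capability. The sketch below is a minimal check, assuming FP8 tensor cores first appear at compute capability 8.9 (Ada) and 9.0 (Hopper). The helper name is hypothetical. On a live system with PyTorch installed, torch.cuda.get_device_capability() returns the (major, minor) tuple to pass in.

```python
def supports_native_fp8(capability: tuple) -> bool:
    """True if a (major, minor) compute capability has FP8 tensor cores.

    Assumption: FP8 starts at sm_89 (Ada) and sm_90 (Hopper); earlier
    architectures, including Ampere sm_80 and sm_86, do not have it.
    """
    return tuple(capability) >= (8, 9)


# Compute capabilities for cards discussed on this page.
for name, cap in {"A40": (8, 6), "A100": (8, 0),
                  "L40S": (8, 9), "H100": (9, 0)}.items():
    print(f"{name:5s} sm_{cap[0]}{cap[1]}  native FP8: {supports_native_fp8(cap)}")
```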

VRAM	48 GB
Power draw (peak)	300 W
Released	2020
MSRP	$5,500
Backends	CUDA

