NVIDIA L40 for local AI

What it does well

The L40 is the L40S's value-tier sibling for production inference deployments where FP8 throughput isn't the limiting factor. It has the same 48 GB of GDDR6 ECC memory at 864 GB/s, the same Ada-generation Tensor Core architecture, and the same PCIe Gen 4 x16 form factor, at ~$8,000 retail vs. the L40S's ~$8,500. The L40 runs less aggressive clock and power targets than the L40S. The L40S is positioned as a dual-purpose creative and inference card; the L40 is the lower-power variant of the same design rather than a part tuned specifically for inference. For 70B Q4 single-card inference, 32B-class production serving at 8-bit or lower precision (32B at FP16 needs about 64 GB for weights alone, more than the card holds), or any inference workload that fits in 48 GB and isn't critically dependent on FP8 throughput, the L40 delivers ~85–90% of L40S throughput at a slightly lower price. Datacenter-grade ECC, the 5-year warranty, vBIOS support for VM passthrough, and SR-IOV all work identically on both cards. Power draw caps at a 300 W TDP, below the L40S's 350 W, which helps in dense rack deployments where every watt counts.

Where it breaks

Lower FP8 throughput than L40S. The L40S clocks its Ada Tensor Cores more aggressively, which shows up most on FP8 inference workloads. On TensorRT-LLM or vLLM FP8 paths, expect the L40S to be ~10–15% faster. For BF16/FP16-only workloads the gap closes considerably.
Pricing gap to L40S is small. The ~$500 difference (about 6% of the L40's price) buys ~10–15% more inference throughput on the L40S. Most production buyers should pay the modest premium for the L40S unless a budget or power cap rules it out.
Architecture is one generation behind Blackwell. RTX PRO 6000 Blackwell and other Blackwell-tier cards have native FP4 and the second-generation Transformer Engine; L40 is firmly Ada-generation.
Limited consumer-facing software ergonomics. Like the L40S, this is a datacenter SKU: its display outputs are disabled by default, and there are no consumer driver paths and no game tuning. Workstation buyers should pick RTX 6000 Ada instead at a similar price tier.
Resale liquidity is thin. L40 has lower transaction volume than L40S in secondary markets — exit pricing is harder to predict.
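
The price-versus-throughput trade-off above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the article's approximate retail prices and its ~10–15% FP8 speedup estimate; the function names are illustrative, and street prices will differ.

```python
def cost_per_throughput(price_usd: float, relative_throughput: float) -> float:
    """Dollars paid per unit of relative throughput (L40 = 1.0)."""
    return price_usd / relative_throughput


def break_even_speedup(base_price_usd: float, premium_price_usd: float) -> float:
    """Speedup the pricier card needs to match the cheaper card's $/throughput."""
    return premium_price_usd / base_price_usd


L40_PRICE = 8000.0   # ~$8,000 retail, as quoted in the article
L40S_PRICE = 8500.0  # ~$8,500 retail, as quoted in the article

print(f"Break-even speedup: {break_even_speedup(L40_PRICE, L40S_PRICE):.4f}x")
for speedup in (1.10, 1.15):  # the article's estimated FP8 advantage range
    l40 = cost_per_throughput(L40_PRICE, 1.0)
    l40s = cost_per_throughput(L40S_PRICE, speedup)
    print(f"L40S at {speedup:.2f}x: L40 ${l40:,.0f} vs L40S ${l40s:,.0f} per unit")
```

Under these assumptions the L40S only needs to be about 6% faster to break even, so the L40 comes out ahead on $/throughput only when the discount is larger or the workload sees little of the FP8 gain.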

Ideal model range

Sweet spot: 70B Q4–Q5 single-card serving with 16K context at ~25–40 tok/s decode, 4–8 concurrent users via vLLM continuous batching.
Sweet spot: 32B-class production serving — 32B at ~70–110 tok/s decode, 8–16 concurrent users at 32K context.
Sweet spot: 13B–20B-class high-throughput serving — 200+ concurrent users at sub-100ms TTFT.
Sweet spot: BF16/FP16 production where FP8 isn't the bottleneck — embeddings, classifiers, smaller LMs.
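Stretch: 70B at 8-bit across 2× L40 with PCIe-only tensor parallelism (roughly 10–20% slower than an NVLink-connected pair). 70B at FP16 needs about 140 GB for weights alone, which exceeds the 96 GB that two cards provide.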
Comfortable: Anything an RTX 4080 does, but at 3× the memory ceiling and with ECC + datacenter pedigree.
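
Whether a model fits, and how fast a single stream can decode, both follow from simple arithmetic on weight size. The sketch below is a rough estimator, not a benchmark: the 4.5 bits per weight used for "Q4" and the 4 GB allowance for KV cache and runtime overhead are assumptions chosen for illustration, and the function names are hypothetical.

```python
L40_VRAM_GB = 48.0          # from the article
L40_BANDWIDTH_GB_S = 864.0  # from the article


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1e9 params * bits / 8 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float = L40_VRAM_GB, overhead_gb: float = 4.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in VRAM."""
    return weight_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


def decode_ceiling_tok_s(params_billion: float, bits_per_weight: float,
                         bandwidth_gb_s: float = L40_BANDWIDTH_GB_S) -> float:
    """Single-stream upper bound: every token reads all weights once."""
    return bandwidth_gb_s / weight_gb(params_billion, bits_per_weight)


for name, params, bits in [("70B Q4 (assumed 4.5 bpw)", 70, 4.5),
                           ("70B FP16", 70, 16),
                           ("32B FP16", 32, 16),
                           ("32B 8-bit", 32, 8)]:
    print(f"{name}: {weight_gb(params, bits):.1f} GB weights, "
          f"fits on one L40: {fits(params, bits)}, "
          f"single-stream ceiling: {decode_ceiling_tok_s(params, bits):.1f} tok/s")
```

The ceiling applies to one sequence at a time. Aggregate throughput under continuous batching can exceed it, because one pass over the weights serves every sequence in the batch.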

Bad use cases

Single-developer hobby workloads. RTX 4090 at 1/4 the price wins for everything that fits 24 GB.
Workstation tower deployment. Pick RTX 6000 Ada — same memory tier, more workstation-friendly thermal design + display outputs + Studio drivers.
FP8-aggressive inference. Pay the modest premium for L40S if your workloads exploit FP8 throughput.
Frontier-model training. H200 or B200 is the right tier.
Memory-bound long-context decode. H100 PCIe at 2 TB/s wins for bandwidth-dominated workloads.

Verdict

Buy this if you find an L40 at a meaningfully lower price than the L40S (a discount of more than $500, or roughly $7,000 used), your production workloads are BF16/FP16 rather than FP8-heavy, and you're optimizing $/throughput on Ada-generation 48 GB inference. The L40 is the right pick for the cost-conscious buyer who has already chosen "datacenter Ada 48 GB" and wants the value variant.

Skip this if the L40S is available at a ~$500 premium (its FP8 advantage is almost always worth it), you're deploying in a workstation (RTX 6000 Ada is the workstation SKU at a similar price), you need Blackwell-generation features (RTX PRO 6000 Blackwell for workstations, B200 for the datacenter), or you're cost-sensitive and your models fit on a consumer card (RTX 4090).

How it compares

vs L40S (48 GB) → Same architecture, same 48 GB, ~10–15% less FP8 throughput at ~$500 less. Pick L40S for FP8-aggressive workloads (almost always worth $500); L40 only when discount is meaningful or workloads are FP16/BF16 only. See /compare/nvidia-l40-vs-nvidia-l40s.
vs RTX 6000 Ada (48 GB) → Same memory tier, same architecture, similar bandwidth. RTX 6000 Ada is the workstation SKU (Studio drivers, display outputs, active cooling); note that it has no NVLink connector, so multi-card setups are PCIe-only on both cards. L40 is the datacenter SKU (rack form, vBIOS, SR-IOV). Pick by deployment context. At $6,799 retail, RTX 6000 Ada is also about $1,200 cheaper than the L40's ~$8,000.
vs A40 (48 GB Ampere) → A40 is one architecture generation older with the same 48 GB at ~$5,500 retail / $4,000–$4,500 used. Pick L40 for new builds that need Ada-generation features (FP8 and better Tensor Core performance). Pick A40 when upfront cost matters more than those features.
vs H100 PCIe (80 GB) → H100 PCIe wins on bandwidth (2 TB/s vs 864 GB/s), memory ceiling (80 GB vs 48 GB), Hopper-generation FP8 + Transformer Engine. L40 wins on cap-ex (1/3 the price). For 70B-class inference where 48 GB suffices, L40 is the value pick; for >48 GB or bandwidth-bound workloads, H100 PCIe.
vs RTX 4090 (24 GB) → 4090 has marginally higher bandwidth (1.0 TB/s) and similar Ada compute, at half the VRAM. Pick 4090 for hobbyist 24 GB; L40 when you need 48 GB + ECC + datacenter pedigree.
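
For the 48 GB cards compared above, cost per gigabyte of VRAM is a quick way to see the price spread. This sketch uses only the approximate retail prices quoted in this article; the `CARDS` table and function names are illustrative, and the metric ignores throughput, bandwidth, and feature differences.

```python
# (approximate retail price in USD, VRAM in GB), all figures from the article
CARDS = {
    "L40": (8000, 48),
    "L40S": (8500, 48),
    "RTX 6000 Ada": (6799, 48),
    "A40": (5500, 48),
}


def dollars_per_gb(price_usd: float, vram_gb: float) -> float:
    """Retail dollars per GB of VRAM."""
    return price_usd / vram_gb


def rank_by_dollars_per_gb(cards: dict) -> list:
    """Card names sorted from cheapest to most expensive per GB."""
    return sorted(cards, key=lambda name: dollars_per_gb(*cards[name]))


for name in rank_by_dollars_per_gb(CARDS):
    price, vram = CARDS[name]
    print(f"{name}: ${dollars_per_gb(price, vram):,.2f} per GB")
```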

Frequently asked

What models can NVIDIA L40 run?

With 48 GB of VRAM, the NVIDIA L40 runs 70B models in 4-bit quantization, plus everything smaller. See the model list below for tested combinations.

Does NVIDIA L40 support CUDA?

Yes — NVIDIA L40 is an NVIDIA card with full CUDA support, the most mature local-AI backend. llama.cpp, Ollama, vLLM, and ExLlamaV2 all run natively.
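
To confirm that the driver actually sees the card before installing any of these backends, one option is to query `nvidia-smi`. The sketch below wraps its CSV query mode; the helper names are hypothetical, and the sample string in the comment is illustrative rather than captured output.

```python
import subprocess

# nvidia-smi's CSV query mode: one line per GPU, "name, memory.total" (MiB)
QUERY = ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpus(csv_text: str) -> list:
    """Parse nvidia-smi CSV output into (name, memory_mib) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so the name stays in one piece
        name, mem_mib = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, int(mem_mib)))
    return gpus


def list_gpus() -> list:
    """Run nvidia-smi and return the parsed GPU list (needs the NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpus(result.stdout)


# Illustrative input only, not real output: parse_gpus("NVIDIA L40, 46068\n")
```

Calling `list_gpus()` on a host with the driver installed returns one tuple per visible GPU. The reported memory is usually somewhat below the nominal 48 GB.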

Specs

VRAM	48 GB
Power draw (peak)	300 W
Released	2022
MSRP	$8,000
Backends	CUDA
