NVIDIA L40
Original Ada datacenter. Slower than L40S. 48GB GDDR6.
Affiliate disclosure: as an Amazon Associate and partner of other retailers, we earn from qualifying purchases. The verdict on this page is our editorial opinion; affiliate links never influence what we recommend.
Sub-scores sum to 719 / 1000. Headline = 719 × 0.70 (Estimated-confidence discount) = 503. This is an algorithmic performance-tier score, distinct from (and often lower than) the editorial “Our verdict” below, which weighs value and real-world fit, especially for hardware we haven’t measured yet.
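The headline arithmetic above can be reproduced in a few lines. This is only a sketch of the stated calculation: the function name and the round-to-nearest rule are assumptions, since the page gives the inputs and the result but not the rounding method.

```python
# Inputs quoted on this page; names and rounding rule are illustrative assumptions.
SUBSCORE_TOTAL = 719                   # sum of sub-scores, out of 1000
ESTIMATED_CONFIDENCE_DISCOUNT = 0.70   # applied when no measured benchmarks exist


def headline_score(subscore_total: int, discount: float) -> int:
    """Apply the confidence discount and round to the nearest integer."""
    return round(subscore_total * discount)


print(headline_score(SUBSCORE_TOTAL, ESTIMATED_CONFIDENCE_DISCOUNT))
```

With the page's inputs, 719 × 0.70 is 503.3, which rounds to the 503 shown in the headline.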
Extrapolated from 864 GB/s bandwidth — 103.7 tok/s estimated. No measured benchmarks yet.
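The page does not state how the estimate is derived from bandwidth. One common rule of thumb, used here purely as an assumption, is that single-stream decode is memory-bound, so tokens per second is roughly bandwidth divided by the bytes read per token. Under that assumption, the sketch below works backwards from the two published numbers to the per-token read size they would imply.

```python
def decode_tok_per_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Memory-bound upper estimate: every token streams the weights once."""
    return bandwidth_gb_s / gb_read_per_token


def implied_gb_per_token(bandwidth_gb_s: float, tok_per_s: float) -> float:
    """Invert the rule of thumb to see what footprint an estimate implies."""
    return bandwidth_gb_s / tok_per_s


# 864 GB/s and 103.7 tok/s are the figures quoted on this page.
implied = implied_gb_per_token(864, 103.7)
print(f"{implied:.2f} GB read per token")  # prints "8.33 GB read per token"
```

An implied read of about 8.3 GB per token suggests the estimate refers to a small reference model rather than a 70B-class one, but that reading is an inference, not something the page confirms.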
Plain-English: Runs 70B with care — snappy enough for a coding agent; vision models supported.
Verdicts extrapolated from catalog VRAM + bandwidth + ecosystem flags. Want measured numbers? Submit your own run with runlocalai-bench --submit.
What it does well
The L40 is the L40S's value-tier sibling for production inference deployments where FP8 throughput isn't the limiting factor. It has the same 48 GB of GDDR6 with ECC at 864 GB/s, the same Ada-generation Tensor Core architecture, and the same PCIe Gen 4 x16 form factor, at ~$8,000 retail vs the L40S's ~$8,500. The L40 runs slightly less aggressive clock targets and lacks some of the L40S's display engine pipeline (the L40S was designed as a dual-purpose creative + inference card; the L40 is positioned more narrowly at inference yet is, in practice, the less inference-tuned of the two). For 70B Q4 single-card inference, 32B production serving at 8-bit weights (32B at FP16 is ~64 GB of weights alone and does not fit in 48 GB), or any inference workload that fits 48 GB and isn't critically dependent on FP8 throughput, the L40 delivers ~85–90% of L40S throughput at a slightly lower price. Datacenter-grade ECC, the 5-year warranty, vBIOS support for VM passthrough, and SR-IOV all work identically. Power draw caps at 300 W TDP, 50 W below the L40S's 350 W, which is useful for dense rack deployments where every watt counts.
Where it breaks
- Lower FP8 throughput than L40S. The L40S has more aggressive Ada Tensor Core clocking specifically for FP8 inference workloads. On TRT-LLM or vLLM FP8 paths, expect L40S to be ~10–15% faster. For BF16/FP16-only workloads the gap closes considerably.
- Pricing gap to L40S is small. $500 difference for ~10–15% more inference throughput on L40S. Most production buyers should pay the modest premium for L40S unless specifically constrained.
- Architecture is one generation behind Blackwell. RTX PRO 6000 Blackwell and other Blackwell-tier cards have native FP4 and the second-generation Transformer Engine (TE2); L40 is firmly Ada-generation.
- Limited consumer-facing software ergonomics. Like the L40S, this is a datacenter SKU — no display outputs (or minimal), no consumer driver paths, no game-tuning. Workstation buyers should pick RTX 6000 Ada instead at a similar price tier.
- Resale liquidity is thin. L40 has lower transaction volume than L40S in secondary markets — exit pricing is harder to predict.
Ideal model range
- Sweet spot: 70B Q4–Q5 single-card serving with 16K context at ~25–40 tok/s decode, 4–8 concurrent users via vLLM continuous batching.
- Sweet spot: 32B-class production serving — 32B at ~70–110 tok/s decode, 8–16 concurrent users at 32K context.
- Sweet spot: 13B–20B-class high-throughput serving — 200+ concurrent users at sub-100ms TTFT.
- Sweet spot: BF16/FP16 production where FP8 isn't the bottleneck — embeddings, classifiers, smaller LMs.
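- Stretch: 70B at 8-bit across 2× L40 with PCIe-only TP, at a ~10–20% penalty versus a comparable NVLink-connected pair. 70B at FP16 needs ~140 GB for weights alone, more than the 96 GB that two cards provide.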
- Comfortable: Anything an RTX 4080 does, but at 3× the memory ceiling and with ECC + datacenter pedigree.
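To sanity-check whether a configuration from the list above fits in 48 GB, a rough weights-plus-KV-cache estimate is usually enough. The sketch below makes several assumptions that are not from this page: about 4.5 bits per weight for a Q4 quantization, a Llama-style 70B layout (80 layers, 8 KV heads, head dimension 128), an FP16 KV cache, and "16K" read as 16,384 tokens. It ignores activations and runtime overhead.

```python
VRAM_GB = 48  # L40 memory capacity, from the spec table


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight footprint in GB (1e9 bytes) for a given quantization width."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """KV cache in GB: keys and values (factor 2) for every layer and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9


# Assumed figures: Q4 at ~4.5 bits/weight, Llama-style 70B architecture.
w = weights_gb(70, 4.5)
kv = kv_cache_gb(16_384, layers=80, kv_heads=8, head_dim=128)
total = w + kv

print(f"weights {w:.1f} GB + KV {kv:.1f} GB = {total:.1f} GB of {VRAM_GB} GB")
print("fits" if total < VRAM_GB else "does not fit")
```

Under these assumptions a single full 16K-token sequence fits with a few GB to spare. Several concurrent users would have to share that remaining KV budget, for example through shorter contexts or a quantized KV cache.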
Bad use cases
- Single-developer hobby workloads. RTX 4090 at 1/4 the price wins for everything that fits 24 GB.
- Workstation tower deployment. Pick RTX 6000 Ada — same memory tier, more workstation-friendly thermal design + display outputs + Studio drivers.
- FP8-aggressive inference. Pay the modest premium for L40S if your workloads exploit FP8 throughput.
- Frontier-model training. H200 or B200 is the right tier.
- Memory-bound long-context decode. H100 PCIe at 2 TB/s wins for bandwidth-dominated workloads.
Verdict
Buy this if you find an L40 at a meaningfully lower price than the L40S (more than $500 below it, or in ~$7,000 used territory), your production workloads are BF16/FP16 (not FP8-aggressive), and you're optimizing $/throughput on Ada-generation 48 GB inference. The L40 is the right pick for the cost-conscious buyer who has already chosen "datacenter Ada 48 GB" and wants the value variant.
Skip this if the L40S is available at a $500 premium (the L40S wins on FP8 throughput and is almost always worth it), you're deploying at the workstation tier (RTX 6000 Ada is the workstation SKU at a similar price), you need Blackwell-generation features (RTX PRO 6000 Blackwell for workstation, B200 for datacenter), or you're cost-sensitive and your workloads fit consumer cards (RTX 4090).
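The price threshold in the verdict can be checked with simple $/throughput arithmetic. The sketch below uses only figures quoted on this page (L40S at ~$8,500, L40 at ~$8,000, L40 at ~85–90% of L40S throughput); the function names are illustrative, and the relative-throughput range is itself an estimate rather than a measurement.

```python
L40S_PRICE = 8500               # approximate retail, from this page
L40_PRICE = 8000                # approximate retail, from this page
REL_THROUGHPUT = (0.85, 0.90)   # L40 throughput as a fraction of L40S


def cost_per_unit_throughput(price: float, relative_throughput: float) -> float:
    """Dollars paid per unit of L40S-equivalent throughput."""
    return price / relative_throughput


def break_even_price(reference_price: float, relative_throughput: float) -> float:
    """Price at which the slower card matches the reference on $/throughput."""
    return reference_price * relative_throughput


for r in REL_THROUGHPUT:
    per_unit = cost_per_unit_throughput(L40_PRICE, r)
    target = break_even_price(L40S_PRICE, r)
    print(f"at {r:.0%} of L40S: ${per_unit:,.0f} per unit vs ${L40S_PRICE:,} "
          f"for L40S; break-even L40 price ${target:,.0f}")
```

On these numbers the L40 only matches the L40S on $/throughput somewhere between roughly $7,225 and $7,650, which is consistent with the ~$7,000 used-price guidance above.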
How it compares
- vs L40S (48 GB) → Same architecture, same 48 GB, ~10–15% less FP8 throughput at ~$500 less. Pick L40S for FP8-aggressive workloads (almost always worth $500); L40 only when discount is meaningful or workloads are FP16/BF16 only. See /compare/nvidia-l40-vs-nvidia-l40s.
- vs RTX 6000 Ada (48 GB) → Same memory tier, same architecture, similar bandwidth. RTX 6000 Ada is the workstation SKU (Studio drivers, display outputs, workstation-friendly thermal design). L40 is the datacenter SKU (rack form, vBIOS, SR-IOV). Neither card has NVLink, so multi-card setups are PCIe-only on both. Pick by deployment context. RTX 6000 Ada at $6,799 retail is also cheaper.
- vs A40 (48 GB Ampere) → A40 is one architecture generation older with similar memory at ~$5,500 retail / $4,000–$4,500 used. Pick L40 for new builds that need Ada-generation features (FP8 and better Tensor Core performance). Pick A40 if you are a cost-conscious value buyer.
- vs H100 PCIe (80 GB) → H100 PCIe wins on bandwidth (2 TB/s vs 864 GB/s), memory ceiling (80 GB vs 48 GB), Hopper-generation FP8 + Transformer Engine. L40 wins on cap-ex (1/3 the price). For 70B-class inference where 48 GB suffices, L40 is the value pick; for >48 GB or bandwidth-bound workloads, H100 PCIe.
- vs RTX 4090 (24 GB) → 4090 has somewhat higher bandwidth (~1.0 TB/s vs 864 GB/s) and similar Ada compute, at half the VRAM. Pick 4090 for hobbyist workloads that fit 24 GB; L40 when you need 48 GB + ECC + datacenter pedigree.
Specs
| VRAM | 48 GB |
| Power draw (peak) | 300 W |
| Released | 2022 |
| MSRP | $8,000 |
| Backends | CUDA |
Models that fit
Open-weight models small enough to run on NVIDIA L40 with usable context.
Frequently asked
What models can NVIDIA L40 run?
Does NVIDIA L40 support CUDA?
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify hardware specifications.