NVIDIA L40S
Ada-generation datacenter card with 48 GB of GDDR6, popular with cloud GPU rental providers as a budget H100 alternative.
Affiliate disclosure: as an Amazon Associate and partner of other retailers, we earn from qualifying purchases. The verdict on this page is our editorial opinion; affiliate links never influence what we recommend.
Sub-scores sum to 715 / 1000. Headline = 715 × 0.70 (Estimated-confidence discount) = 500.5, shown as 500. This is an algorithmic performance-tier score, distinct from (and often lower than) the editorial “Our verdict” below, which weighs value and real-world fit, especially for hardware we haven’t measured yet.
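The headline arithmetic can be reproduced in a few lines. This is only a sketch of the calculation as the paragraph describes it: the function name is hypothetical, and truncating 500.5 to 500 is an assumption, since the page does not say how the displayed figure is rounded.

```python
import math

def headline_score(subscore_total: int, discount_percent: int) -> float:
    """Apply a confidence discount (given in whole percent) to the summed sub-scores."""
    # Integer percent avoids binary floating-point error from multiplying by 0.70.
    return subscore_total * discount_percent / 100

raw = headline_score(715, 70)   # 715 x 0.70
displayed = math.floor(raw)     # assumption: the page truncates to a whole number

print(raw, displayed)
```

Under these assumptions the raw value is 500.5 and the displayed headline is 500, matching the figure quoted above.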
Extrapolated from 864 GB/s bandwidth — 103.7 tok/s estimated. No measured benchmarks yet.
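The page does not publish its extrapolation formula. A common first-order model of memory-bound decode is that every generated token requires streaming the active weights once, so tok/s is roughly bandwidth divided by bytes read per token. The sketch below inverts that relation to show what the quoted estimate would imply under this assumption; the function names are hypothetical.

```python
def implied_gb_per_token(bandwidth_gb_s: float, tokens_per_s: float) -> float:
    """GB that must be read per token for the estimate to hold (roofline assumption)."""
    return bandwidth_gb_s / tokens_per_s

def decode_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on single-stream decode speed when memory bandwidth is the limit."""
    return bandwidth_gb_s / gb_read_per_token

gb = implied_gb_per_token(864, 103.7)
print(round(gb, 2))  # GB read per generated token implied by the estimate
```

If this model is the one in use, 103.7 tok/s at 864 GB/s corresponds to about 8.33 GB read per token. Larger models read more per token and decode proportionally slower.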
Plain-English: Runs 70B with care — snappy enough for a coding agent; vision models supported.
Verdicts extrapolated from catalog VRAM + bandwidth + ecosystem flags. Want measured numbers? Submit your own run with runlocalai-bench --submit.
What it does well
The L40S is the cleanest "production inference at moderate scale" GPU NVIDIA sells. 48 GB of GDDR6 with ECC at 864 GB/s is enough memory to fit 70B Q4 with 16K context entirely on one card, and enough bandwidth to keep the math units fed for typical decode. It runs the full CUDA + cuDNN + TensorRT-LLM stack, and the major production serving frameworks (vLLM, SGLang, TensorRT-LLM) support it.
Power draw caps at 350 W, versus 700 W on an H100 SXM, so per-card thermal density in a 4U chassis is about half that of an SXM H100, which is why it suits dense inference clusters. The PCIe Gen 4 x16 form factor (no NVLink) means it plugs into any modern server: no SXM motherboard premium, no cooling headache, no DGX.
Pricing at $7,500–$8,500 retail is roughly 1/3 to 1/4 of an H100 PCIe, while the performance gap on inference (as opposed to training) workloads is closer to 1.5×–2× than 3×–4×. For most 70B-class deployments, that makes the L40S the better $/throughput. NVIDIA's vBIOS, ECC RAM, and 5-year warranty are real datacenter-grade differentiators versus a consumer 4090 or 5090 in production.
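The "70B Q4 with 16K context on one card" claim can be sanity-checked with a rough memory budget. The sketch below is an estimate, not a measurement. It assumes a Llama-style 70B layout (80 layers, 8 KV heads under grouped-query attention, head dimension 128), an FP16 KV cache, and about 4.5 effective bits per weight for a Q4 quantization. None of these values come from the page, and activations and runtime overhead are ignored.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight footprint in GB: 1e9 params * bits / 8 bytes, expressed in units of 1e9 bytes."""
    return params_billion * bits_per_weight / 8

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache for one sequence; the factor 2 covers keys and values."""
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens
    return total_bytes / 1e9

VRAM_GB = 48  # nominal; usable memory is somewhat lower in practice

w = weights_gb(70, 4.5)                   # assumed ~4.5 bits/weight for Q4
kv = kv_cache_gb(80, 8, 128, 16 * 1024)   # assumed Llama-style 70B shape, 16K tokens
total = w + kv

print(round(w, 1), round(kv, 1), round(total, 1), round(VRAM_GB - total, 1))
```

With these assumptions the weights take about 39.4 GB and a single 16K-token sequence about 5.4 GB, leaving roughly 3 GB of headroom. That is consistent with the description elsewhere on this page of the card running 70B "with care".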
Where it breaks
- Memory bandwidth is the bottleneck, not compute. 864 GB/s is well below the H100 PCIe's 2 TB/s and the RTX 5090's 1.79 TB/s. For memory-bound decode (the dominant inference workload), single-batch speed tracks bandwidth, so an L40S will decode more slowly than a 5090 on any model that fits both cards: similar FP8 ops/s, but less than half the bandwidth. (70B Q4 itself does not fit in the 5090's 32 GB.)
- No NVLink. Tensor parallelism across 2× L40S has to traverse PCIe Gen 4 x16 (about 32 GB/s per direction). Two cards (96 GB) cover 70B at FP8; 70B FP16 needs 140 GB for weights alone, so it takes more than two cards. PCIe-only TP introduces ~10–20% overhead vs NVLink-equipped H100/H200 setups. Acceptable, but not free.
- Training is the wrong workload here. Yes, it has FP8/BF16 throughput. No, you should not be picking L40S for training over an H200 or A100 at scale — training is bandwidth-and-NVLink-sensitive in ways inference isn't.
- Limited consumer software paths. Ollama, LM Studio, llama.cpp all run fine, but the ergonomics are oriented around vLLM/SGLang/TensorRT-LLM. If you're a hobbyist running a single model, you're paying for ECC + datacenter cooling features you don't need.
- Power requirements are real. 350 W TDP needs a serious PSU and case airflow. Not for a desktop tower without thoughtful cooling.
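Two of the limits above reduce to simple arithmetic: how many 48 GB cards are needed just to hold the weights, and how far behind the L40S sits on bandwidth. The sketch below covers weights only, so KV cache, activations, and any tensor-parallel degree constraints in the serving stack can push real card counts higher. The function names are hypothetical.

```python
import math

L40S_VRAM_GB = 48
L40S_BANDWIDTH_GB_S = 864

def min_cards_for_weights(params_billion: float, bytes_per_param: float,
                          vram_gb: float = L40S_VRAM_GB) -> int:
    """Lower bound on card count from weight size alone."""
    return math.ceil(params_billion * bytes_per_param / vram_gb)

def bandwidth_ratio(other_gb_s: float, base_gb_s: float = L40S_BANDWIDTH_GB_S) -> float:
    """How many times more memory bandwidth another card has than the L40S."""
    return other_gb_s / base_gb_s

print(min_cards_for_weights(70, 2))      # 70B at FP16: 140 GB of weights
print(min_cards_for_weights(70, 1))      # 70B at FP8: 70 GB of weights
print(round(bandwidth_ratio(2000), 2))   # H100 PCIe at 2 TB/s
print(round(bandwidth_ratio(1790), 2))   # RTX 5090 at 1.79 TB/s
```

By this count 70B needs at least three cards at FP16 and two at FP8, and the H100 PCIe and RTX 5090 have about 2.3× and 2.1× the L40S's bandwidth respectively. For memory-bound single-stream decode, that ratio is a reasonable first guess at the speed gap.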
Ideal model range
- Sweet spot: 70B Q4–Q5 single-card serving with 16K context at ~30–50 tok/s decode, 4–8 concurrent users via vLLM continuous batching. The everyday production sweet spot for "we run our own 70B."
- Sweet spot: 32B-class production serving — 32B at ~80–120 tok/s decode, 8–16 concurrent users, 32K context. Best $/req-served on this card class.
- Sweet spot: 13B–20B-class high-throughput serving — 200+ concurrent users at sub-100ms TTFT.
- Stretch: 70B FP8 across 2× L40S (96 GB total) via tensor parallelism over PCIe. Works, ~10–20% TP penalty vs H100 NVLink. 70B FP16 is 140 GB of weights alone, so it needs more than two cards.
- Comfortable: Embedding models, classifiers, smaller LMs at very high batch — the L40S is essentially compute-bound here.
Bad use cases
- Single-user hobby workloads. A used 3090 or 4090 is ~1/4 the price for similar single-user performance on most workloads. ECC + 5-year warranty + vBIOS is wasted on a single-developer rig.
- Frontier-model training. Pick H200 (141 GB) or rent B200 at scale.
- Anywhere bandwidth dominates. Long-context decode on huge prompts is a 2 TB/s-plus card's job, not an L40S's.
- Buying retail at MSRP for one-off use. L40S in cloud rental ($1.50–$2.50/hr on Runpod/Lambda) makes more sense for intermittent workloads than a $7,500 cap-ex.
Verdict
Buy this if you're standing up production inference for 70B Q4 / 32B at full / multi-tenant 13B serving in your own datacenter or colo, you need ECC + datacenter warranty + dense rack thermals, your serving stack is vLLM/SGLang/TensorRT-LLM, and you've calculated $/throughput against H100 PCIe and concluded L40S wins. This is the canonical "production-grade self-hosted inference at SMB scale" GPU.
Skip this if you're a hobbyist or single-user developer (a 4090/5090 is dramatically better value per dollar), you need long-context heavy throughput (H200 or rent B200), you're training (wrong tool), or you want the lowest total cost of ownership for intermittent workloads (rent on Runpod or Lambda instead). For most readers Googling "L40S vs 4090 for local AI," the right answer is: 4090 for hobbyist, L40S for production multi-tenant, rent for everything in between.
How it compares
- vs RTX 4090 (24 GB) → 4090 has ~1.16× bandwidth (1 TB/s) and roughly equivalent FP16 perf at a fraction of the price. L40S has 2× memory + ECC + datacenter warranty + SR-IOV. Pick 4090 for hobby and dev rigs; pick L40S for production. See /compare/nvidia-l40s-vs-rtx-4090.
- vs H100 PCIe (80 GB) → H100 wins on bandwidth (2 TB/s vs 864 GB/s), memory ceiling (80 GB vs 48 GB), and NVLink for multi-card. L40S wins on $/card (about 1/3 the price). Power is roughly a wash against the PCIe card, which is also rated around 350 W; the 1/2-TDP advantage applies only against the 700 W SXM variant. Pick H100 for frontier/long-context; pick L40S for 70B-class production serving where you'd never use the H100's extra headroom. See /compare/nvidia-l40s-vs-nvidia-h100-pcie.
- vs RTX 6000 Ada (48 GB, often miswritten as "A6000 Ada") → Same memory (48 GB), similar bandwidth band, broadly equivalent inference perf. L40S is the datacenter SKU; RTX 6000 Ada is the workstation SKU. Pick RTX 6000 Ada for under-the-desk workstation use; pick L40S for rack deployment.
- vs renting on Runpod / Lambda → L40S rents for ~$1.50–$2.50/hr on most providers. At ~$8,000 cap-ex, breakeven vs always-on rental is ~3,200–5,300 hours, or roughly 4.4–7.3 months of 24×7 utilization (before power and hosting costs). If your workload is intermittent (<50% utilization), rent. If it's steady-state production, buy.
- vs DGX Spark → Different markets entirely. DGX Spark is a desk-side dev box with ARM CPU + Grace memory targeting 200B+ MoE local development. L40S is a rack inference card for production serving. Don't confuse them.
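The rent-versus-buy breakeven in the comparison above is easy to recompute for other prices or utilization levels. This sketch uses the ~$8,000 purchase price and the $1.50–$2.50/hr rental range quoted on this page. It deliberately ignores power, hosting, and resale value, and the 730 hours-per-month constant is a simplifying assumption (8,760 hours / 12).

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12, an averaging assumption

def breakeven_hours(capex_usd: float, rate_usd_per_hr: float) -> float:
    """Rental hours whose cost equals the purchase price."""
    return capex_usd / rate_usd_per_hr

def breakeven_months(capex_usd: float, rate_usd_per_hr: float,
                     utilization: float = 1.0) -> float:
    """Calendar months to break even at a given utilization (1.0 = 24x7)."""
    return breakeven_hours(capex_usd, rate_usd_per_hr) / (HOURS_PER_MONTH * utilization)

for rate in (2.50, 1.50):
    hours = breakeven_hours(8000, rate)
    print(rate, round(hours),
          round(breakeven_months(8000, rate), 1),        # always-on
          round(breakeven_months(8000, rate, 0.5), 1))   # 50% utilization
```

At full utilization the breakeven falls between about 4.4 and 7.3 months; at 50% utilization it doubles to roughly 8.8 to 14.6 months, which is why intermittent workloads favor renting.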
Specs
| VRAM | 48 GB |
| Power draw (peak) | 350 W |
| Released | 2023 |
| MSRP | $8,500 |
| Backends | CUDA |
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify hardware specifications.