NVIDIA L40S for local AI

Q: What models can NVIDIA L40S run?

With 48GB VRAM, the NVIDIA L40S runs 70B models in 4-bit quantization, plus everything smaller. See the model list below for tested combinations.

Q: Does NVIDIA L40S support CUDA?

Yes — NVIDIA L40S is an NVIDIA card with full CUDA support, the most mature local-AI backend. llama.cpp, Ollama, vLLM, and ExLlamaV2 all run natively.

What it does well

The L40S is the cleanest "production inference at moderate scale" GPU NVIDIA sells. 48 GB of GDDR6 with ECC at 864 GB/s is enough memory to fit 70B Q4 with 16K context entirely on one card, and enough bandwidth to keep the math units fed for typical batched decode. It runs the full CUDA + cuDNN + TensorRT-LLM stack, and the major production serving frameworks (vLLM, SGLang, TensorRT-LLM) support it. Power draw caps at 350 W (vs 700 W on an H100 SXM), so per-card thermal density in a 4U chassis is about half that of an SXM H100, which makes it a natural fit for dense inference racks. The PCIe Gen 4 x16 form factor (no NVLink) means it plugs into any modern server with adequate airflow: no SXM motherboard premium, no DGX. Pricing at $7,500–$8,500 retail is roughly 1/3 to 1/4 of an H100 PCIe, while the performance gap on inference (as opposed to training) workloads is closer to 1.5×–2× than 3×–4×. For most 70B-class deployments, that makes the L40S the better $/throughput. NVIDIA's vBIOS, ECC RAM, and 5-year warranty are real datacenter-grade differentiators vs the consumer 4090 or 5090 in production.
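To see why 70B Q4 with 16K context fits on one card, here is a rough footprint estimate. It is a sketch, not a measurement: the ~4.5 bits per weight for Q4 and the layer/head counts (a Llama-3-70B-like layout) are assumptions, and real runtimes add overhead for activations and buffers.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """KV cache in GB: one key and one value vector per layer per token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token_bytes / 1e9


VRAM_GB = 48

# Assumption: ~4.5 effective bits per weight for a typical Q4 quantization.
weights = weights_gb(70, 4.5)

# Assumption: Llama-3-70B-like layout (80 layers, 8 KV heads, head_dim 128),
# FP16 cache, 16K tokens of context.
cache = kv_cache_gb(16 * 1024, layers=80, kv_heads=8, head_dim=128)

total = weights + cache
print(f"weights {weights:.1f} GB + KV cache {cache:.1f} GB "
      f"= {total:.1f} GB of {VRAM_GB} GB")
```

Under these assumptions the total lands around 45 GB, which fits but leaves only a few GB of headroom. Longer contexts or more concurrent sequences eat into that quickly.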

Where it breaks

Memory bandwidth is the bottleneck, not compute. 864 GB/s is meaningfully below the H100 PCIe's 2 TB/s and the RTX 5090's 1.79 TB/s. Decode is memory-bound (and it is the dominant inference workload), so at batch size 1 an L40S will decode any model that fits on both cards more slowly than a 5090: compute is not the limiter, bandwidth is. Note that 70B Q4 does not fit in the 5090's 32 GB, so that head-to-head only applies to smaller models.
No NVLink. Tensor parallelism across 2× L40S has to traverse PCIe Gen 4 x16 (~32 GB/s per direction, theoretical). Two cards give 96 GB, which holds 70B at FP8 (~70 GB of weights) but not at FP16 (~140 GB of weights, which needs 4× cards). PCIe-only TP introduces ~10–20% overhead vs NVLink-equipped H100/H200 setups. Acceptable, but not free.
Training is the wrong workload here. Yes, it has FP8/BF16 throughput. No, you should not be picking L40S for training over an H200 or A100 at scale — training is bandwidth-and-NVLink-sensitive in ways inference isn't.
Limited consumer software paths. Ollama, LM Studio, llama.cpp all run fine, but the ergonomics are oriented around vLLM/SGLang/TensorRT-LLM. If you're a hobbyist running a single model, you're paying for ECC + datacenter cooling features you don't need.
Power and cooling requirements are real. The 350 W TDP needs a serious PSU, and the card is passively cooled, so it depends on server-chassis airflow. Not for a desktop tower without thoughtful cooling.
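The bandwidth argument above can be turned into a back-of-the-envelope ceiling: at batch size 1, every decoded token has to read all the weights once, so tokens per second cannot exceed bandwidth divided by model size. The sketch below uses the bandwidth figures quoted in this article; the 18 GB model size is a hypothetical (roughly a 32B model at ~4.5 bits per weight), chosen because it fits on all four cards.

```python
# Peak memory bandwidth in GB/s, as quoted in this article.
BANDWIDTH_GBPS = {
    "L40S": 864,
    "RTX 4090": 1000,
    "RTX 5090": 1790,
    "H100 PCIe": 2000,
}


def decode_ceiling_tok_s(bandwidth_gbps: float, model_gb: float) -> float:
    """Upper bound on batch-1 decode rate: each token reads all weights once."""
    return bandwidth_gbps / model_gb


# Hypothetical model size: ~32B parameters at ~4.5 bits per weight.
MODEL_GB = 18.0

for card, bw in BANDWIDTH_GBPS.items():
    print(f"{card:10s} <= {decode_ceiling_tok_s(bw, MODEL_GB):6.1f} tok/s")
```

This is an upper bound per stream, not a benchmark; real decode rates come in below it. Batched serving reuses each weight read across many requests, which is why aggregate throughput under continuous batching can exceed the single-stream ceiling.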

Ideal model range

Sweet spot: 70B Q4–Q5 single-card serving with 16K context at ~30–50 tok/s aggregate decode across 4–8 concurrent users via vLLM continuous batching. A single stream is bandwidth-bound at roughly 864 GB/s ÷ ~40 GB ≈ 20 tok/s, so the higher figure depends on batching. The everyday production sweet spot for "we run our own 70B."
Sweet spot: 32B-class production serving, with 32B at ~80–120 tok/s aggregate decode, 8–16 concurrent users, 32K context. Best $/req-served on this card class.
Sweet spot: 13B–20B-class high-throughput serving — 200+ concurrent users at sub-100ms TTFT.
Stretch: 70B FP8 across 2× L40S (96 GB total) via tensor parallelism over PCIe. Works, with a ~10–20% TP penalty vs H100 NVLink. 70B FP16 (~140 GB of weights) does not fit in 96 GB and needs 4× cards.
Comfortable: Embedding models, classifiers, smaller LMs at very high batch — the L40S is essentially compute-bound here.
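For multi-card configurations, a quick sanity check is to compare weight size against pooled VRAM. The sketch below is a rough estimate under stated assumptions: it reserves 10% of each card for KV cache and runtime overhead (an arbitrary illustrative figure) and rounds up to a power-of-two card count, which is the usual constraint for tensor parallelism.

```python
import math

CARD_GB = 48
USABLE_FRACTION = 0.9  # assumption: ~10% reserved for KV cache and runtime


def min_cards(params_billion: float, bits_per_weight: float) -> int:
    """Smallest power-of-two card count whose usable VRAM holds the weights."""
    weights_gb = params_billion * bits_per_weight / 8
    needed = math.ceil(weights_gb / (CARD_GB * USABLE_FRACTION))
    return 1 << (needed - 1).bit_length()  # round up to 1, 2, 4, 8, ...


for label, bits in [("FP16", 16), ("FP8", 8), ("Q4 (~4.5 bpw)", 4.5)]:
    print(f"70B {label}: {min_cards(70, bits)} card(s)")
```

With these assumptions, 70B needs one card at Q4, two at FP8, and four at FP16. Production serving usually wants more KV-cache headroom than 10%, so treat the result as a lower bound.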

Bad use cases

Single-user hobby workloads. A used 3090 or 4090 is ~1/4 the price for similar single-user performance on most workloads. ECC + 5-year warranty + vBIOS is wasted on a single-developer rig.
Frontier-model training. Pick H200 (141 GB) or rent B200 at scale.
Anywhere bandwidth dominates. Long-context decode on huge prompts is a 2 TB/s-plus card's job, not an L40S's.
Buying retail at MSRP for one-off use. L40S in cloud rental ($1.50–$2.50/hr on Runpod/Lambda) makes more sense for intermittent workloads than a $7,500 cap-ex.

Verdict

Buy this if you're standing up production inference for 70B Q4, 32B at higher precision (FP8 on one card; FP16 needs two), or multi-tenant 13B serving in your own datacenter or colo, you need ECC + datacenter warranty + dense rack thermals, your serving stack is vLLM/SGLang/TensorRT-LLM, and you've calculated $/throughput against H100 PCIe and concluded L40S wins. This is the canonical "production-grade self-hosted inference at SMB scale" GPU.

Skip this if you're a hobbyist or single-user developer (a 4090/5090 gives dramatically better performance per dollar), you need heavy long-context throughput (H200 or rent B200), you're training (wrong tool), or you want the lowest total cost of ownership for intermittent workloads (rent on Runpod or Lambda instead). For most readers Googling "L40S vs 4090 for local AI," the right answer is: 4090 for hobbyist, L40S for production multi-tenant, rent for everything in between.
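The $/throughput comparison against the H100 PCIe mentioned in the verdict can be sketched in a few lines. The inputs are assumptions: midpoints of the ranges quoted in this article (H100 PCIe at ~3.5× the price and ~1.75× the inference throughput of an L40S), not measured numbers. Replace them with your own quotes and benchmarks.

```python
from dataclasses import dataclass


@dataclass
class Card:
    name: str
    price_usd: float
    relative_throughput: float  # inference throughput, L40S = 1.0

    @property
    def usd_per_throughput(self) -> float:
        """Purchase price per unit of relative throughput (lower is better)."""
        return self.price_usd / self.relative_throughput


l40s = Card("L40S", 8_000, 1.0)

# Assumed midpoints of the article's ranges; not benchmark results.
h100 = Card("H100 PCIe", 8_000 * 3.5, 1.75)

for card in (l40s, h100):
    print(f"{card.name:10s} ${card.usd_per_throughput:,.0f} per unit of throughput")
```

With these midpoints the L40S comes out at about half the cost per unit of throughput. The conclusion flips only if your workload sits at the far end of the ranges, for example a bandwidth-heavy job where the H100 is much more than 2× faster.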

How it compares

vs RTX 4090 (24 GB) → 4090 has ~1.17× the bandwidth (~1 TB/s) and roughly equivalent FP16 perf at roughly a quarter of the price. L40S has 2× memory + ECC + datacenter warranty + SR-IOV. Pick 4090 for hobby and dev rigs; pick L40S for production. See /compare/nvidia-l40s-vs-rtx-4090.
vs H100 PCIe (80 GB) → H100 wins on bandwidth (2 TB/s vs 864 GB/s), memory ceiling (80 GB vs 48 GB), and NVLink for multi-card. L40S wins on $/card (about 1/3 the price). On power, the L40S is half the TDP of the 700 W H100 SXM, but the H100 PCIe is itself rated at about 350 W, so there is no power advantage against that SKU. Pick H100 for frontier/long-context; pick L40S for 70B-class production serving where you'd never use the H100's extra headroom. See /compare/nvidia-l40s-vs-nvidia-h100-pcie.
vs RTX 6000 Ada (48 GB) → Same memory (48 GB), similar bandwidth band, broadly equivalent inference perf. L40S is the datacenter SKU; RTX 6000 Ada is the workstation SKU. Pick RTX 6000 Ada for under-the-desk workstation use; pick L40S for rack deployment.
vs renting on Runpod / Lambda → L40S rents for ~$1.50–$2.50/hr on most providers. At ~$8,000 cap-ex, breakeven vs always-on rental is ~3,200–5,300 hours, or about 4.4–7.3 months of 24×7 utilization (card cost only, before power and hosting). If your workload is intermittent (<50% utilization), rent. If it's steady-state production, buy.
vs DGX Spark → Different markets entirely. DGX Spark is a desk-side dev box with an Arm-based Grace CPU and unified CPU-GPU memory, targeting 200B+ MoE local development. L40S is a rack inference card for production serving. Don't confuse them.
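The rent-versus-buy break-even in the rental comparison above is simple enough to recompute for your own numbers. This is a minimal sketch: it counts only the card's purchase price (no power, hosting, or resale value), and the 730-hour month is an averaging assumption.

```python
HOURS_PER_MONTH = 730  # average month: 8,760 h / 12


def breakeven_hours(capex_usd: float, rental_usd_per_hour: float) -> float:
    """GPU-hours of rental that cost as much as buying the card outright."""
    return capex_usd / rental_usd_per_hour


def breakeven_months(capex_usd: float, rental_usd_per_hour: float,
                     utilization: float = 1.0) -> float:
    """Calendar months to break even at a given utilization (1.0 = 24x7)."""
    hours = breakeven_hours(capex_usd, rental_usd_per_hour)
    return hours / (HOURS_PER_MONTH * utilization)


CAPEX = 8_000  # card only; ignores power, hosting, and resale (assumption)

for rate in (1.50, 2.50):
    print(f"${rate:.2f}/hr: {breakeven_hours(CAPEX, rate):,.0f} h, "
          f"{breakeven_months(CAPEX, rate):.1f} months at 24x7, "
          f"{breakeven_months(CAPEX, rate, utilization=0.5):.1f} months at 50%")
```

At 50% utilization the break-even period doubles, which is the arithmetic behind "rent if intermittent, buy if steady-state."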

Specs

VRAM	48 GB
Power draw (peak)	350 W
Released	2023
MSRP	$8,500
Backends	CUDA