RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts: there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend.

© 2026 runlocalai.co · Independently operated
UNIT · NVIDIA · GPU
48 GB VRAM · workstation · Reviewed June 2026

NVIDIA L40S


Ada-gen datacenter card. 48GB GDDR6 — popular at cloud GPU rentals as a budget H100 alternative.

Released 2023 · 864 GB/s memory bandwidth

Affiliate disclosure: as an Amazon Associate and partner of other retailers, we earn from qualifying purchases. The verdict on this page is our editorial opinion; affiliate links never influence what we recommend.

RUNLOCALAI SCORE
500 / 1000 · BB-tier · Estimated

Throughput: 301 / 500
VRAM-fit: 190 / 200
Ecosystem: 200 / 200
Efficiency: 24 / 100

Sub-scores sum to 715 / 1000. Headline = 715 × 0.70 (Estimated-confidence discount) = 500.5, displayed as 500. This is an algorithmic performance-tier score, distinct from (and often lower than) the editorial “Our verdict” below, which weighs value and real-world fit, especially for hardware we haven’t measured yet.
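
The headline arithmetic can be reproduced in a few lines. This is a sketch only: the dictionary keys, the function name, and the choice to floor the discounted total are assumptions made for illustration, not the site's actual scoring code.

```python
# Sub-scores as listed for the L40S on this page.
SUB_SCORES = {"throughput": 301, "vram_fit": 190, "ecosystem": 200, "efficiency": 24}

# Estimated-confidence discount (0.70), kept as an integer percentage
# so the arithmetic stays exact.
ESTIMATED_DISCOUNT_PCT = 70


def headline_score(sub_scores: dict[str, int], discount_pct: int) -> tuple[int, int]:
    """Return (raw sum, discounted headline). Flooring is an assumption."""
    raw = sum(sub_scores.values())
    return raw, raw * discount_pct // 100


raw, headline = headline_score(SUB_SCORES, ESTIMATED_DISCOUNT_PCT)
print(f"raw={raw} headline={headline}")  # raw=715 headline=500
```

Integer percentages avoid floating-point noise around the .5 boundary; a different rounding rule would only matter when the discounted total lands exactly on a half point, as it does here.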

Extrapolated from 864 GB/s bandwidth — 103.7 tok/s estimated. No measured benchmarks yet.

WORKLOAD FIT

Plain-English: Runs 70B with care — snappy enough for a coding agent; vision models supported.

7B chat: ✓ Comfortable
14B chat: ✓ Comfortable
32B chat: ✓ Comfortable
70B chat: ~ Tight
Coding agent: ✓ Comfortable
Vision (≤8B VLM): ✓ Comfortable
Long context (32K): ✓ Comfortable

Legend:
✓ Comfortable: fits with headroom
~ Tight: works, no slack
△ Marginal: needs aggressive quant
✗ Doesn't fit usefully

Verdicts extrapolated from catalog VRAM + bandwidth + ecosystem flags. Want measured numbers? Submit your own run with runlocalai-bench --submit.

BLK · VERDICT

Our verdict

OP · Fredoline Eruo|VERIFIED JUN 12, 2026
10.0/10

What it does well

The L40S is the cleanest "production inference at moderate scale" GPU NVIDIA sells. 48 GB of GDDR6 with ECC at 864 GB/s is enough memory to fit 70B Q4 with 16K context entirely on one card, though with little slack, and enough bandwidth to keep the math units fed for typical decode. It runs the full CUDA + cuDNN + TensorRT-LLM stack, and the major production serving frameworks (vLLM, SGLang, TensorRT-LLM) all support it.

Power draw caps at 350 W, versus up to 700 W on an H100 SXM (the H100 PCIe is also a 350 W part), so per-card thermal density in a 4U chassis is about half that of an SXM H100, which is why hyperscalers use it in dense inference clusters. The PCIe Gen 4 x16 form factor (no NVLink) means you just plug it into any modern server: no SXM motherboard premium, no cooling headache, no DGX.

Pricing at $7,500–$8,500 retail is roughly 1/3 to 1/4 of an H100 PCIe, while the gap on inference (as opposed to training) workloads is closer to 1.5×–2× than 3×–4×. For most 70B-class deployments, that makes the L40S the better $/throughput. NVIDIA's vBIOS + ECC RAM + 5-year warranty are real datacenter-grade differentiators vs the consumer 4090 or 5090 in production.
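
A back-of-envelope check of the "70B Q4 with 16K context on one card" claim might look like the sketch below. The architecture numbers (80 layers, 8 KV heads, head dimension 128) are those of a Llama-3-class 70B; the ~4.5 bits per weight for Q4 and the FP16 KV cache are assumptions, and real runtimes add activation and framework overhead on top.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """KV cache size: keys and values for every layer and every token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token_bytes / 1e9


VRAM_GB = 48.0
w = weights_gb(70, 4.5)  # assumed ~4.5 bits/weight for a Q4 quant
kv = kv_cache_gb(16 * 1024, layers=80, kv_heads=8, head_dim=128)
total = w + kv
print(f"weights={w:.1f} GB  kv={kv:.1f} GB  total={total:.1f} GB  "
      f"headroom={VRAM_GB - total:.1f} GB")
```

Under these assumptions the card is left with only a few GB of headroom, which is consistent with the "Tight" rating for 70B chat in the workload-fit table.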

Where it breaks

  • Memory bandwidth is the bottleneck, not compute. 864 GB/s is meaningfully below the H100 PCIe's 2 TB/s and the RTX 5090's 1.79 TB/s. For memory-bound decode (the dominant inference workload), an L40S running a model that fits on both cards at single-batch will be slower than a 5090 doing the same: the L40S has plenty of FP8 compute but less bandwidth to feed it. (70B Q4 itself does not fit in the 5090's 32 GB.)
  • No NVLink. Tensor parallelism across 2× L40S has to traverse PCIe Gen 4 x16 (~32 GB/s per direction). 70B at FP8 (~70 GB of weights) needs 2× cards; 70B at FP16 (~140 GB of weights) does not fit in 2× 48 GB and needs 4×. PCIe-only TP introduces ~10–20% overhead vs NVLink-equipped H100/H200 setups. Acceptable, but not free.
  • Training is the wrong workload here. Yes, it has FP8/BF16 throughput. No, you should not be picking L40S for training over an H200 or A100 at scale — training is bandwidth-and-NVLink-sensitive in ways inference isn't.
  • Limited consumer software paths. Ollama, LM Studio, llama.cpp all run fine, but the ergonomics are oriented around vLLM/SGLang/TensorRT-LLM. If you're a hobbyist running a single model, you're paying for ECC + datacenter cooling features you don't need.
  • Power requirements are real. 350 W TDP needs a serious PSU and case airflow. Not for a desktop tower without thoughtful cooling.
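
The bandwidth point in the list above can be made concrete with the usual memory-bound approximation: each generated token has to read roughly all the weights once, so single-stream decode speed is bounded by bandwidth divided by weight bytes. The sketch below applies that bound. The 20 GB model size is a hypothetical chosen so the model fits on every card, and the bound ignores KV-cache reads and kernel efficiency, so real numbers land lower. Batched serving reads the weights once per step for all requests, so aggregate throughput can exceed this single-stream ceiling.

```python
# Memory bandwidth figures quoted in this review, in GB/s.
BANDWIDTH_GBPS = {"L40S": 864, "RTX 5090": 1790, "H100 PCIe": 2000}


def decode_ceiling_tok_s(bandwidth_gbps: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode: one full weight read per token."""
    return bandwidth_gbps / weights_gb


MODEL_GB = 20.0  # hypothetical model size that fits on all three cards
for card, bw in BANDWIDTH_GBPS.items():
    print(f"{card}: <= {decode_ceiling_tok_s(bw, MODEL_GB):.1f} tok/s")
```

The ratio between cards is what matters here, not the absolute numbers: whatever the model, the L40S ceiling sits at a bit under half of the other two.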

Ideal model range

  • Sweet spot: 70B Q4–Q5 single-card serving with 16K context at ~30–50 tok/s aggregate decode across 4–8 concurrent users via vLLM continuous batching. The everyday production sweet spot for "we run our own 70B."
  • Sweet spot: 32B-class production serving: 32B at ~80–120 tok/s aggregate decode, 8–16 concurrent users, 32K context. Best $/req-served on this card class.
  • Sweet spot: 13B–20B-class high-throughput serving — 200+ concurrent users at sub-100ms TTFT.
  • Stretch: 70B FP8 across 2× L40S (96 GB total) via tensor parallelism over PCIe. Works, ~10–20% TP penalty vs H100 NVLink. 70B FP16 (~140 GB of weights) does not fit in 96 GB and would need 4× cards.
  • Comfortable: Embedding models, classifiers, smaller LMs at very high batch — the L40S is essentially compute-bound here.

Bad use cases

  • Single-user hobby workloads. A used 3090 or 4090 is ~1/4 the price for similar single-user performance on most workloads. ECC + 5-year warranty + vBIOS is wasted on a single-developer rig.
  • Frontier-model training. Pick H200 (141 GB) or rent B200 at scale.
  • Anywhere bandwidth dominates. Long-context decode on huge prompts is a 2 TB/s-plus card's job, not an L40S's.
  • Buying retail at MSRP for one-off use. L40S in cloud rental ($1.50–$2.50/hr on Runpod/Lambda) makes more sense for intermittent workloads than a $7,500 cap-ex.

Verdict

Buy this if you're standing up production inference for 70B Q4, 32B at 8-bit precision, or multi-tenant 13B serving in your own datacenter or colo; you need ECC, a datacenter warranty, and dense rack thermals; your serving stack is vLLM/SGLang/TensorRT-LLM; and you've calculated $/throughput against H100 PCIe and concluded the L40S wins. This is the canonical "production-grade self-hosted inference at SMB scale" GPU.

Skip this if you're a hobbyist or single-user developer (a 4090/5090 is dramatically better performance per dollar), you need long-context heavy throughput (H200 or rent B200), you're training (wrong tool), or you want the lowest total cost of ownership for intermittent workloads (rent on Runpod or Lambda instead). For most readers Googling "L40S vs 4090 for local AI," the right answer is: 4090 for hobbyist, L40S for production multi-tenant, rent for everything in between.

How it compares

  • vs RTX 4090 (24 GB) → 4090 has ~1.16× bandwidth (1 TB/s) and roughly equivalent FP16 perf at a fraction of the price. L40S has 2× memory + ECC + datacenter warranty + SR-IOV. Pick 4090 for hobby and dev rigs; pick L40S for production. See /compare/nvidia-l40s-vs-rtx-4090.
  • vs H100 PCIe (80 GB) → H100 wins on bandwidth (2 TB/s vs 864 GB/s), memory ceiling (80 GB vs 48 GB), and NVLink for multi-card. L40S wins on $/card (about 1/3 the price). Power is not a differentiator against the PCIe card, which is also rated at 350 W; the 2× TDP gap applies to the 700 W H100 SXM. Pick H100 for frontier/long-context; pick L40S for 70B-class production serving where you'd never use the H100's extra headroom. See /compare/nvidia-l40s-vs-nvidia-h100-pcie.
  • vs RTX 6000 Ada (48 GB) → Same memory (48 GB), similar bandwidth band, broadly equivalent inference perf. L40S is the datacenter SKU; RTX 6000 Ada is the workstation SKU. Pick RTX 6000 Ada for under-the-desk workstation use; pick L40S for rack deployment.
  • vs renting on Runpod / Lambda → L40S rents for ~$1.50–$2.50/hr on most providers. At ~$8,000 cap-ex, breakeven vs always-on rental is ~3,200–5,300 hours, or roughly 4.4–7.3 months of 24×7 utilization (before power and hosting costs). If your workload is intermittent (<50% utilization), rent. If it's steady-state production, buy.
  • vs DGX Spark → Different markets entirely. DGX Spark is a desk-side dev box with an ARM CPU and 128 GB of unified memory, aimed at local development on models up to ~200B parameters. L40S is a rack inference card for production serving. Don't confuse them.
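
The rent-versus-buy breakeven in the list above is simple division, sketched here. The figures are the ones quoted in this review; the function name is illustrative, and the calculation deliberately ignores power, hosting, and resale value, all of which shift the real breakeven.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12


def breakeven_hours(capex_usd: float, rental_usd_per_hr: float) -> float:
    """Hours of rental whose cost equals the purchase price."""
    return capex_usd / rental_usd_per_hr


CAPEX_USD = 8_000
for rate in (2.50, 1.50):
    hours = breakeven_hours(CAPEX_USD, rate)
    months_full = hours / HOURS_PER_MONTH
    print(f"${rate:.2f}/hr: {hours:,.0f} h = {months_full:.1f} months at 24x7, "
          f"{months_full / 0.5:.1f} months at 50% utilization")
```

Halving utilization doubles the calendar time to breakeven, which is the reasoning behind the "rent if intermittent" rule of thumb.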

BLK · SPECS

Specs

VRAM: 48 GB
Power draw (peak): 350 W
Released: 2023
MSRP: $8,500
Backends: CUDA

Models that fit

Open-weight models small enough to run on NVIDIA L40S with usable context.

  • all-MiniLM-L6-v2 (0.022B · other)
  • FLUX.1 [dev] (12B · other)
  • Qwen 3 0.6B (0.6B · qwen)
  • BGE Large EN v1.5 (0.335B · other)
  • Nomic Embed Text v1.5 (0.137B · other)
  • Kokoro 82M (0.082B · other)
  • Llama 3.1 8B Instruct (8B · llama)
  • XTTS v2 (0.46B · other)

Frequently asked

What models can NVIDIA L40S run?

With 48GB VRAM, the NVIDIA L40S runs 70B models in 4-bit quantization, plus everything smaller. See the "Models that fit" list above for models that fit with usable context.

Does NVIDIA L40S support CUDA?

Yes — NVIDIA L40S is an NVIDIA card with full CUDA support, the most mature local-AI backend. llama.cpp, Ollama, vLLM, and ExLlamaV2 all run natively.


Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify hardware specifications.

Compare alternatives

Hardware worth comparing

The closest alternatives by price, memory bandwidth, and form factor, plus a step up and down — so you can frame the buying decision against real options.

Closest matches
Similar price, bandwidth & form factor
  • NVIDIA L40
    nvidia · 48 GB VRAM
    10.0/10
  • AMD Instinct MI210
    amd · 64 GB VRAM
    9.8/10
  • NVIDIA RTX 6000 Ada Generation
    nvidia · 48 GB VRAM
    10.0/10
  • NVIDIA RTX 5000 PRO Blackwell 48GB
    nvidia · 48 GB VRAM
    8.5/10
  • NVIDIA A40
    nvidia · 48 GB VRAM
    9.7/10
  • Intel Arc Pro B60 24GB
    intel · 24 GB VRAM
    7.6/10
Step up
More capable — more memory or a higher tier
  • AMD Instinct MI210
    amd · 64 GB VRAM
    9.8/10
  • Intel Gaudi 2
    intel · 96 GB VRAM
    7.9/10
  • NVIDIA RTX PRO 6000 Blackwell
    nvidia · 96 GB VRAM
    10.0/10
Step down
Lighter — cheaper or more constrained
  • NVIDIA RTX A6000 (Ampere)
    nvidia · 48 GB VRAM
    9.7/10
  • Intel Arc Pro B60 24GB
    intel · 24 GB VRAM
    7.6/10
  • Apple Mac Studio (M3 Ultra)
    apple · 800 GB/s
    10.0/10