RUNLOCALAIv38
NVIDIA · GPU · 48 GB VRAM · Workstation · Reviewed June 2026

NVIDIA L40


The original Ada-generation datacenter card: slower than the L40S, with 48 GB of GDDR6.

Released 2022 · 864 GB/s memory bandwidth

Affiliate disclosure: as an Amazon Associate and partner of other retailers, we earn from qualifying purchases. The verdict on this page is our editorial opinion; affiliate links never influence what we recommend.

RunLocalAI score: 503/1000 (BB-tier, Estimated)
  • Throughput: 301/500
  • VRAM fit: 190/200
  • Ecosystem: 200/200
  • Efficiency: 28/100

The sub-scores sum to 719/1000. Headline = 719 × 0.70 (Estimated-confidence discount) ≈ 503. This is an algorithmic performance-tier score, distinct from (and often lower than) the editorial "Our verdict" below, which weighs value and real-world fit, especially for hardware we haven't measured yet.
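The headline arithmetic can be reproduced in a few lines. This is a minimal Python sketch, assuming the headline is the plain sum of the four sub-scores times the 0.70 Estimated-confidence discount, rounded to the nearest integer; the rounding rule and all names below are our assumptions, not RunLocalAI's published code.

```python
# Sub-scores as shown on this page (maximums: 500 / 200 / 200 / 100).
SUB_SCORES = {
    "throughput": 301,
    "vram_fit": 190,
    "ecosystem": 200,
    "efficiency": 28,
}

# Discount applied when a card has only estimated (not measured) results.
ESTIMATED_DISCOUNT = 0.70


def headline_score(sub_scores: dict[str, int], discount: float) -> tuple[int, int]:
    """Return (raw sum, discounted headline). Rounding to nearest is an assumption."""
    raw = sum(sub_scores.values())
    return raw, round(raw * discount)


raw, headline = headline_score(SUB_SCORES, ESTIMATED_DISCOUNT)
print(f"raw={raw}/1000 headline={headline}/1000")
```

Because 719 × 0.70 = 503.3, floor and nearest rounding give the same 503 here, so the page's figure doesn't reveal which rule is used.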

Throughput is extrapolated from the 864 GB/s memory bandwidth: an estimated 103.7 tok/s. No measured benchmarks yet.
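Bandwidth-based estimates like this usually rest on one observation: single-stream decode reads roughly every weight once per generated token, so tokens/s is capped near bandwidth ÷ bytes read per token. The page doesn't say which model the 103.7 tok/s figure assumes, so this sketch only back-solves the implied bytes per token and then applies the same ceiling to a hypothetical ~42 GB 70B 4-bit file. It ignores KV-cache reads and kernel efficiency, so real numbers land below these ceilings.

```python
BANDWIDTH_GB_S = 864.0  # NVIDIA L40 memory bandwidth


def decode_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on single-stream decode speed for a bandwidth-bound model."""
    return bandwidth_gb_s / gb_read_per_token


def implied_gb_per_token(bandwidth_gb_s: float, tok_s: float) -> float:
    """Back-solve how many GB per token a tok/s estimate implies."""
    return bandwidth_gb_s / tok_s


implied = implied_gb_per_token(BANDWIDTH_GB_S, 103.7)
print(f"103.7 tok/s implies ~{implied:.2f} GB read per token")

# Hypothetical: a 70B model at ~4-bit is on the order of 42 GB of weights.
print(f"70B 4-bit ceiling: ~{decode_ceiling_tok_s(BANDWIDTH_GB_S, 42.0):.1f} tok/s")
```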

Workload fit

In plain English: runs 70B models with care, is snappy enough for a coding agent, and supports vision models.

  • 7B chat: ✓ Comfortable
  • 14B chat: ✓ Comfortable
  • 32B chat: ✓ Comfortable
  • 70B chat: ~ Tight
  • Coding agent: ✓ Comfortable
  • Vision (≤8B VLM): ✓ Comfortable
  • Long context (32K): ✓ Comfortable

Legend: ✓ Comfortable (fits with headroom) · ~ Tight (works, no slack) · △ Marginal (needs aggressive quant) · ✗ Doesn't fit usefully

Verdicts are extrapolated from catalog VRAM, bandwidth, and ecosystem flags. Want measured numbers? Submit your own run with runlocalai-bench --submit.
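The page doesn't publish its fit rules, but labels like these can be approximated from memory arithmetic alone. The sketch below is a hypothetical classifier, not RunLocalAI's method. It assumes ~4.8 bits per weight for a Q4_K_M-style quant, ~3 bits for an "aggressive" quant, a flat 4 GB for KV cache and runtime overhead, and a 20% headroom cutoff for "Comfortable". With those assumptions it reproduces the 70B "Tight" verdict above.

```python
VRAM_GB = 48.0  # NVIDIA L40

# All constants below are illustrative assumptions.
Q4_BITS = 4.8          # roughly a Q4_K_M-style quant
AGGRESSIVE_BITS = 3.0  # roughly a 3-bit quant
OVERHEAD_GB = 4.0      # KV cache + runtime buffers, flat estimate
HEADROOM = 0.20        # fraction of VRAM left free to count as "Comfortable"


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight footprint in GB (1e9 bytes): params x bits / 8."""
    return params_billion * bits_per_weight / 8


def fit_label(params_billion: float, vram_gb: float = VRAM_GB) -> str:
    """Map a model size to the page's four fit labels under the assumptions above."""
    need_q4 = weights_gb(params_billion, Q4_BITS) + OVERHEAD_GB
    if need_q4 <= (1 - HEADROOM) * vram_gb:
        return "Comfortable"
    if need_q4 <= vram_gb:
        return "Tight"
    if weights_gb(params_billion, AGGRESSIVE_BITS) + OVERHEAD_GB <= vram_gb:
        return "Marginal"
    return "Doesn't fit"


for size in (7, 14, 32, 70, 100, 405):
    print(f"{size}B: {fit_label(size)}")
```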

Our verdict

By Fredoline Eruo · Verified Jun 12, 2026
10.0/10

What it does well

The L40 is the L40S's value-tier sibling for production inference deployments where FP8 throughput isn't the limiting factor. It has the same 48 GB of GDDR6 with ECC at 864 GB/s, the same Ada-generation tensor core architecture, and the same PCIe Gen 4 x16 form factor, at ~$8,000 retail vs ~$8,500 for the L40S. The L40 launched first, aimed at visual computing; the L40S followed with tensor-core throughput tuned for AI inference, so the L40 is the less inference-tuned of the two. For 70B Q4 single-card inference, 32B production serving at 8-bit (32B at FP16 needs ~64 GB for the weights alone, more than the card has), or any inference workload that fits in 48 GB and doesn't depend heavily on FP8 throughput, the L40 delivers ~85–90% of L40S throughput at a slightly lower price. Datacenter-grade ECC, the 5-year warranty, vBIOS support for VM passthrough, and SR-IOV all work identically. Power is capped at a 300 W TDP, slightly below the L40S's 350 W, which helps in dense rack deployments where every watt counts.

Where it breaks

  • Lower FP8 throughput than L40S. NVIDIA tuned the L40S's tensor cores for higher throughput, aimed specifically at FP8 inference. On TRT-LLM or vLLM FP8 paths, expect the L40S to be ~10–15% faster. For BF16/FP16-only workloads the gap narrows considerably.
  • Pricing gap to L40S is small. A ~$500 difference buys ~10–15% more inference throughput on the L40S. Most production buyers should pay the modest premium for the L40S unless specifically constrained.
  • Architecture is one generation behind Blackwell. The RTX PRO 6000 Blackwell and other Blackwell-tier cards have native FP4 and the second-generation Transformer Engine; the L40 is firmly Ada-generation.
  • Limited consumer-facing software ergonomics. Like the L40S, this is a passively cooled datacenter SKU: display outputs are minimal or disabled by default, and there are no consumer driver paths or game tuning. Workstation buyers should pick the RTX 6000 Ada instead, at a similar price tier.
  • Resale liquidity is thin. L40 has lower transaction volume than L40S in secondary markets — exit pricing is harder to predict.
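Why a tensor-throughput gap matters less than it looks: at small batch sizes, each decode step is limited by reading the weights, not by math. This simplified roofline sketch (weights only; it ignores KV-cache traffic, attention FLOPs, and kernel efficiency) compares two hypothetical cards that share the L40's 864 GB/s but differ 2× in peak tensor TFLOPS. The TFLOPS values are placeholders, not datasheet figures; substitute real ones before drawing conclusions.

```python
def decode_step_s(params_b, bytes_per_param, batch, bandwidth_gb_s, peak_tflops):
    """Time for one decode step under a weights-only roofline model.

    Returns (seconds, limiting resource)."""
    weight_bytes = params_b * 1e9 * bytes_per_param
    t_memory = weight_bytes / (bandwidth_gb_s * 1e9)
    # ~2 FLOPs per parameter per token (one multiply + one add), times the batch.
    t_compute = 2 * params_b * 1e9 * batch / (peak_tflops * 1e12)
    if t_memory >= t_compute:
        return t_memory, "bandwidth"
    return t_compute, "compute"


BW = 864.0                     # shared by both hypothetical cards
SLOWER, FASTER = 300.0, 600.0  # placeholder peak TFLOPS, 2x apart

# 32B model at 1 byte per parameter (8-bit weights).
for batch in (1, 32, 512):
    t_slow, bound = decode_step_s(32, 1.0, batch, BW, SLOWER)
    t_fast, _ = decode_step_s(32, 1.0, batch, BW, FASTER)
    print(f"batch {batch:>3}: {bound}-bound, faster card {t_slow / t_fast:.2f}x")
```

Under these assumptions the 2× compute advantage is invisible until the batch is large enough that math, not memory reads, dominates each step.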

Ideal model range

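  • Sweet spot: 70B Q4 single-card serving (70B Q5 files are ~50 GB and don't fit): ~25–40 tok/s aggregate decode across 4–8 concurrent users via vLLM continuous batching, with roughly 16K tokens of total context shared between them. Single-stream decode is capped near 20 tok/s, since each token reads ~40 GB of weights at 864 GB/s.
  • Sweet spot: 32B-class production serving at 4- to 8-bit weights, at ~70–110 tok/s aggregate decode across 8–16 concurrent users. Single-stream decode is capped at roughly 25–45 tok/s, and the KV cache cannot hold a full 32K context for every user at once, so budget context per request.
  • Sweet spot: 13B–20B-class high-throughput serving: 200+ concurrent short-context users at sub-100 ms TTFT.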
  • Sweet spot: BF16/FP16 production where FP8 isn't the bottleneck — embeddings, classifiers, smaller LMs.
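  • Stretch: 70B at 8-bit (FP8/INT8) across 2× L40 with PCIe-only tensor parallelism, at a ~10–20% penalty versus a comparable NVLink setup. FP16 weights (~140 GB) won't fit in the pair's 96 GB.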
  • Comfortable: Anything an RTX 4080 does, but at 3× the memory ceiling and with ECC + datacenter pedigree.
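Concurrency on a 48 GB card is mostly a KV-cache budget. This is a minimal sketch of the standard per-token KV-cache formula, assuming a Llama-3-70B-style shape (80 layers, 8 KV heads via grouped-query attention, head dimension 128) and an FP16 cache. The ~5 GiB of free memory is a hypothetical figure; other models, runtimes, and cache quantization change the numbers.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """K and V tensors (factor 2) for every layer, per cached token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


GIB = 2**30
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)  # Llama-3-70B-style shape

context = 16_384
print(f"{per_token} bytes/token, {per_token * context / GIB:.1f} GiB for one 16K context")

# Hypothetical: ~5 GiB left after loading ~42 GB of 4-bit 70B weights.
free_gib = 5
users_at_4k = (free_gib * GIB) // (per_token * 4_096)
print(f"concurrent 4K-token sessions that fit: {users_at_4k}")
```

At this shape, one 16K context costs exactly 5 GiB, which is why 70B on a single L40 leaves room for only a few concurrent sessions.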

Bad use cases

  • Single-developer hobby workloads. RTX 4090 at 1/4 the price wins for everything that fits 24 GB.
  • Workstation tower deployment. Pick RTX 6000 Ada: same memory tier, with a more workstation-friendly thermal design (active cooling), display outputs, and NVIDIA's professional workstation drivers.
  • FP8-aggressive inference. Pay the modest premium for L40S if your workloads exploit FP8 throughput.
  • Frontier-model training. H200 or B200 is the right tier.
  • Memory-bound long-context decode. H100 PCIe at 2 TB/s wins for bandwidth-dominated workloads.

Verdict

Buy this if you find an L40 at a meaningfully lower price than the L40S (a >$500 discount, or ~$7,000 used), your production workloads are BF16/FP16 (not FP8-aggressive), and you're optimizing $/throughput on Ada-generation 48 GB inference. The L40 is the right pick for the cost-conscious buyer who has already chosen "datacenter Ada 48 GB" and wants the value variant.

Skip this if the L40S is available at a ~$500 premium (the L40S wins on FP8 throughput and is almost always worth it), you're deploying in a workstation (the RTX 6000 Ada is the workstation SKU at a similar price), you need Blackwell-generation features (RTX PRO 6000 Blackwell for workstations, B200 for datacenters), or you're cost-sensitive and a consumer card fits your models (RTX 4090).
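The buy/skip rule comes down to a price-per-throughput comparison. This is a minimal sketch using this page's own rough estimates (~$8,000 for the L40, ~$8,500 for the L40S, L40S ~10–15% faster); those are not price quotes, and the function names are ours.

```python
def dollars_per_unit(price_usd: float, relative_throughput: float) -> float:
    """Cost per unit of throughput, with the L40 defined as 1.0."""
    return price_usd / relative_throughput


def l40_breakeven_price(l40s_price_usd: float, l40s_speedup: float) -> float:
    """L40 price at which it matches the L40S on $/throughput."""
    return l40s_price_usd / l40s_speedup


L40_PRICE, L40S_PRICE = 8_000, 8_500

for speedup in (1.10, 1.15):
    print(
        f"L40S {speedup:.2f}x: L40S ${dollars_per_unit(L40S_PRICE, speedup):,.0f}/unit "
        f"vs L40 ${dollars_per_unit(L40_PRICE, 1.0):,.0f}/unit; "
        f"L40 breaks even at ${l40_breakeven_price(L40S_PRICE, speedup):,.0f}"
    )
```

At those figures the L40 only wins once its price drops below roughly $7,400 to $7,700, which is consistent with the verdict pointing at ~$7,000 used pricing.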

How it compares

  • vs L40S (48 GB) → Same architecture, same 48 GB, ~10–15% less FP8 throughput at ~$500 less. Pick L40S for FP8-aggressive workloads (almost always worth $500); L40 only when discount is meaningful or workloads are FP16/BF16 only. See /compare/nvidia-l40-vs-nvidia-l40s.
  • vs RTX 6000 Ada (48 GB) → Same memory tier, same architecture, similar bandwidth. The RTX 6000 Ada is the workstation SKU (professional drivers, display outputs, active cooling; unlike the Ampere RTX A6000, it has no NVLink). The L40 is the datacenter SKU (rack form factor, vBIOS, SR-IOV). Pick by deployment context. At $6,799 retail, the RTX 6000 Ada is also cheaper.
  • vs A40 (48 GB Ampere) → A40 is one architecture generation older with similar memory at ~$5,500 retail / $4,000–$4,500 used. Pick L40 for new builds with Ada-generation features (FP8 + better TC perf). Pick A40 for cost-conscious value buyers.
  • vs H100 PCIe (80 GB) → H100 PCIe wins on bandwidth (2 TB/s vs 864 GB/s), memory ceiling (80 GB vs 48 GB), Hopper-generation FP8 + Transformer Engine. L40 wins on cap-ex (1/3 the price). For 70B-class inference where 48 GB suffices, L40 is the value pick; for >48 GB or bandwidth-bound workloads, H100 PCIe.
  • vs RTX 4090 (24 GB) → 4090 has marginally higher bandwidth (1.0 TB/s) and similar Ada compute, at half the VRAM. Pick 4090 for hobbyist 24 GB; L40 when you need 48 GB + ECC + datacenter pedigree.

Specs

VRAM: 48 GB
Power draw (peak): 300 W
Released: 2022
MSRP: $8,000
Backends: CUDA

Models that fit

Open-weight models small enough to run on NVIDIA L40 with usable context.

  • all-MiniLM-L6-v2 (0.022B · other)
  • FLUX.1 [dev] (12B · other)
  • Qwen 3 0.6B (0.6B · qwen)
  • BGE Large EN v1.5 (0.335B · other)
  • Nomic Embed Text v1.5 (0.137B · other)
  • Kokoro 82M (0.082B · other)
  • Llama 3.1 8B Instruct (8B · llama)
  • XTTS v2 (0.46B · other)

Frequently asked

What models can NVIDIA L40 run?

With 48 GB of VRAM, the NVIDIA L40 runs 70B models in 4-bit quantization, plus everything smaller. See "Models that fit" above for specific examples.

Does NVIDIA L40 support CUDA?

Yes. The L40 is an NVIDIA card with full CUDA support, the most mature local-AI backend. llama.cpp, Ollama, vLLM, and ExLlamaV2 all run natively.

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify hardware specifications.

© 2026 runlocalai.co · Independently operated
Compare alternatives

Hardware worth comparing

The closest alternatives by price, memory bandwidth, and form factor, plus a step up and down — so you can frame the buying decision against real options.

Closest matches: similar price, bandwidth & form factor
  • NVIDIA L40S: 48 GB VRAM · 10.0/10
  • AMD Instinct MI210: 64 GB VRAM · 9.8/10
  • NVIDIA RTX 6000 Ada Generation: 48 GB VRAM · 10.0/10
  • NVIDIA A40: 48 GB VRAM · 9.7/10
  • NVIDIA RTX 5000 PRO Blackwell 48GB: 48 GB VRAM · 8.5/10
  • Intel Arc Pro B60 24GB: 24 GB VRAM · 7.6/10
Step up: more capable (more memory or a higher tier)
  • AMD Instinct MI210: 64 GB VRAM · 9.8/10
  • Intel Gaudi 2: 96 GB VRAM · 7.9/10
  • AMD Instinct MI250X: 128 GB VRAM · 9.7/10
Step down: lighter (cheaper or more constrained)
  • NVIDIA RTX A6000 (Ampere): 48 GB VRAM · 9.7/10
  • NVIDIA RTX 5000 Ada Generation: 32 GB VRAM · 9.5/10
  • Intel Arc Pro B60 24GB: 24 GB VRAM · 7.6/10