NVIDIA L4
Inference-focused Ada datacenter card. Low-power 24GB suitable for 7B-14B serving.
Affiliate disclosure: as an Amazon Associate and partner of other retailers, we earn from qualifying purchases. The verdict on this page is our editorial opinion; affiliate links never influence what we recommend.
Sub-scores sum to 514 / 1000. Headline = 514 × 0.70 (Estimated-confidence discount) = 359.8, rounded to 360. This is an algorithmic performance-tier score, distinct from and often lower than the editorial “Our verdict” below, which weighs value and real-world fit (especially for hardware we haven’t measured yet).
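The headline arithmetic can be reproduced in a few lines. This is a minimal sketch: the function name `headline_score` is hypothetical, and rounding to the nearest integer is an assumption, since the page gives only the inputs and the result.

```python
def headline_score(subscore_sum: int, confidence_discount: float) -> int:
    """Apply the confidence discount to the summed sub-scores.

    Rounding to the nearest integer is an assumption; the page only
    states the inputs (514 and 0.70) and the result (360).
    """
    return round(subscore_sum * confidence_discount)


print(headline_score(514, 0.70))
```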
Extrapolated from 300 GB/s bandwidth — 36.0 tok/s estimated. No measured benchmarks yet.
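Bandwidth-based estimates of this kind usually rest on the observation that decode reads roughly the full set of weights once per generated token, so tokens per second is bounded by bandwidth divided by the bytes read per token. The sketch below shows that relationship. The function names and the `efficiency` parameter are hypothetical, and the site's actual extrapolation formula is not given on this page.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, read_per_token_gb: float,
                         efficiency: float = 1.0) -> float:
    """Upper bound on decode speed for a memory-bound model.

    efficiency is an assumed fraction of peak bandwidth actually achieved.
    """
    if read_per_token_gb <= 0:
        raise ValueError("read_per_token_gb must be positive")
    return bandwidth_gb_s * efficiency / read_per_token_gb


def implied_read_gb(bandwidth_gb_s: float, tok_s: float) -> float:
    """Invert the bound: GB read per token implied by a given decode speed."""
    if tok_s <= 0:
        raise ValueError("tok_s must be positive")
    return bandwidth_gb_s / tok_s


# What per-token read size would give 36 tok/s at 300 GB/s?
print(round(implied_read_gb(300, 36.0), 2))
```

Under the assumption of full bandwidth utilization, 36 tok/s at 300 GB/s corresponds to roughly 8.3 GB read per token.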
Plain-English: Workable at 32B, comfortable at 14B and below — coding agent feels deliberate; vision models supported.
Verdicts extrapolated from catalog VRAM + bandwidth + ecosystem flags. Want measured numbers? Submit your own run with runlocalai-bench --submit.
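A rough way to sanity-check a VRAM-based verdict is to compare approximate weight size against the card's 24 GB. The sketch below assumes a flat 4 bits per weight and counts weights only. Real 4-bit formats typically use somewhat more than 4 bits per weight, and KV cache and runtime overhead come on top, so treat the headroom figures as optimistic. The function name is hypothetical.

```python
VRAM_GB = 24  # from the spec table on this page


def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters x bits / 8 (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


for size_b in (7, 14, 32):
    weights = weight_footprint_gb(size_b, 4)  # assumption: flat 4 bits/weight
    headroom = VRAM_GB - weights
    print(f"{size_b}B @ 4-bit: ~{weights:.1f} GB weights, "
          f"~{headroom:.1f} GB left for KV cache and overhead")
```

Under these assumptions a 14B model leaves about 17 GB free and a 32B model about 8 GB, which matches the "comfortable" versus "workable" wording.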
What it does well
The L4 is NVIDIA's single-slot, low-power Ada-generation datacenter card and the right pick for rack-density inference deployments where watts per inference matter more than peak throughput. It pairs 24 GB of GDDR6 ECC memory at 300 GB/s with Ada Tensor Cores and the full CUDA datacenter stack, at $2,500 retail or $1,800-2,200 used. The 72 W TDP is the lowest of any 24 GB datacenter card by a wide margin and a fraction of what higher-tier Ada cards draw (L40S at 350 W, RTX 6000 Ada at 300 W). The single-slot half-height form factor lets you pack 8× L4 into a 2U server with the GPUs drawing under 600 W in total (8 × 72 W = 576 W). For workloads that serve many small models at high concurrency (embedding, classification, smaller LLMs at scale, video encoding/transcoding alongside inference), the L4 is the rack-density king. Cloud providers (Google Cloud, Lambda, smaller specialty providers) deploy the L4 for its cost-per-inference-per-watt advantage; rental at ~$0.50-$0.85/hr is the cheapest 24 GB GPU rental tier on most providers.
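The per-watt argument can be made concrete with the figures quoted on this page. This is a minimal sketch: the `Card` class and helper names are hypothetical, and the power numbers are GPU TDP only, so CPU, fans, and power-supply losses in a real 2U server are not included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    name: str
    vram_gb: float
    bandwidth_gb_s: float
    tdp_w: float


# Figures as quoted on this page.
L4 = Card("L4", vram_gb=24, bandwidth_gb_s=300, tdp_w=72)
L40S = Card("L40S", vram_gb=48, bandwidth_gb_s=864, tdp_w=350)


def rack_gpu_power_w(card: Card, count: int) -> float:
    """Total GPU TDP for `count` cards (excludes the rest of the server)."""
    return card.tdp_w * count


def per_watt(card: Card) -> dict:
    """VRAM and memory bandwidth delivered per watt of TDP."""
    return {
        "vram_gb_per_w": card.vram_gb / card.tdp_w,
        "bandwidth_gb_s_per_w": card.bandwidth_gb_s / card.tdp_w,
    }


print(rack_gpu_power_w(L4, 8))
for card in (L4, L40S):
    stats = per_watt(card)
    print(f"{card.name}: {stats['vram_gb_per_w']:.2f} GB/W, "
          f"{stats['bandwidth_gb_s_per_w']:.2f} GB/s per W")
```

On these numbers the L4 delivers more VRAM and more bandwidth per watt than the L40S, even though it is much slower per card.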
Where it breaks
- Bandwidth is the hard limiter. 300 GB/s is dramatically below RTX 4090's 1 TB/s and L40S's 864 GB/s, and only marginally above the consumer RTX 4060 Ti's 288 GB/s. For memory-bound LLM decode (the dominant workload), L4 is meaningfully slower than essentially every other 24 GB card.
- Compute ceiling vs higher-tier Ada cards. The 72 W power envelope caps tensor compute at ~120 TFLOPS FP16 — roughly 1/3 of L40S's ~360 TFLOPS. For compute-bound workloads, L4 is firmly value tier.
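- Wrong tier for primary 70B-class inference. A 70B model at Q4 needs roughly 35 GB for weights alone (70B parameters × 4 bits), so it does not fit in 24 GB without more aggressive quantization or partial CPU offload, and at 300 GB/s decode is single-digit tok/s at best. Use L4 for many smaller models, not one big one.
- Half-height single-slot form factor limits thermal options. The card is passively cooled and relies on server chassis airflow, so there is no consumer-case cooling path.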
- No display engine. Pure compute SKU; no consumer driver paths.
- Resale liquidity is thin in retail used market. Most L4s are in production datacenters, not consumer eBay channels.
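For memory-bound decode, the relative speed ceiling between two cards is roughly the ratio of their memory bandwidths. The sketch below computes that ratio against the L4, using only the bandwidth figures quoted in this section. The dictionary and function name are hypothetical, and the ratio is a ceiling estimate, not a measured result.

```python
# Memory bandwidth in GB/s, as quoted in this section.
BANDWIDTH_GB_S = {
    "L4": 300,
    "RTX 4060 Ti": 288,
    "L40S": 864,
    "RTX 4090": 1000,  # quoted as "1 TB/s"
}


def relative_decode_ceiling(card: str, baseline: str = "L4") -> float:
    """Bandwidth ratio vs. the baseline card (approximate decode-speed ratio)."""
    return BANDWIDTH_GB_S[card] / BANDWIDTH_GB_S[baseline]


for card in BANDWIDTH_GB_S:
    print(f"{card}: {relative_decode_ceiling(card):.2f}x the L4")
```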
Ideal model range
- Sweet spot: Embedding model serving at very high concurrency (1000+ users via batching). Embedding workloads are the canonical L4 fit.
- Sweet spot: Smaller LLM serving (sub-13B) at high concurrency — the per-watt economics dominate.
- Sweet spot: Multi-tenant inference where rack density and power efficiency matter more than peak per-request latency.
- Sweet spot: Video transcoding + AI workloads on the same card (Ada NVENC/NVDEC + tensor cores).
- Sweet spot: Edge inference deployments where 72 W power envelope is the constraint.
- Stretch: 70B-class single-card serving, which needs sub-4-bit quantization or partial offload to fit in 24 GB (functional but slow, single-digit tok/s decode).
- Bad fit: Any workload that prioritizes single-request latency or peak tok/s.
Bad use cases
- Single-user / hobby workloads. Wrong tier entirely. Pick consumer NVIDIA.
- Maximum tok/s on bigger models. L40S at 3× the price has 2.9× the bandwidth — pays for itself on most workloads.
- 70B as the primary use case. L4 runs 70B only with aggressive quantization or offload, and then at single-digit tok/s. Use L40S or higher tier.
- Anyone primarily decode-bound. Bandwidth ceiling kills decode speed. Pick higher-bandwidth tier.
- Cap-ex without rack-density requirements. If you don't need 8× cards in 2U, you're paying for a constraint that doesn't apply.
Verdict
Buy this if you operate rack-density inference deployments where W/inference dominates economics, your workload is embedding / classification / sub-13B serving at high concurrency, you need 8× 24 GB cards in a 2U server, and the modest per-card throughput is acceptable for the parallel scaling. L4 is the right pick for the "many small models at scale" segment.
Skip this if you need peak throughput (L40S at 3× the price wins on most metrics), your workload is primarily 70B+ (L40S or H100 PCIe wins), you run single-user / consumer workloads (consumer NVIDIA wins), or you don't have the rack-density use case (pay for a higher-bandwidth tier instead).
How it compares
- vs L40S (48 GB) → L40S has 2× memory + 2.9× bandwidth + ~3× compute at 3× the price (350 W TDP). For peak throughput on 70B-class workloads, L40S wins decisively. Pick L4 only when rack density + low power are genuinely the constraint. See /compare/nvidia-l4-vs-nvidia-l40s.
- vs RTX A5000 (24 GB) → Same 24 GB tier. A5000 has 2.5× the bandwidth + 2× the compute at about 3.2× the power draw (230 W vs 72 W). Pick A5000 for workstation; L4 for datacenter rack density.
- vs RTX 4090 (24 GB) → 4090 has 3.3× the bandwidth + dramatically more compute at 6× the power draw (450 W vs 72 W). Pick 4090 for hobbyist / desktop; L4 for datacenter rack-density only.
- vs RTX A4000 / RTX 4000 Ada → A4000 / 4000 Ada are workstation single-slot 16-20 GB cards. Different tier. Pick L4 specifically for datacenter rack form factor + 24 GB memory.
- vs renting on cloud → L4 rents at $0.50-$0.85/hr — the cheapest 24 GB GPU rental tier. Cap-ex breakeven is ~3,000-5,000 hours = 4-7 months of 24×7. Most workloads should rent.
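The rent-versus-buy breakeven quoted above follows from dividing the purchase price by the hourly rental rate. The sketch below reproduces it from the $2,500 MSRP and the $0.50-$0.85/hr range. It assumes a 30-day month of 24×7 use and ignores electricity, hosting, and resale value, none of which are given on this page. The function name is hypothetical.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, running 24x7


def breakeven_hours(purchase_usd: float, rental_usd_per_hr: float) -> float:
    """Hours of rental whose cost equals the purchase price."""
    if rental_usd_per_hr <= 0:
        raise ValueError("rental_usd_per_hr must be positive")
    return purchase_usd / rental_usd_per_hr


for rate in (0.50, 0.85):
    hours = breakeven_hours(2500, rate)
    months = hours / HOURS_PER_MONTH
    print(f"${rate:.2f}/hr: {hours:,.0f} hours, about {months:.1f} months of 24x7")
```

At the cheaper rate the breakeven is 5,000 hours (about 6.9 months); at the more expensive rate it is roughly 2,900 hours (about 4.1 months).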
Specs
| VRAM | 24 GB |
| Power draw (peak) | 72 W |
| Released | 2023 |
| MSRP | $2,500 |
| Backends | CUDA |
Models that fit
Open-weight models small enough to run on NVIDIA L4 with usable context.
Hardware worth comparing
The closest alternatives by price, memory bandwidth, and form factor, plus a step up and down — so you can frame the buying decision against real options.
- 8.2/10 · Framework Desktop (Ryzen AI Max+ 395) · AMD · 256 GB/s
- 8.7/10 · NVIDIA RTX A5000 · NVIDIA · 24 GB VRAM
- 7.6/10 · Intel Arc Pro B60 24GB · Intel · 24 GB VRAM
- 8.0/10 · GMKtec EVO-X2 (Ryzen AI Max+ 395) · AMD · 256 GB/s
- 7.5/10 · NVIDIA RTX PRO 4500 Blackwell · NVIDIA · 32 GB VRAM
- 8.1/10 · ASUS Ascent GX10 (NVIDIA GB10) · NVIDIA · 273 GB/s
Frequently asked
What models can NVIDIA L4 run?
Does NVIDIA L4 support CUDA?
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify hardware specifications.