Llama 4 405B

llama · 405B parameters · Commercial OK · Reviewed May 2026

Meta's dense flagship in the Llama 4 line. 405B params; comparable footprint to Llama 3.1 405B with the Llama 4 reasoning improvements.

License: Llama 4 Community License · Released Feb 10, 2026 · Context: 131,072 tokens

How to run it

Llama 4 405B is Meta's largest dense model: 405B parameters, 231 GB on disk at Q4_K_M, ~405 GB at FP8 (~810 GB at FP16). A single-GPU path does not exist — the smallest config that loads is 4× A100 80GB at Q4_K_M (320 GB pool) for batch=1 at 4K context. Recommended: 8× H100 SXM at FP8 with vLLM tensor-parallel=8.

Throughput: ~8-15 tok/s per user at batch=1 on 8× H100 at FP8; ~5-10 tok/s at batch=1 for Q4_K_M on 4× A100. KV cache at 8K context adds ~25-35 GB.

Llama 4 405B uses the standard LLaMA architecture, so ecosystem support is broad. llama.cpp supports it with row-split across GPUs via CUDA_VISIBLE_DEVICES, and Ollama's default tag should use Q4_K_M. A Mac Studio M4 Ultra 192 GB at Q2_K (~116 GB) is theoretically loadable at 2-4 tok/s. Not recommended for interactive use. Cloud rental: 4× H100 at ~$25-40/hr.
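
A launch sketch for the recommended 8× H100 path, assuming the weights ship under the meta-llama/Llama-4-405B repo listed below. vLLM flag names move between releases, so verify with vllm serve --help on your install:

    # vLLM OpenAI-compatible server: FP8 weights, tensor-parallel across all 8 GPUs
    vllm serve meta-llama/Llama-4-405B \
        --tensor-parallel-size 8 \
        --quantization fp8 \
        --max-model-len 8192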

Hardware guidance

Minimum: 4× A100 80GB at Q4_K_M (231 GB weights + ~12 GB KV at 4K = 243 GB, fits the 320 GB pool). Recommended: 8× H100 SXM at FP8 with NVLink for tensor-parallel communication.

VRAM math: dense 405B at Q4_K_M is ~0.57 bytes/param → ~231 GB of weights. KV cache runs roughly 3 MB per token of context summed across all 128 layers, so 8K context adds ~25 GB. Total minimum VRAM: ~256 GB for Q4 at 8K, batch=1. A worked check follows below.

4× A100 80GB = 320 GB — comfortable with headroom. 4× RTX A6000 48GB = 192 GB — insufficient for Q4_K_M; you must drop to Q2_K (116 GB) with severe quality loss. Mac Studio M4 Ultra 192 GB handles Q2_K only. There is no single-consumer-GPU option.
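
To make that arithmetic reproducible, here is a small bash sketch of the same back-of-envelope estimate. The 0.57 bytes/param and 3 MB/token figures are the planning numbers above, not measurements:

    # estimate weights + KV-cache VRAM for a dense model (bash + awk)
    vram_estimate() {
        local params_b=$1 bytes_per_param=$2 ctx_tokens=$3 mb_per_token=$4
        awk -v p="$params_b" -v b="$bytes_per_param" -v c="$ctx_tokens" -v m="$mb_per_token" 'BEGIN {
            w = p * b          # weights in GB: params (billions) x bytes/param
            kv = c * m / 1024  # KV cache in GB
            printf "weights %.0f GB + KV %.0f GB = %.0f GB\n", w, kv, w + kv
        }'
    }
    vram_estimate 405 0.57 8192 3   # -> weights 231 GB + KV 24 GB = 255 GB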

What breaks first

  1. Cross-GPU communication bottleneck. On non-NVLink setups (PCIe-only), tensor-parallel bandwidth becomes the bottleneck; MFU drops to 15-25% on 4× A100 without NVLink. Use NVLink-bridged pairs whenever possible.
  2. First-token latency. 405B dense with tensor-parallel incurs 5-15 seconds of time-to-first-token at 4K context on 4× A100. Not suitable for latency-sensitive applications without speculative decoding.
  3. Q2 quality cliff. Q2_K quantization on 405B is viable for loading, but quality degrades significantly on factual accuracy and complex reasoning. Benchmark your task before committing to Q2.
  4. Ollama default tag may use insufficient context. Verify Ollama's default context length for Llama 4 405B — some tags default to 2048. Override with /set parameter (see the sketch after this list).
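
Raising the context from inside an interactive Ollama session looks like this. The tag is a placeholder (check ollama list for what you actually pulled); /set parameter num_ctx is standard Ollama REPL syntax:

    ollama run llama4:405b            # hypothetical tag
    >>> /set parameter num_ctx 8192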

Runtime recommendation

vLLM with tensor-parallel=8 on H100 SXM for serving. llama.cpp with -ngl 999 --tensor-split for multi-GPU local use. SGLang as alternative if vLLM memory management causes OOM at long context. Avoid Ollama for multi-GPU — it delegates to llama.cpp but obscures tensor-split config. Avoid MLX-LM — Apple Silicon not viable at this scale.
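
A local-serving sketch for the 4× A100 llama.cpp path. The GGUF filename is a placeholder; -ngl, --tensor-split, and -c are stock llama.cpp server flags:

    # pin all four GPUs, offload every layer, split rows evenly
    CUDA_VISIBLE_DEVICES=0,1,2,3 llama-server \
        -m Llama-4-405B-Q4_K_M.gguf \
        -ngl 999 --tensor-split 1,1,1,1 \
        -c 8192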

Common beginner mistakes

  • Mistake: thinking dual RTX 4090 (48 GB) can run 405B. Fix: Q4 is 231 GB; even Q2_K is ~116 GB. Do the math: 48 GB is roughly 5× too small (see the check after this list).
  • Mistake: running at 128K context on Q4_K_M across 4× A100. Fix: KV cache at 128K is 300+ GB alone. 4K is the realistic starting point; 8K with headroom.
  • Mistake: using Ollama without checking --tensor-split. Fix: llama.cpp row-split requires explicit GPU assignment, and Ollama obscures this. Use the raw llama.cpp server for multi-GPU.
  • Mistake: expecting a fast first token on 405B. Fix: time-to-first-token at 4K context is 5-15 seconds on 4× A100. Speculative decoding with a 7B draft model cuts this significantly.
  • Mistake: renting GPUs without NVLink and expecting high throughput. Fix: without NVLink, MFU drops below 25%. Rent NVLink-bridged instances (A100 SXM, H100 SXM) if throughput matters.
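
The first mistake is worth one line of arithmetic. A throwaway check, using the Q4_K_M size quoted above:

    awk 'BEGIN { printf "%.1fx too small\n", 231 / 48 }'   # -> 4.8x too small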

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Family siblings (llama-4)
  • Llama 4 70B · 70B · Datacenter
  • Llama 4 Scout · 109B · Datacenter
  • Llama 4 Maverick · 400B · Frontier
  • Llama 4 405B · 405B · you are here

Strengths

  • Frontier-tier reasoning
  • Strong multilingual

Weaknesses

  • Multi-node cluster only
  • Llama Community License usage restrictions for very large companies

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization · File size · VRAM required
AWQ-INT4 · 230.0 GB · 280 GB

Get the model

HuggingFace

Original weights

huggingface.co/meta-llama/Llama-4-405B

Source repository — direct quantization required.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Llama 4 405B.

  • NVIDIA GB200 NVL72 · 13,824 GB · nvidia
  • AMD Instinct MI355X · 288 GB · amd

Frequently asked

What's the minimum VRAM to run Llama 4 405B?

280 GB of VRAM is enough to run Llama 4 405B at the AWQ-INT4 quantization (file size 230.0 GB). Higher-quality quantizations need more.

Can I use Llama 4 405B commercially?

Yes — Llama 4 405B ships under the Llama 4 Community License, which permits commercial use. Always read the license text before deployment.

What's the context length of Llama 4 405B?

Llama 4 405B supports a context window of 131,072 tokens (128K).

Source: huggingface.co/meta-llama/Llama-4-405B

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Related — keep moving

Compare hardware
  • Dual RTX 3090 vs RTX 5090 (48 GB vs 32 GB) →
  • RTX 3090 vs RTX 4090 →
Buyer guides
  • 16 GB vs 24 GB VRAM — what 70B-class models need →
  • Best GPU for local AI →
  • Best laptop for local AI →
  • Best Mac for local AI →
  • Best used GPU for local AI →
When it doesn't work
  • CUDA out of memory →
  • Ollama running slowly →
  • ROCm not detected →
  • Model keeps crashing →
Recommended hardware
  • NVIDIA GB200 NVL72 →
  • AMD Instinct MI355X →
Alternatives
Llama 4 Scout · Llama 4 70B · Llama 4 Maverick
Before you buy

Verify Llama 4 405B runs on your specific hardware before committing money.

  • Will it run on my hardware? →
  • Custom hardware comparison →
  • GPU recommender (4 questions) →
Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Same tier
Models in the same parameter band as this one
  • DeepSeek V4 Pro (1.6T MoE)
    deepseek · 1600B
    unrated
  • Qwen 3.5 235B-A17B (MoE)
    qwen · 397B
    unrated
  • Qwen 3 235B-A22B
    qwen · 235B
    unrated
  • DeepSeek V4 Flash (284B MoE)
    deepseek · 284B
    unrated
Step up
More capable — bigger memory footprint
No verdicted models in the next tier up yet.
Step down
Smaller — faster, runs on weaker hardware
  • Llama 3.3 70B Instruct
    llama · 70B
    9.1/10
  • DeepSeek R1 Distill Llama 70B
    deepseek · 70B
    9.0/10