Llama 4 405B
Meta's dense flagship in the Llama 4 line. 405B params; comparable footprint to Llama 3.1 405B with the Llama 4 reasoning improvements.
Positioning
Llama 4 405B is Meta's dense flagship in the Llama 4 family, carrying 405 billion parameters and a 131,072-token context window. Released under the Llama 4 Community License, it is designed for frontier-tier serving on cluster hardware. As a dense model, every forward pass activates all 405B parameters, making inference compute-bound and memory-intensive — a deliberate trade-off for maximum per-token capability.
Strengths
Massive parameter count for dense reasoning. With 405B active parameters per token, Llama 4 405B dedicates its full capacity to every generation, avoiding the capacity dilution that can occur in mixture-of-experts architectures.
Very long native context. The 131,072-token context window supports extended document analysis, multi-turn conversations, and large-context retrieval tasks without architectural modifications.
Permissive commercial license. The Llama 4 Community License allows most commercial use, making this model viable for enterprise deployment where licensing restrictions are a concern.
Proven lineage. As a direct evolution of Llama 3.1 405B, it benefits from Meta's ongoing investment in reasoning improvements and community tooling.
Limitations
Extreme hardware requirements. At FP16, the model alone occupies 810 GB of GPU memory. Even at Q4_K_M (228 GB), fitting the model plus KV cache and framework overhead (30–50% additional) requires multiple high-memory GPUs — typically a cluster of A100 80GB or H100 units.
No community benchmarks available. We do not yet have independent, reproducible measurements for this model. Published vendor metrics should be treated as best-case; real-world throughput and quality may differ.
Dense architecture is compute-heavy. Unlike MoE models that activate only a fraction of parameters per token, Llama 4 405B uses all 405B parameters at every step, increasing both memory bandwidth and compute demands relative to an MoE of similar total size.
License scope. The Llama 4 Community License, while permissive for many use cases, may impose restrictions on certain high-volume or competitive applications. Operators should review the full license text before deployment.
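To put a rough number on the dense-versus-MoE compute gap described above, the sketch below uses the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token. The 17B active-parameter figure for the MoE side is an illustrative assumption, not a measured value for any specific model, and the estimate ignores attention cost over long contexts.

```python
# Rule of thumb: forward-pass FLOPs per token ~= 2 * active parameters.
# This is an approximation that ignores attention-over-context cost.

def forward_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs needed to produce one token."""
    return 2.0 * active_params

DENSE_ACTIVE = 405e9  # dense: every parameter participates in every token
MOE_ACTIVE = 17e9     # hypothetical MoE active-parameter count (assumption)

dense_flops = forward_flops_per_token(DENSE_ACTIVE)
moe_flops = forward_flops_per_token(MOE_ACTIVE)

print(f"dense: {dense_flops / 1e9:.0f} GFLOPs/token")
print(f"MoE:   {moe_flops / 1e9:.0f} GFLOPs/token")
print(f"ratio: {dense_flops / moe_flops:.1f}x")
```

Under these assumptions the dense model does roughly 24 times more arithmetic per token, which is the trade-off the Limitations list is pointing at.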
What it takes to run this locally
Quantized model sizes (disk):
- FP16: ~810 GB
- Q8_0: ~430 GB
- Q6_K: ~334.1 GB
- Q5_K_M: ~288.6 GB
- Q4_K_M: ~227.8 GB
- Q3_K_M: ~197.4 GB
- Q2_K: ~131.6 GB
Add 30–50% for KV cache and framework overhead at typical context lengths. This model is firmly in the datacenter deployment class. A single consumer or workstation GPU cannot accommodate it, even at the lowest quantizations. Running Llama 4 405B requires a multi-GPU cluster (e.g., 4–8× A100 80GB or H100) with high-bandwidth interconnects.
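The 30–50% overhead rule can be applied to each quantization to see how many 80 GB GPUs the total implies. This is a minimal sketch using the disk sizes listed above; the helper names are illustrative, and real deployments also lose some memory to fragmentation and per-GPU runtime buffers.

```python
import math

# Disk sizes in GB, copied from the quantization list above.
QUANT_SIZES_GB = {
    "FP16": 810.0, "Q8_0": 430.0, "Q6_K": 334.1, "Q5_K_M": 288.6,
    "Q4_K_M": 227.8, "Q3_K_M": 197.4, "Q2_K": 131.6,
}

def memory_range_gb(weights_gb, low=0.30, high=0.50):
    """Weights plus 30-50% for KV cache and framework overhead."""
    return weights_gb * (1 + low), weights_gb * (1 + high)

def gpus_needed(total_gb, gpu_gb=80.0):
    """Smallest number of GPUs whose pooled memory covers total_gb."""
    return math.ceil(total_gb / gpu_gb)

for name, size in QUANT_SIZES_GB.items():
    lo, hi = memory_range_gb(size)
    print(f"{name:7s} {size:6.1f} GB -> {lo:6.1f}-{hi:6.1f} GB total, "
          f"{gpus_needed(lo)}-{gpus_needed(hi)} x 80 GB GPUs")
```

Even Q2_K lands at three 80 GB GPUs under this rule, which is why no single-GPU configuration appears in the guidance.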
Should you run this locally?
Yes if: you have access to a multi-GPU datacenter cluster, need the highest possible per-token reasoning quality from a dense model, and can accommodate the power and cooling costs. The Llama 4 Community License supports most commercial deployments.
No if: you lack cluster-scale hardware, need low-latency single-GPU inference, or are exploring smaller quantizations for edge deployment. Consider smaller Llama 4 variants or MoE architectures that offer better throughput per watt.
Catalog cross-links
- Llama 3.1 405B
- Llama 4 17B MoE
- A100 80GB
- H100 SXM
How to run it
Llama 4 405B is Meta's largest dense model: 405B parameters, ~231 GB on disk at Q4_K_M and ~810 GB at FP16. There is no single-GPU path. The smallest configuration that loads is 4× A100 80GB at Q4_K_M (320 GB pool) for batch=1 at 4K context. Recommended: 8× H100 SXM at FP8 with vLLM tensor-parallel=8.
Expected throughput: ~8-15 tok/s per user at batch=1 on 8× H100 at FP8, and ~5-10 tok/s at batch=1 for Q4_K_M on 4× A100. KV cache at 8K context adds ~25-35 GB.
Llama 4 405B uses the standard LLaMA architecture, so ecosystem support is broad. llama.cpp can split it across GPUs in row-split mode (--split-mode row), with CUDA_VISIBLE_DEVICES selecting which GPUs are used. The Ollama default tag should be Q4_K_M; confirm before pulling.
A Mac Studio M4 Ultra with 192 GB can theoretically load Q2_K (~120 GB) at 2-4 tok/s, which is not recommended for interactive use. Cloud rental: 4× H100 at ~$25-40/hr.
Hardware guidance
Minimum: 4× A100 80GB at Q4_K_M (231 GB weights + ~12 GB KV at 4K = 243 GB, which fits the 320 GB pool).
Recommended: 8× H100 SXM at FP8 with NVLink for tensor-parallel communication.
VRAM math: dense 405B at Q4_K_M is ~0.57 bytes/param, or ~231 GB of weights. KV cache grows linearly with context length: ~12 GB at 4K and ~25 GB at 8K, which works out to roughly 3 MB per token summed over all 128 layers. Total minimum VRAM is therefore ~256 GB for Q4 at 8K, batch=1.
- 4× A100 80GB = 320 GB: comfortable, with headroom.
- 4× RTX A6000 48GB = 192 GB: insufficient for Q4_K_M; Q2_K (~116 GB) is required, with severe quality loss.
- Mac Studio M4 Ultra 128GB: Q2_K only.
- Single consumer GPU: no option.
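The VRAM arithmetic above can be reproduced in a few lines. This sketch takes the figures quoted in this section as given (0.57 bytes/param for Q4_K_M, ~12 GB of KV cache at 4K, ~25 GB at 8K); the function names are illustrative, and framework overhead is deliberately left out, so treat a "fits" result as a necessary condition, not a guarantee.

```python
PARAMS_B = 405              # parameters, in billions
Q4_BYTES_PER_PARAM = 0.57   # Q4_K_M figure quoted in this section
KV_GB = {4096: 12.0, 8192: 25.0}  # KV cache totals quoted in this section

def weights_gb(params_b: float, bytes_per_param: float) -> float:
    """Billions of params * bytes per param gives gigabytes directly."""
    return params_b * bytes_per_param

def fits(pool_gb: float, context: int) -> bool:
    """True if Q4_K_M weights plus KV cache fit in the pooled VRAM."""
    need = weights_gb(PARAMS_B, Q4_BYTES_PER_PARAM) + KV_GB[context]
    return need <= pool_gb

pools = {"4x A100 80GB": 4 * 80, "4x RTX A6000 48GB": 4 * 48}
for name, pool in pools.items():
    for ctx in (4096, 8192):
        verdict = "fits" if fits(pool, ctx) else "does not fit"
        print(f"{name} ({pool} GB) at {ctx} context: {verdict}")
```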
What breaks first
1. Cross-GPU communication bottleneck. On PCIe-only setups without NVLink, tensor-parallel bandwidth becomes the limiting factor. MFU drops to 15-25% on 4× A100 without NVLink. Use NVLink-bridged pairs whenever possible.
2. First-token latency. A dense 405B under tensor parallelism incurs a 5-15 second time-to-first-token at 4K context on 4× A100. That rules out latency-sensitive applications; speculative decoding speeds up generation after the first token but does not shorten prefill.
3. Q2 quality cliff. Q2_K quantization lets the 405B load, but quality degrades significantly on factual accuracy and complex reasoning. Benchmark your task before committing to Q2.
4. Ollama default context may be too short. Verify Ollama's default context length for Llama 4 405B; some tags default to 2048. Override it with /set parameter num_ctx.
Runtime recommendation
Use vLLM with tensor-parallel=8 on H100 SXM for serving, and llama.cpp with -ngl 999 --tensor-split for multi-GPU local use. SGLang is an alternative if vLLM's memory management causes OOM at long context. Avoid Ollama for multi-GPU: it delegates to llama.cpp but obscures the tensor-split configuration. Avoid MLX-LM: Apple Silicon is not viable at this scale.
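As a starting point, the two recommended launch paths can be written as small command builders. This is a sketch, not a tested deployment recipe: the model ID is taken from the source link on this page, the GGUF filename is hypothetical, and the even tensor split assumes identical GPUs. Check the flags against the versions of vLLM and llama.cpp you have installed.

```python
import shlex

def vllm_serve_cmd(model: str, tp: int = 8, max_len: int = 8192) -> list:
    """Argument list for serving with vLLM at FP8 across `tp` GPUs."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp),
        "--quantization", "fp8",
        "--max-model-len", str(max_len),
    ]

def llama_server_cmd(gguf_path: str, n_gpus: int = 4, ctx: int = 4096) -> list:
    """Argument list for llama.cpp's server with an even row split."""
    split = ",".join(["1"] * n_gpus)  # equal share per GPU (assumption)
    return [
        "llama-server", "-m", gguf_path,
        "-ngl", "999",              # offload all layers to GPU
        "--split-mode", "row",
        "--tensor-split", split,
        "-c", str(ctx),
    ]

print(shlex.join(vllm_serve_cmd("meta-llama/Llama-4-405B")))
# "llama-4-405b-q4_k_m.gguf" is a hypothetical local filename.
print(shlex.join(llama_server_cmd("llama-4-405b-q4_k_m.gguf")))
```

Building the argument list in code keeps the GPU count, split, and context length in one place, which is the configuration Ollama hides.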
Common beginner mistakes
Mistake: Thinking dual RTX 4090 (48 GB) can run 405B.
Fix: Q4 is 231 GB. Even Q2_K is ~116 GB. Do the math: 48 GB is about 5× too small for Q4.
Mistake: Running at 128K context on Q4_K_M with 4× A100.
Fix: KV cache at 128K is 300+ GB alone. 4K is the realistic starting point; 8K fits with headroom.
Mistake: Using Ollama without checking --tensor-split.
Fix: llama.cpp row-split requires explicit GPU assignment, and Ollama obscures this. Use the raw llama.cpp server for multi-GPU.
Mistake: Expecting fast first-token on 405B.
Fix: Time-to-first-token at 4K context is 5-15 seconds on 4× A100. Speculative decoding with a 7B draft model speeds up generation after the first token, but prefill still dominates the initial wait.
Mistake: Renting GPUs without NVLink and expecting high throughput.
Fix: Without NVLink, MFU drops below 25%. Rent NVLink-connected instances (A100 SXM, H100 SXM) if throughput matters.
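The 128K context mistake follows from linear KV-cache scaling. The sketch below extrapolates from the ~25 GB at 8K figure quoted under Hardware guidance; the per-token value is an assumption derived from that figure, not a measurement, and the context ceiling it reports ignores framework overhead.

```python
KV_GB_AT_8K = 25.0                  # figure quoted under Hardware guidance
KV_GB_PER_TOKEN = KV_GB_AT_8K / 8192  # derived per-token cost (assumption)
Q4_WEIGHTS_GB = 231.0               # Q4_K_M weights, as quoted above

def kv_cache_gb(context_tokens: int) -> float:
    """KV cache size at batch=1, assuming linear growth with context."""
    return KV_GB_PER_TOKEN * context_tokens

def max_context(pool_gb: float, weights_gb: float = Q4_WEIGHTS_GB) -> int:
    """Largest context whose KV cache fits beside the weights."""
    free_gb = pool_gb - weights_gb
    return max(0, int(free_gb / KV_GB_PER_TOKEN))

print(f"KV cache at 128K: {kv_cache_gb(131072):.0f} GB")
print(f"Upper bound on context for 4x A100 80GB: {max_context(320)} tokens")
```

The 128K cache alone exceeds the entire 320 GB pool, so the full context window is out of reach on the minimum configuration regardless of quantization.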
Strengths
- Frontier-tier reasoning
- Strong multilingual
Weaknesses
- Multi-GPU cluster only
- Llama Community License usage restrictions for very large companies
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| AWQ-INT4 | 230.0 GB | 280 GB |
Get the model
HuggingFace
Original weights
Source repository — direct quantization required.
Source: huggingface.co/meta-llama/Llama-4-405B
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.