Llama 4 405B

Meta's dense flagship in the Llama 4 line. 405B params; comparable footprint to Llama 3.1 405B with the Llama 4 reasoning improvements.

License: Llama 4 Community License · Released Feb 10, 2026 · Context: 131,072 tokens

How to run it

Llama 4 405B is Meta's largest dense model: 405B parameters, 231 GB on disk at Q4_K_M, ~405 GB at FP8 (1 byte per parameter), and ~810 GB at FP16 (2 bytes per parameter). There is no single-GPU path. The smallest config that loads is 4× A100 80GB at Q4_K_M (320 GB pool) for batch=1 at 4K context. Recommended: 8× H100 SXM at FP8 with vLLM tensor-parallel=8.

Expected throughput: ~8-15 tok/s per user at batch=1 on 8× H100 at FP8, and ~5-10 tok/s at batch=1 with Q4_K_M on 4× A100. KV cache at 8K context adds ~25-35 GB.

Llama 4 405B uses the standard LLaMA architecture, so ecosystem support is broad. llama.cpp supports it with row-split across GPUs; select the devices with CUDA_VISIBLE_DEVICES. The default Ollama tag should use Q4_K_M. A Mac Studio M4 Ultra 192 GB at Q2_K (~116 GB) is theoretically loadable at 2-4 tok/s, but is not recommended for interactive use. Cloud rental: 4× H100 at ~$25-40/hr.

Hardware guidance

Minimum: 4× A100 80GB at Q4_K_M (231 GB weights + ~12 GB KV at 4K = 243 GB, which fits the 320 GB pool). Recommended: 8× H100 SXM at FP8 with NVLink for tensor-parallel communication.

VRAM math: dense 405B at Q4_K_M is ~0.57 bytes/param → ~231 GB. KV cache grows linearly with context length; the totals quoted here (~12 GB at 4K, ~25 GB at 8K) work out to roughly 3 MB per token across the model's 128 layers. Total minimum VRAM: ~256 GB for Q4 at 8K, batch=1.

4× A100 80GB = 320 GB: comfortable, with headroom. 4× RTX A6000 48GB = 192 GB: insufficient for Q4_K_M; must use Q2_K (116 GB) with severe quality loss. Mac Studio M4 Ultra 128GB: Q2_K only. No single-consumer-GPU option.
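
The VRAM arithmetic above can be checked with a few lines of Python. This is a minimal sketch that uses only the figures quoted in this section (0.57 bytes/param, ~25 GB of KV cache at 8K, 116 GB for Q2_K). The helper names `weights_gb` and `fits` are illustrative, and the check ignores activation buffers and runtime overhead, so treat a "fits" result as a lower bound rather than a guarantee.

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight footprint in GB: billions of parameters times bytes per parameter."""
    return params_billion * bytes_per_param


def fits(pool_gb: float, weights: float, kv_gb: float) -> bool:
    """True when weights plus KV cache fit in the pooled VRAM (overhead ignored)."""
    return weights + kv_gb <= pool_gb


# Figures quoted in the text above.
q4_gb = weights_gb(405, 0.57)   # Q4_K_M weights, about 231 GB
q2_gb = 116                     # Q2_K weights as quoted
kv_8k_gb = 25                   # KV cache at 8K context, batch=1

a100_pool = 4 * 80              # 4x A100 80GB
a6000_pool = 4 * 48             # 4x RTX A6000 48GB

print(f"Q4_K_M weights: {q4_gb:.0f} GB")
print("4x A100, Q4_K_M at 8K fits:", fits(a100_pool, q4_gb, kv_8k_gb))
print("4x A6000, Q4_K_M at 8K fits:", fits(a6000_pool, q4_gb, kv_8k_gb))
print("4x A6000, Q2_K at 8K fits:", fits(a6000_pool, q2_gb, kv_8k_gb))
```

Swapping in a different pool size or quantization figure is a quick way to sanity-check a rental before paying for it.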

What breaks first

1. Cross-GPU communication bottleneck. On non-NVLink (PCIe-only) setups, tensor-parallel bandwidth becomes the bottleneck. MFU drops to 15-25% on 4× A100 without NVLink. Use NVLink-bridged pairs whenever possible.
2. First-token latency. 405B dense with tensor-parallel incurs 5-15 second time-to-first-token at 4K context on 4× A100. Not suitable for latency-sensitive applications without speculative decoding.
3. Q2 quality cliff. Q2_K quantization on 405B is viable for loading, but quality degrades significantly on factual accuracy and complex reasoning. Benchmark your task before committing to Q2.
4. Ollama default tag may use insufficient context. Verify Ollama's default context length for Llama 4 405B; some tags default to 2048. Override it with /set parameter num_ctx in the Ollama prompt.
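
For the context-length issue, the same override can be sent per request through Ollama's HTTP API by setting `num_ctx` under `options`. The sketch below only builds and prints the request body. The model tag `llama4:405b` is a hypothetical placeholder (check the tag you actually pulled), and the URL assumes Ollama's default local port.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "llama4:405b"                            # hypothetical tag, not confirmed


def build_request(prompt: str, num_ctx: int = 8192) -> dict:
    """Build a generate request that overrides the context window via options.num_ctx."""
    return {
        "model": MODEL_TAG,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def send(payload: dict) -> dict:
    """POST the payload to a running Ollama server (not called in this sketch)."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


payload = build_request("Summarize the NVLink trade-offs.", num_ctx=8192)
body = json.dumps(payload, indent=2)
print(body)
```

Setting `num_ctx` explicitly on every request avoids depending on whatever default a given tag ships with.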

Runtime recommendation

Use vLLM with tensor-parallel=8 on H100 SXM for serving, and llama.cpp with -ngl 999 --tensor-split for multi-GPU local use. SGLang is an alternative if vLLM's memory management causes OOM at long context. Avoid Ollama for multi-GPU: it delegates to llama.cpp but obscures the tensor-split config. Avoid MLX-LM: Apple Silicon is not viable at this scale.
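
As a rough illustration of the two recommended launch paths, the sketch below assembles the command lines as argument lists and prints them without running anything. The model identifier and GGUF path are hypothetical placeholders. The flags are the standard ones for `vllm serve` and `llama-server`, and the even `1,1,1,1` split is an assumption for four identical GPUs.

```python
import shlex


def vllm_serve_cmd(model: str, tp: int = 8, quantization: str = "fp8",
                   max_model_len: int = 8192) -> list[str]:
    """Argument list for a vLLM server sharded across `tp` GPUs."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp),
        "--quantization", quantization,
        "--max-model-len", str(max_model_len),
    ]


def llama_server_cmd(gguf_path: str, n_gpus: int = 4, ctx: int = 4096) -> list[str]:
    """Argument list for llama.cpp's server with all layers offloaded and row-split."""
    split = ",".join(["1"] * n_gpus)  # equal share per GPU (assumed identical cards)
    return [
        "llama-server", "-m", gguf_path,
        "-ngl", "999",
        "--split-mode", "row",
        "--tensor-split", split,
        "-c", str(ctx),
    ]


# Hypothetical names, used only for illustration.
vllm_cmd = vllm_serve_cmd("meta-llama/Llama-4-405B-Instruct")
llama_cmd = llama_server_cmd("llama-4-405b.Q4_K_M.gguf", n_gpus=4, ctx=4096)

print(shlex.join(vllm_cmd))
print(shlex.join(llama_cmd))
```

Building the arguments as a list keeps the GPU count and context length in one place, which makes it harder to launch with a split that does not match the hardware.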

Common beginner mistakes

Mistake: Thinking dual RTX 4090 (48 GB) can run 405B.
Fix: Q4 is 231 GB. Even Q2_K is ~116 GB. Do the math: 48 GB is 5× too small.

Mistake: Running at 128K context on Q4_K_M 4× A100.
Fix: KV cache at 128K is 300+ GB alone. 4K is the realistic starting point; 8K with headroom.

Mistake: Using Ollama without checking --tensor-split.
Fix: llama.cpp row-split requires explicit GPU assignment. Ollama obscures this. Use raw llama.cpp server for multi-GPU.

Mistake: Expecting fast first-token on 405B.
Fix: Time-to-first-token at 4K context is 5-15 seconds on 4× A100. Speculative decoding with a 7B draft model cuts this significantly.

Mistake: Renting GPUs without NVLink and expecting high throughput.
Fix: Without NVLink, MFU drops below 25%. Rent NVLink-bridged instances (A100 SXM, H100 SXM) if throughput matters.
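
The 128K-context mistake is easiest to see by scaling the KV cache linearly with context length. The sketch below extrapolates from the ~25 GB at 8K figure quoted in the hardware guidance. Linear scaling at batch=1 is an assumption, and the helper name `kv_cache_gb` is illustrative.

```python
KV_GB_AT_8K = 25.0      # KV cache at 8K context, as quoted in the hardware guidance
REF_TOKENS = 8192
Q4_WEIGHTS_GB = 231     # Q4_K_M weights
POOL_GB = 4 * 80        # 4x A100 80GB


def kv_cache_gb(context_tokens: int) -> float:
    """KV cache size in GB, assuming it grows linearly with context length."""
    return KV_GB_AT_8K * context_tokens / REF_TOKENS


for ctx in (4096, 8192, 32768, 131072):
    kv = kv_cache_gb(ctx)
    total = Q4_WEIGHTS_GB + kv
    verdict = "fits" if total <= POOL_GB else "does not fit"
    print(f"{ctx:>7} tokens: KV {kv:6.1f} GB, total {total:6.1f} GB, {verdict} in {POOL_GB} GB")
```

Under these assumptions the cache alone passes the 320 GB pool well before 128K, which is why 4K to 8K is the practical range on 4× A100.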

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Family siblings (llama-4)

Llama 4 405B · 405B (you are here)