by Meta
Meta's open-weight LLM family — the dominant baseline for self-hosted text generation. Llama 3.3 70B is the canonical 70B-class chat model in 2026; Llama 3.1 8B remains the most-deployed sub-13B production model.
Start with Llama 3.3 70B at Q4_K_M via Ollama — it matches Llama 3.1 405B on reasoning benchmarks at 6× lower serving cost and fits on 2× RTX 4090 (48 GB combined VRAM). The 70B sits at the optimal price-quality intersection: MMLU 86.9%, GSM8K 90.5%, usable context to 32K without KV-cache blowout. If you have < 24 GB VRAM, drop to Llama 3.1 8B at Q5_K_M (6 GB) — it runs on a MacBook Pro M4 Max at 25+ tok/s and handles 90% of personal assistant workloads. Skip Llama 3.1 405B and Llama 4 Behemoth for local use — they require datacenter hardware for usable throughput. Skip the Llama 3.2 vision variants unless you specifically need on-device vision — the text models are more mature and better supported.
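A minimal sketch of that single-user path, using the `ollama` Python client. The `llama3.3:70b` tag and its default Q4_K_M build are assumptions; verify with `ollama show llama3.3:70b` before relying on it.

```python
# Minimal single-user chat against a local Ollama daemon.
# Assumes `ollama pull llama3.3:70b` has completed and that this tag
# serves the default Q4_K_M quantization (check with `ollama show`).
import ollama

messages = [{"role": "user", "content": "Explain GQA in two sentences."}]

# Stream the reply so time-to-first-token is visible on slower hardware.
for chunk in ollama.chat(model="llama3.3:70b", messages=messages, stream=True):
    print(chunk["message"]["content"], end="", flush=True)
```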
- Single-user local: Ollama + llama3.3:70b Q4_K_M on 2× RTX 4090, or Mac Studio M3 Ultra 192 GB via MLX-LM.
- Multi-user serving: vLLM 0.6.3+ with AWQ 4-bit on 4× H100 SXM — achieves ~8,000 tok/s at batch 64 with continuous batching and prefix caching enabled (a Python sketch follows this list).
- Mobile/edge: llama.cpp running Llama 3.1 8B Q4_0 on Snapdragon X Elite via ARM NEON — ~18 tok/s decode.
- Maximum single-GPU throughput: ExLlamaV2 at 4.0 bpw on RTX 5090 32 GB — ~45 tok/s decode with flash-attention.
- Datacenter: TensorRT-LLM FP8 on 8× H100 SXM — ~25,000 tok/s at batch 256.

See GPU buyer guide.
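For the multi-user row, a hedged sketch against vLLM's offline Python API. The checkpoint id is a placeholder for whichever Llama 3.3 70B AWQ 4-bit repo you actually use, and the flags are as of vLLM 0.6.x:

```python
# Sketch of multi-user serving with vLLM's Python API (vLLM 0.6.x).
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/or/hub-id-of-llama-3.3-70b-awq",  # placeholder repo id
    quantization="awq",
    tensor_parallel_size=4,       # 4× H100 SXM
    enable_prefix_caching=True,   # reuses KV cache across shared prompt prefixes
    max_model_len=32768,          # matches the usable-context claim above
)

params = SamplingParams(temperature=0.7, max_tokens=256)
# Continuous batching is automatic: pass many prompts, vLLM schedules them.
outputs = llm.generate(["Prompt 1", "Prompt 2"], params)
for out in outputs:
    print(out.outputs[0].text)
```

In production you would typically run `vllm serve` with its OpenAI-compatible endpoint rather than the offline API; the engine flags carry over.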
Verify Llama runs on your specific hardware before committing money.
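Before buying, a back-of-envelope sketch of the memory math. Assumed figures: Llama 3.3 70B uses 80 layers, 8 GQA KV heads, and head dim 128, and Q4_K_M lands near 4.8 effective bits per weight.

```python
# Back-of-envelope VRAM estimate. A sketch, not a guarantee: real
# runtimes add activation buffers and framework overhead on top.

def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Quantized weight footprint in GB."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """K and V tensors per layer, per token, FP16 by default."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * context / 1e9

# Llama 3.3 70B at Q4_K_M (~4.8 effective bits/weight), 32K context:
w = weights_gb(70, 4.8)               # ≈ 42.0 GB
kv = kv_cache_gb(80, 8, 128, 32_768)  # ≈ 10.7 GB at FP16
print(f"weights ≈ {w:.1f} GB, KV @32K ≈ {kv:.1f} GB, total ≈ {w + kv:.1f} GB")
```

At the full 32K window with an FP16 KV cache the total edges past 48 GB, which is why in practice you shorten the context or quantize the KV cache to stay inside 2× RTX 4090.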