by Meta
Meta's open-weight LLM family — the dominant baseline for self-hosted text generation. Llama 3.3 70B is the canonical 70B-class chat model in 2026; Llama 3.1 8B remains the most-deployed sub-13B production model.
Start with Llama 3.3 70B at Q4_K_M via Ollama — it matches Llama 3.1 405B on reasoning benchmarks at 6× lower serving cost and fits on 2× RTX 4090 (48 GB combined VRAM). The 70B sits at the optimal price-quality intersection: MMLU 86.9%, GSM8K 90.5%, usable context to 32K without KV-cache blowout. If you have < 24 GB VRAM, drop to Llama 3.1 8B at Q5_K_M (6 GB) — it runs on a MacBook Pro M4 Max at 25+ tok/s and handles 90% of personal assistant workloads. Skip Llama 3.1 405B and Llama 4 Behemoth for local use — they require datacenter hardware for usable throughput. Skip the Llama 3.2 vision variants unless you specifically need on-device vision — the text models are more mature and better supported.
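A minimal sketch of that single-user path, using the `ollama` Python client. The `llama3.3:70b` tag and its default Q4_K_M build are assumptions; verify with `ollama show llama3.3:70b` before relying on it.

```python
# Minimal single-user chat against a local Ollama daemon.
# Assumes `ollama pull llama3.3:70b` has completed and that this tag
# serves the default Q4_K_M quantization (check with `ollama show`).
import ollama

messages = [{"role": "user", "content": "Explain GQA in two sentences."}]

# Stream the reply so time-to-first-token is visible on slower hardware.
for chunk in ollama.chat(model="llama3.3:70b", messages=messages, stream=True):
    print(chunk["message"]["content"], end="", flush=True)
```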
- Single-user local: Ollama + llama3.3:70b Q4_K_M on 2× RTX 4090, or Mac Studio M3 Ultra 192 GB via MLX-LM.
- Multi-user serving: vLLM 0.6.3+ with AWQ 4-bit on 4× H100 SXM — achieves ~8,000 tok/s at batch 64 with continuous batching and prefix caching enabled (a Python sketch follows this list).
- Mobile/edge: llama.cpp running Llama 3.1 8B Q4_0 on Snapdragon X Elite via ARM NEON — ~18 tok/s decode.
- Maximum single-GPU throughput: ExLlamaV2 at 4.0 bpw on RTX 5090 32 GB — ~45 tok/s decode with flash-attention.
- Datacenter: TensorRT-LLM FP8 on 8× H100 SXM — ~25,000 tok/s at batch 256.

See GPU buyer guide.
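For the multi-user row, a hedged sketch against vLLM's offline Python API. The checkpoint id is a placeholder for whichever Llama 3.3 70B AWQ 4-bit repo you actually use, and the flags are as of vLLM 0.6.x:

```python
# Sketch of multi-user serving with vLLM's Python API (vLLM 0.6.x).
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/or/hub-id-of-llama-3.3-70b-awq",  # placeholder repo id
    quantization="awq",
    tensor_parallel_size=4,       # 4× H100 SXM
    enable_prefix_caching=True,   # reuses KV cache across shared prompt prefixes
    max_model_len=32768,          # matches the usable-context claim above
)

params = SamplingParams(temperature=0.7, max_tokens=256)
# Continuous batching is automatic: pass many prompts, vLLM schedules them.
outputs = llm.generate(["Prompt 1", "Prompt 2"], params)
for out in outputs:
    print(out.outputs[0].text)
```

In production you would typically run `vllm serve` with its OpenAI-compatible endpoint rather than the offline API; the engine flags carry over.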
Verify Llama runs on your specific hardware before committing money.
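Before buying, a back-of-envelope sketch of the memory math. Assumed figures: Llama 3.3 70B uses 80 layers, 8 GQA KV heads, and head dim 128, and Q4_K_M lands near 4.8 effective bits per weight.

```python
# Back-of-envelope VRAM estimate. A sketch, not a guarantee: real
# runtimes add activation buffers and framework overhead on top.

def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Quantized weight footprint in GB."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """K and V tensors per layer, per token, FP16 by default."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * context / 1e9

# Llama 3.3 70B at Q4_K_M (~4.8 effective bits/weight), 32K context:
w = weights_gb(70, 4.8)               # ≈ 42.0 GB
kv = kv_cache_gb(80, 8, 128, 32_768)  # ≈ 10.7 GB at FP16
print(f"weights ≈ {w:.1f} GB, KV @32K ≈ {kv:.1f} GB, total ≈ {w + kv:.1f} GB")
```

At the full 32K window with an FP16 KV cache the total edges past 48 GB, which is why in practice you shorten the context or quantize the KV cache to stay inside 2× RTX 4090.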