Llama (Meta)
Llama is a family of open-weight large language models (LLMs) developed by Meta, starting with Llama 1 in 2023 and continuing through Llama 2, Llama 3, and Llama 3.1. These models are designed for text generation and chat, with sizes ranging from 7B to 405B parameters across the family (8B to 405B for Llama 3.1). Operators encounter Llama as the default or recommended model in many local AI runtimes (Ollama, llama.cpp, LM Studio), in part because Meta's community license allows free use and redistribution for most users, subject to an acceptable use policy and some conditions. The models use a decoder-only transformer architecture, with grouped-query attention in newer versions, and are often quantized (e.g., Q4_K_M) to fit consumer VRAM. Because the family is so popular, most local AI software prioritizes compatibility and optimization for it.
Deeper dive
Meta released Llama 1 in February 2023 under a research-only license, then Llama 2 in July 2023 under a license that permits commercial use. Llama 3 (April 2024) introduced 8B and 70B sizes, and Llama 3.1 (July 2024) added a 405B model and extended the context length to 128K tokens. The architecture is a decoder-only transformer with RoPE (rotary position embeddings), SwiGLU activation, and grouped-query attention (GQA), which shrinks the key-value cache; GQA appears in Llama 2 70B and in every Llama 3 and 3.1 size. For operators, the key practical differences between versions are license and performance. The Llama 2, Llama 3, and Llama 3.1 licenses all require a separate license from Meta for products with more than 700 million monthly active users; the Llama 3.1 license is more permissive mainly in that it allows model outputs to be used to train or improve other models. The 8B model at Q4 quantization (about 5 GB of weights) fits on most consumer GPUs; the 70B at Q4 (roughly 40 GB) requires a 48 GB card or partial offload to system RAM. The 405B model is impractical for local use without multiple GPUs or heavy quantization. Most runtimes (llama.cpp, Ollama) detect Llama architectures automatically and are well optimized for them, which makes these among the easiest models to run locally.
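The weight sizes quoted above follow from simple arithmetic: parameter count times bits per weight. The sketch below is a rough estimator, not a measurement. The bits-per-weight figures are approximate assumptions (quantized GGUF files mix several tensor formats), and the function name `estimate_weight_gb` is hypothetical. It covers weights only and ignores the KV cache and runtime buffers.

```python
# Rough weight-memory estimate for a quantized model.
# Bits-per-weight values are approximate assumptions, not official figures.
APPROX_BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
}


def estimate_weight_gb(params_billion: float, quant: str) -> float:
    """Return the approximate weight size in GB (10^9 bytes).

    Excludes the KV cache and any runtime scratch buffers.
    """
    bits = APPROX_BITS_PER_WEIGHT[quant]
    # billions of parameters * bytes per parameter = gigabytes
    return params_billion * bits / 8


if __name__ == "__main__":
    for name, size_b in [("8B", 8.0), ("70B", 70.0), ("405B", 405.0)]:
        gb = estimate_weight_gb(size_b, "Q4_K_M")
        print(f"Llama 3.1 {name} at Q4_K_M: ~{gb:.0f} GB")
```

Under these assumptions the three Llama 3.1 sizes come out at roughly 5, 42, and 246 GB at Q4_K_M, which is why the 8B fits on almost any recent GPU while the 405B does not fit on consumer hardware at all.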
Practical example
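An operator with an RTX 3090 (24 GB VRAM) can run Llama 3.1 8B at Q4_K_M (about 5 GB of weights) at around 50 tok/s. The full 128K context is a tight fit, because an unquantized key-value cache at that length adds roughly 16 GB on top of the weights; a shorter context or a quantized KV cache leaves more headroom. Llama 3.1 70B does not fit on the same card at Q2_K, whose file is roughly 26 GB; it needs a smaller 2-bit variant (around 20 GB) or partial CPU offload, with reduced quality and speeds around 10 tok/s. For the 405B model, even Q2_K (roughly 150 GB) far exceeds consumer VRAM, requiring multi-GPU or CPU offload at <1 tok/s.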
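Context length matters for VRAM because the key-value (KV) cache grows linearly with the number of tokens. The sketch below estimates that cache for a GQA model. The function name `kv_cache_gib` is hypothetical, and the layer and head counts for Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128) are taken from the published model configuration as I understand it, so check them against the model card before relying on the result.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Estimate KV cache size in GiB for a grouped-query attention model.

    bytes_per_value=2 assumes an unquantized fp16 cache.
    """
    # Factor of 2 covers both keys and values, stored per layer and per KV head.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 2**30


# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads, head dimension 128.
full_context = kv_cache_gib(32, 8, 128, 131072)  # 128K tokens
short_context = kv_cache_gib(32, 8, 128, 8192)   # 8K tokens
print(f"128K context: {full_context:.1f} GiB, 8K context: {short_context:.1f} GiB")
```

With an fp16 cache this gives 16 GiB at 128K tokens and 1 GiB at 8K tokens, so on a 24 GB card the cache, not the 5 GB of weights, is what limits usable context.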
Workflow example
In Ollama, running `ollama pull llama3.1:8b` downloads the model (about 5 GB) and stores it by default under `~/.ollama/models/blobs`. The runtime then loads it into VRAM; if VRAM is insufficient, it offloads some layers to system RAM, which lowers tokens/sec. In llama.cpp, the command `./llama-cli -m Meta-Llama-3.1-8B.Q4_K_M.gguf -p "Hello"` (the binary was named `main` in older builds) loads the quantized GGUF file and runs inference. LM Studio provides a GUI to download and chat with Llama models, showing VRAM usage and token rate in real time.
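Once a model is pulled, Ollama also serves it over a local HTTP API, which is a convenient way to measure token rate from a script. The sketch below assumes Ollama is running on its default port (11434) and that the non-streaming `/api/generate` response includes `eval_count` and `eval_duration` (in nanoseconds), as Ollama's API documentation describes. The helper names `build_request`, `tokens_per_second`, and `generate` are hypothetical.

```python
import json
import urllib.request

# Default local endpoint; adjust if Ollama listens elsewhere (assumption).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> bytes:
    """Encode a non-streaming generate request as JSON bytes."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(body).encode("utf-8")


def tokens_per_second(reply: dict) -> float:
    """Compute generation speed from eval_count and eval_duration (ns)."""
    return reply["eval_count"] / (reply["eval_duration"] / 1e9)


def generate(model: str, prompt: str) -> dict:
    """Send the prompt to a running Ollama server and return the parsed reply."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `reply = generate("llama3.1:8b", "Hello")` returns a dictionary whose `response` field holds the generated text, and `tokens_per_second(reply)` gives the generation speed for that request.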
Reviewed by Fredoline Eruo.