RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Notable models & companies

Llama (Meta)

Llama is a family of open-weight large language models (LLMs) developed by Meta, starting with Llama 1 in 2023 and continuing through Llama 2, Llama 3, and Llama 3.1. The models are designed for text generation and chat, with sizes ranging from 7B parameters (Llama 1 and 2) to 405B (Llama 3.1); the Llama 3 and 3.1 lineup starts at 8B. Operators encounter Llama as the default or recommended model in many local AI runtimes (Ollama, llama.cpp, LM Studio) because Meta's community license allows free use and redistribution for most users. The models use a decoder-only transformer architecture, with grouped-query attention in Llama 2 70B and in every Llama 3 and 3.1 size, and are often quantized (e.g., Q4_K_M) to fit consumer VRAM. Because the family is so widely used, most local AI software prioritizes compatibility and optimization for it.

Deeper dive

Meta released Llama 1 in February 2023 as a research-only model, then Llama 2 in July 2023 with a commercial-friendly license. Llama 3 (April 2024) introduced 8B and 70B sizes, and Llama 3.1 (July 2024) added a 405B model and extended context length to 128K tokens. The architecture is a decoder-only transformer with RoPE (rotary position embeddings), SwiGLU activation, and grouped-query attention (GQA) for efficiency. For operators, the key practical differences between versions are license and performance. The Llama 2, 3, and 3.1 licenses all require a separate license from Meta for products with more than 700 million monthly active users; Llama 3.1 is more permissive in that it also allows model outputs to be used to improve other models, which the earlier licenses prohibited. The 8B model at Q4 quantization (about 5 GB of weights) fits on most consumer GPUs; the 70B at Q4 (about 40 GB) requires a 48 GB card or offloading. The 405B model is impractical for local use without multiple GPUs or heavy quantization. Most runtimes (llama.cpp, Ollama) auto-detect and optimize for Llama architectures, making these the easiest models to run locally.
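The size figures above follow from simple arithmetic: parameter count times bits per weight. The sketch below is a rough estimate only. The function name is hypothetical, the parameter counts are the nominal ones (8B, 70B, 405B), and the 4.85 bits-per-weight value is an assumed approximation for a Q4_K_M-style quantization, so real GGUF files will differ somewhat.

```python
# Rough weight-size estimate: parameters x bits per weight.
# ASSUMPTION: ~4.85 bits/weight approximates a Q4_K_M-style quantization;
# actual file sizes vary with the tensor mix and metadata.
Q4_BITS_PER_WEIGHT = 4.85

# Nominal parameter counts (the marketing sizes, not exact counts).
NOMINAL_PARAMS = {"8B": 8e9, "70B": 70e9, "405B": 405e9}


def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Return the approximate size of the quantized weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for name, n_params in NOMINAL_PARAMS.items():
    size = weight_size_gb(n_params, Q4_BITS_PER_WEIGHT)
    print(f"Llama {name} at ~Q4: about {size:.0f} GB of weights")
```

This counts weights only. The KV cache and runtime buffers come on top, which is why a card needs headroom beyond the file size.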

Practical example

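An operator with an RTX 3090 (24 GB VRAM) can run Llama 3.1 8B at Q4_K_M (about 5 GB of weights) at roughly 50 tok/s. The full 128K context is possible but tight, because the FP16 KV cache alone grows to about 16 GiB at that length. Llama 3.1 70B only comes within reach at 2-bit quantization: a Q2_K file is roughly 26 GB, so it needs partial CPU offload or a smaller 2-bit variant, with reduced quality and about 10 tok/s or less. For the 405B model, even Q2_K (roughly 150 GB) exceeds consumer VRAM, requiring multi-GPU or CPU offload at <1 tok/s.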
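Whether a long context fits depends on the KV cache as well as the weights. The sketch below estimates the cache size, assuming Llama 3.1 8B's published configuration (32 layers, 8 key/value heads, head dimension 128) and an unquantized FP16 cache. The function name is hypothetical, and the 32-head comparison is an illustration of what the cache would cost without grouped-query attention.

```python
GIB = 2**30


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for n_tokens (FP16 by default)."""
    # Factor of 2: one tensor for keys, one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


context = 128 * 1024  # 128K tokens

# Llama 3.1 8B with grouped-query attention: 8 KV heads.
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=context)

# Hypothetical comparison: same model if every one of 32 heads kept its own K/V.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, n_tokens=context)

print(f"GQA cache at 128K tokens: {gqa / GIB:.0f} GiB")
print(f"Without GQA (32 KV heads): {mha / GIB:.0f} GiB")
```

Under these assumptions the cache is 16 GiB with GQA and 64 GiB without it. Added to roughly 5 GB of Q4 weights, the GQA figure lands close to a 24 GB card's limit, so shorter contexts or a quantized KV cache leave more headroom.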

Workflow example

In Ollama, running `ollama pull llama3.1:8b` downloads the model (about 5 GB) and stores it under `~/.ollama/models/blobs`. The runtime then loads it into VRAM; if VRAM is insufficient, it offloads part of the model to system RAM and the CPU, which lowers tokens/sec. In llama.cpp, the command `./main -m Meta-Llama-3.1-8B.Q4_K_M.gguf -p "Hello"` loads the quantized GGUF file and runs inference (recent builds name this binary `llama-cli`). LM Studio provides a GUI to download and chat with Llama models, showing VRAM usage and token rate in real time.
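Once the model is pulled, the token rate can also be measured from a script instead of a GUI. The following is a minimal sketch, assuming an Ollama server is running locally on its default port (11434) and that `llama3.1:8b` has already been pulled. The helper names are hypothetical; the `/api/generate` endpoint and the `eval_count` and `eval_duration` response fields are part of Ollama's HTTP API.

```python
import json
import urllib.request

# ASSUMPTION: Ollama is serving on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(result: dict) -> float:
    """Generation speed from Ollama's stats; eval_duration is in nanoseconds."""
    return result["eval_count"] / result["eval_duration"] * 1e9


def generate(model: str, prompt: str) -> dict:
    """Send the request and return the parsed JSON response (needs a live server)."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())
```

With a server running, `result = generate("llama3.1:8b", "Hello")` followed by `tokens_per_second(result)` gives the generation rate for that request. A sharp drop compared with earlier runs is a hint that part of the model has been offloaded to system RAM.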

Reviewed by Fredoline Eruo. See our editorial policy.
