04. What is a Model, Really?
A Model is a File
Let's demystify this. A "model" is a file. Specifically, it's a file containing billions of numbers that represent learned patterns.
When you download a model, you're downloading:
- A set of weights (the learned numbers)
- A configuration file (architecture details)
- Sometimes, additional files (tokenizer, vocab)
The file size tells you a lot:
- Llama 2 7B (FP16, half precision): ~14GB
- Llama 2 7B (Q4_K_M, 4-bit quantization): ~4GB
- TinyLlama 1.1B (Q4): ~600MB
The first two are the same model stored at different precision. This is where quantization comes in.
What Are Weights?
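During training, the model learns to predict text. What it learns is stored in its weights: floating-point numbers that together encode the patterns it picked up. No single weight is a readable "rule"; the behavior comes from all of them combined.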
A model with 7 billion parameters has 7,000,000,000 weights. That's a lot of numbers.
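In full precision (FP32), each weight takes 4 bytes, so a 7B model needs 7 billion × 4 bytes = 28GB for the weights alone. Half precision (FP16) uses 2 bytes per weight, or ~14GB. With 4-bit quantization (Q4), each weight takes roughly 0.5 bytes plus a little overhead, so the same model comes to ~4GB.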
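The arithmetic above can be checked in a few lines of Python. This is a minimal sketch: the function name `weights_size_gb` is made up for illustration, sizes are in decimal gigabytes (1 GB = 10^9 bytes), and it counts only the weights, not metadata or any other overhead.

```python
def weights_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    total_bytes = n_params * bits_per_weight / 8  # 8 bits per byte
    return total_bytes / 1e9


params_7b = 7e9  # a 7-billion-parameter model

for label, bits in [("FP32", 32), ("FP16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weights_size_gb(params_7b, bits):.1f} GB")
```

The pure 4-bit figure comes out a little under 4GB. Real 4-bit files such as Q4_K_M are somewhat larger because they also store scaling factors and keep some tensors at higher precision.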
What do you lose with quantization?
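Quantization is lossy compression: you trade some accuracy for a smaller file. As a rough rule of thumb, 4-bit quantization costs around 1-3% on typical benchmarks, though the exact loss varies by model and task. For most uses that is a worthwhile tradeoff for fitting the model on your hardware.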
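To see where the loss comes from, here is a toy round trip: squeeze a handful of made-up weights into 4-bit integers and expand them again. It is a deliberately simplified sketch using one shared scale for all values. Real formats such as Q4_K_M are more elaborate (they quantize small blocks of weights, each with its own scale), and the function names and sample values here are hypothetical.

```python
def quantize_4bit(values):
    """Map floats to integers in [-7, 7] using a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Made-up example weights, not taken from any real model.
weights = [0.12, -0.53, 0.98, -0.09, 0.31, -0.88, 0.44, 0.02]

codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("integer codes:", codes)
print("restored:     ", [round(r, 3) for r in restored])
print("largest error:", round(max_error, 4))
```

The restored values are close to the originals but not identical, and the rounding error can never exceed half of one quantization step. That small, bounded error on every weight is the accuracy you give up.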
Different quantization formats:
- FP16: Half precision, the usual unquantized baseline, largest size (~14GB for 7B; for reference only, typically too big)
- Q5_K_M: High quality, moderate size (~5GB for 7B)
- Q4_K_M: Good quality/size balance (~4GB for 7B) - most common
- Q3_K_M: Lower quality, smaller size (~3.5GB for 7B)
- Q2_K: Lowest practical quality (~3GB for 7B) - often too degraded
For most use cases, Q4_K_M is the sweet spot. Q5 if you have the space, Q3 if you don't.
The Architecture: Transformers
Most modern language models use the Transformer architecture. You don't need to understand the math, but you need to know the pieces:
Tokenizer: Converts your text into tokens (pieces of words). Different models use different tokenizers, so the same text can produce a different number of tokens. That is one reason tokens/second figures are not directly comparable across models.
Embeddings: Convert tokens into number vectors.
Attention layers: The core of Transformers. They compute relationships between tokens (this is what makes "context" work).
Feed-forward layers: Process the attention output.
Output layer: Converts numbers back to token probabilities.
This architecture is what lets models make use of long contexts: attention computes relationships between all tokens in the sequence. It is also why longer contexts cost more memory and compute.
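The "relationships between all tokens" idea can be sketched in plain Python. This is a toy illustration, not how any real model is implemented. The token vectors are made up, and each vector serves directly as its own query, key, and value, whereas real models apply learned projections, use many attention heads, and usually mask out future tokens.

```python
import math


def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(vectors):
    """Scaled dot-product self-attention over a list of token vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Score this token against every token in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in vectors]
        weights = softmax(scores)
        all_weights.append(weights)
        # Each output is a weighted mix of all token vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)])
    return outputs, all_weights


# Three made-up 2-dimensional "token" vectors; the first two are similar.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = attention(tokens)

for row in weights:
    print([round(w, 2) for w in row])
```

Each row shows how much one token attends to every token in the sequence, and the first token gives more weight to the similar second token than to the dissimilar third. The table has one entry per pair of tokens, so it grows with the square of the sequence length.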
Model Files You May Encounter
GGUF (GPT Generated Unified Format): The format used by llama.cpp and compatible tools. This is what you'll typically download for local AI.
PyTorch (.pt, .pth): Original training format, not optimized for inference.
Safetensors (.safetensors): A safer alternative to PyTorch's pickle-based files (loading one cannot run arbitrary code), but it still needs conversion for efficient local use.
ONNX (.onnx): Cross-platform format; some local tools support it.
For local AI, GGUF is the standard. Your tooling will handle the format for you, but recognizing the file extensions helps you understand what you're downloading.
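If you are unsure what a downloaded file is, a quick check like the sketch below can help. It relies on the fact that GGUF files begin with the four ASCII bytes `GGUF` followed by a version number, which is assumed here to be stored as a little-endian 32-bit integer. For the other formats it only looks at the extension. The function name `describe_model_file` is made up for illustration, and the demo writes a tiny stand-in file rather than a real model.

```python
import struct
import tempfile
from pathlib import Path

GGUF_MAGIC = b"GGUF"

# Extension-based fallback for the other formats mentioned above.
BY_EXTENSION = {
    ".safetensors": "Safetensors",
    ".pt": "PyTorch",
    ".pth": "PyTorch",
    ".onnx": "ONNX",
}


def describe_model_file(path):
    """Guess a model file's format from its first bytes, then its extension."""
    path = Path(path)
    with path.open("rb") as f:
        header = f.read(8)
    if len(header) == 8 and header[:4] == GGUF_MAGIC:
        (version,) = struct.unpack("<I", header[4:8])
        return f"GGUF (version {version})"
    return BY_EXTENSION.get(path.suffix.lower(), "unknown")


# Demo with a stand-in file: just the magic bytes and a version number.
with tempfile.TemporaryDirectory() as tmp:
    fake = Path(tmp) / "example.gguf"
    fake.write_bytes(GGUF_MAGIC + struct.pack("<I", 3))
    print(describe_model_file(fake))
```

Checking the first bytes is more reliable than trusting the extension, since a renamed file keeps its original contents.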
Go to a model repository such as TheBloke's Hugging Face collection (huggingface.co/TheBloke) and look at the quantized Llama 2 7B models. Notice the file sizes: compare Q2_K, Q3_K_M, Q4_K_M, and Q5_K_M. Calculate what percentage size reduction each represents compared to the FP16 baseline (~14GB).
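For the percentage calculation, a small helper like the one below may save some arithmetic. It is only a sketch: the function name `size_reduction_percent` is made up, and the example uses the approximate Q4_K_M and FP16 figures from this chapter. Substitute the actual file sizes you see on the model page.

```python
def size_reduction_percent(quantized_gb: float, baseline_gb: float) -> float:
    """Percentage by which the quantized file is smaller than the baseline."""
    return (1 - quantized_gb / baseline_gb) * 100


# Approximate figures from this chapter: Q4_K_M (~4GB) vs FP16 (~14GB).
print(f"Q4_K_M: about {size_reduction_percent(4.0, 14.0):.0f}% smaller than FP16")
```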