RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
What is Local AI — And Why It Matters

04. What is a Model, Really?

Chapter 4 of 20 · 18 min
KEY INSIGHT

A model is a large file of numbers (weights) representing learned patterns. Quantization shrinks that file by storing each weight with less precision: Q4 quantization typically brings a 7B model down to about 4GB while losing only around 1-3% accuracy on most tasks.

A Model is a File

Let's demystify this. A "model" is a file. Specifically, it's a file containing billions of numbers that represent learned patterns.

When you download a model, you're downloading:

  • A set of weights (the learned numbers)
  • A configuration file (architecture details)
  • Sometimes, additional files (tokenizer, vocab)

The file size tells you a lot:

  • Llama 2 7B (FP16, 16-bit unquantized baseline): ~14GB
  • Llama 2 7B (Q4_K_M, 4-bit quantization): ~4GB
  • TinyLlama 1.1B (Q4): ~600MB

The first two entries are the same model stored at different precision. This is where quantization comes in.

What Are Weights?

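During training, the model learns to predict text. What it learns is stored as weights: floating-point numbers that together encode the patterns the model picked up. No single weight is a readable "rule"; the behavior comes from billions of them working together.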

A model with 7 billion parameters has 7,000,000,000 weights. That's a lot of numbers.

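In full precision (FP32), each weight takes 4 bytes, so a 7B model needs 28GB for the weights alone. In half precision (FP16), the usual download baseline, each weight takes 2 bytes, which gives 14GB. In 4-bit quantization (Q4), each weight takes roughly 0.5 bytes plus a small overhead for scaling data, so the same model comes to ~4GB.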
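The arithmetic above can be sketched in a few lines. This is a rough estimate of weight storage only; the function name is made up for illustration, and the 4.5 bits-per-weight figure for a "4-bit" file is an assumption standing in for the scaling overhead, not a measured value.

```python
def weight_file_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 1e9

seven_b = 7e9  # 7 billion parameters

# Bits per weight for each precision; 4.5 is an assumed effective rate
# for a 4-bit format once per-block scaling data is included.
for label, bits in [("FP32", 32), ("FP16", 16), ("4-bit (ideal)", 4), ("4-bit (assumed 4.5 bpw)", 4.5)]:
    print(f"{label}: {weight_file_size_gb(seven_b, bits):.1f} GB")
```

Real files also carry metadata and may keep some tensors at higher precision, so treat the result as a ballpark figure.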

What do you lose with quantization?

Quantization is lossy compression. You trade some accuracy for file size. As a rule of thumb, 4-bit quantization costs about 1-3% accuracy on most tasks, though the exact loss depends on the model and the task. That is usually a worthwhile tradeoff for fitting the model in your hardware.
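To see why quantization is lossy, here is a minimal sketch that rounds a handful of made-up weights to 4-bit integer codes and converts them back. It uses one scale for the whole list, which is a simplification: formats such as Q4_K_M use per-block scales and are more elaborate, so this is not the algorithm llama.cpp uses.

```python
def quantize_4bit(weights):
    """Symmetric round-to-nearest quantization to integer codes in [-7, 7]."""
    # One shared scale maps the largest magnitude onto code 7.
    # Fall back to 1.0 if every weight is zero.
    scale = max(abs(w) for w in weights) / 7 or 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate weights from the integer codes."""
    return [c * scale for c in codes]

weights = [0.82, -0.41, 0.07, -0.93, 0.30, 0.0, 0.55, -0.12]  # illustrative values
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print(f"largest rounding error: {max_error:.4f} (scale = {scale:.4f})")
```

The restored values are close to the originals but not identical. The rounding error per weight is at most half the scale, and that small error across billions of weights is the accuracy you give up.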

Different quantization formats:

  • FP16: Unquantized 16-bit baseline, largest size (~14GB for 7B; for reference only, typically too big)
  • Q5_K_M: High quality, moderate size (~4.8GB for 7B)
  • Q4_K_M: Good quality/size balance (~4GB for 7B) - most common
  • Q3_K_M: Lower quality, smaller size (~3.5GB for 7B)
  • Q2_K: Lowest practical quality (~2.5GB for 7B) - often too degraded

For most use cases, Q4_K_M is the sweet spot. Q5 if you have the space, Q3 if you don't.

The Architecture: Transformers

Most modern language models use the Transformer architecture. You don't need to understand the math, but you need to know the pieces:

Tokenizer: Converts your text into tokens (pieces of words). Different models use different tokenizers, so the same text can become a different number of tokens. That is one reason tokens/second figures are not directly comparable across models.
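A small sketch makes the tokenizer point concrete. Both tokenizers below are hypothetical toys (real models use learned subword vocabularies), and the 20 tokens/second speed is an assumed figure used only to show the effect.

```python
def word_tokenizer(text):
    """Hypothetical tokenizer A: one token per whitespace-separated word."""
    return text.split()

def chunk_tokenizer(text, size=4):
    """Hypothetical tokenizer B: fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

text = "Local models run on your own hardware"
tokens_a = word_tokenizer(text)
tokens_b = chunk_tokenizer(text)

tokens_per_second = 20  # assumed identical generation speed for both
for name, toks in [("A", tokens_a), ("B", tokens_b)]:
    seconds = len(toks) / tokens_per_second
    print(f"tokenizer {name}: {len(toks)} tokens, {seconds:.2f} s for the same text")
```

At the same tokens/second, the tokenizer that splits the text into more tokens takes longer to produce the same sentence.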

Embeddings: Converts tokens into number vectors.

Attention layers: The core of Transformers—computes relationships between tokens (this is what makes "context" work).

Feed-forward layers: Processes the attention output.

Output layer: Converts numbers back to token probabilities.

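This architecture is why models can handle long contexts: attention computes relationships between all tokens in the sequence. The cost is that the number of token pairs grows with the square of the sequence length, so longer contexts need more memory and compute.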
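For readers who want to see the attention step, here is a minimal sketch of scaled dot-product attention for a single head in plain Python. It leaves out the learned projection matrices, masking, and multiple heads that real models use, and the three 2-dimensional "token" vectors are made up for illustration.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, no masking."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token against every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        all_weights.append(weights)
        # Output is the weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # illustrative embeddings
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

The weight matrix has one row and one column per token, so a sequence of n tokens produces n × n scores.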

Model Files You May Encounter

GGUF (GPT Generated Unified Format): The format used by llama.cpp and compatible tools. This is what you'll typically download for local AI.

PyTorch (.pt, .pth): Original training format, not optimized for inference.

Safetensors (.safetensors): A safer alternative to pickle-based PyTorch checkpoints, but it typically still needs conversion (for example to GGUF) for efficient local use.

ONNX: Cross-platform format, some local tools support it.

For local AI, GGUF is the standard. Your tooling will handle the format, but knowing these names helps you recognize file extensions and understand what you're downloading.
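If you want to check what a downloaded file actually is, a short sketch like the following can help. It relies on GGUF files starting with the four ASCII bytes "GGUF" followed by a 32-bit version number, read here as little-endian (an assumption that holds for typical files); for everything else it only guesses from the extension. The function name and the hint table are illustrative, not part of any tool.

```python
import struct
from pathlib import Path

# Extension-based hints for the formats listed above (a guess, not a check).
EXTENSION_HINTS = {
    ".gguf": "GGUF",
    ".safetensors": "Safetensors",
    ".pt": "PyTorch checkpoint",
    ".pth": "PyTorch checkpoint",
    ".onnx": "ONNX",
}

def describe_model_file(path):
    """Identify GGUF by its magic bytes; otherwise fall back to the extension."""
    path = Path(path)
    with path.open("rb") as f:
        header = f.read(8)
    if len(header) == 8 and header[:4] == b"GGUF":
        (version,) = struct.unpack("<I", header[4:8])
        return f"GGUF (version {version})"
    return EXTENSION_HINTS.get(path.suffix.lower(), "unknown")
```

Calling `describe_model_file("model.gguf")` on a real download should report GGUF and its version; a renamed or truncated file will fall through to the extension hint or "unknown".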

EXERCISE

Go to a model repository like TheBloke's Hugging Face collection (huggingface.co/TheBloke) and look at the Llama 2 7B quantized models. Notice the file sizes: compare Q2_K, Q3_K_M, Q4_K_M, Q5_K_M. Calculate what percentage size reduction each represents compared to the FP16 baseline (~14GB).
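If you would rather script the calculation, here is a minimal sketch. The helper name is made up, and only the Q4_K_M entry is filled in (using the ~4GB figure from this chapter); add the other sizes as you read them off the repository listing.

```python
def size_reduction_percent(baseline_gb, quantized_gb):
    """Percentage by which the quantized file is smaller than the baseline."""
    return (baseline_gb - quantized_gb) / baseline_gb * 100

fp16_baseline_gb = 14.0

# Fill in the sizes you find in the repository; Q4_K_M uses this chapter's ~4GB.
observed_sizes_gb = {
    "Q4_K_M": 4.0,
}

for name, size_gb in observed_sizes_gb.items():
    reduction = size_reduction_percent(fp16_baseline_gb, size_gb)
    print(f"{name}: {size_gb} GB, {reduction:.1f}% smaller than FP16")
```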
