02. Parameter Count Guide

Chapter 2 of 20 · 10 min

Parameter count is the most discussed model spec, so understanding what it means (and what it does not mean) is essential.

Where parameters come from:

A transformer language model has several parameter groups:

  • Embedding layer: vocab_size x embedding_dim. For a 32k vocabulary with 4096-dim embeddings, this alone is ~131M parameters.
  • Attention weights: 4 x layers x embedding_dim², where embedding_dim = head_count x head_dim. This covers the Q, K, V, and output projections for each layer.
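  • FFN (feed-forward) weights: typically 2 x layers x embedding_dim x ff_dim, with ff_dim often about 4 x embedding_dim (gated variants such as SwiGLU use three matrices per layer instead of two). This is usually 60-70% of total parameters in dense models.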
  • Layer norms: a few thousand parameters per layer, negligible in total count.
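
The breakdown above can be turned into a quick estimate. The following is a minimal sketch, not any particular model's real configuration: the class name DenseConfig, the 32-layer depth, and ff_dim = 16,384 are illustrative assumptions, and the count ignores biases, a separate output head, and grouped-query attention.

```python
from dataclasses import dataclass


@dataclass
class DenseConfig:
    # Hypothetical dense-transformer config; values are illustrative only.
    vocab_size: int = 32_000
    embedding_dim: int = 4096
    layers: int = 32
    ff_dim: int = 16_384      # assumed to be 4 x embedding_dim
    ffn_matrices: int = 2     # 2 for a standard FFN, 3 for gated variants


def estimate_params(cfg: DenseConfig) -> dict:
    """Return an approximate parameter count per group, plus the total."""
    d = cfg.embedding_dim
    groups = {
        "embedding": cfg.vocab_size * d,
        # Q, K, V, and output projections: four d x d matrices per layer.
        "attention": 4 * cfg.layers * d * d,
        # Up and down projections (plus a gate matrix if ffn_matrices == 3).
        "ffn": cfg.ffn_matrices * cfg.layers * d * cfg.ff_dim,
        # Assumes two norms per layer with one scale vector each.
        "norms": 2 * cfg.layers * d,
    }
    groups["total"] = sum(groups.values())
    return groups


counts = estimate_params(DenseConfig())
for name, count in counts.items():
    print(f"{name:<10} {count:>15,} {count / counts['total']:6.1%}")
```

With these assumed defaults the total comes to roughly 6.6B parameters, with the FFN group at about 65% of it, which is consistent with the 60-70% range quoted above. Real models deviate from this estimate depending on choices such as tied embeddings or shared key/value heads.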

Why parameter count varies in meaning:

Two 7B models with different architectures can have different capability profiles despite sharing the same parameter count. MoE models (covered in Chapter 3) can match dense models while activating far fewer parameters per token.

Rough capability guide (as of early 2026):

Parameters   Approximate capability             Notes
1-3B         Simple tasks, fast inference       Good for classification, extraction
7B           General purpose, decent reasoning  Minimum for coherent long-form output
13B          Noticeably better reasoning        Often runs on consumer GPUs
33B+         Strong reasoning, long context     Requires 24GB+ VRAM
70B+         Near-top reasoning                 Often MoE, 4096-token context minimum
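
The VRAM notes in the table follow from simple arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below is a rough estimate under stated assumptions. The function name weight_memory_gb and the precision table are illustrative, and the result covers weights only, not the KV cache, activations, or framework overhead.

```python
# Approximate storage cost per parameter at common precisions (assumed values).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params: float, precision: str) -> float:
    """Estimate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


for size in (7e9, 13e9, 33e9, 70e9):
    row = ", ".join(
        f"{p}: {weight_memory_gb(size, p):6.1f} GB" for p in BYTES_PER_PARAM
    )
    print(f"{size / 1e9:>4.0f}B  {row}")
```

Under these assumptions a 33B model needs about 66 GB for weights at fp16 but about 16.5 GB at 4-bit, which is why the hardware requirement depends on quantization as much as on parameter count.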

The catch: Two 7B models from different families or training runs can differ more in capability than a 7B and a 13B from the same lineage. Parameter count is a rough guide, not a capability guarantee.

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Find the parameter counts and architectural differences between Llama 3.1 8B and Mistral 7B. Calculate what percentage of parameters come from attention vs FFN layers for each.