Parameter Count Guide — Understanding AI Models (Chapter 2)

Parameter count is the most discussed model spec, so understanding what it means, and what it does not mean, is essential.

Where parameters come from:

A transformer language model has several parameter groups:

Embedding layer: vocab_size x embedding_dim. For a 32k vocabulary with 4096-dim embeddings, this alone is ~131M parameters.
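Attention weights: 4 x layers x embedding_dim^2, where embedding_dim = head_count x head_dim. These are the Q, K, V, and output projections for each layer.
FFN (feed-forward) weights: 2 x layers x embedding_dim x ff_dim (3 x for gated variants such as SwiGLU), with ff_dim typically about 4 x embedding_dim. This is usually 60-70% of total parameters in dense models.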
Layer norms: a few thousand parameters per layer, negligible in total count.
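
To make the breakdown above concrete, here is a minimal sketch that adds up the four groups for a dense model. The function name and the example configuration (32 layers, ff_dim = 4 x embedding_dim) are illustrative assumptions, not the specification of any particular model. Biases are ignored, and an untied output head would add another vocab_size x embedding_dim on top.

```python
def estimate_params(vocab_size, embedding_dim, layers, ff_dim, gated_ffn=False):
    """Rough parameter estimate for a dense transformer (biases ignored)."""
    embedding = vocab_size * embedding_dim
    # Q, K, V, and output projections: four embedding_dim x embedding_dim
    # matrices per layer (standard multi-head attention is assumed).
    attention = 4 * layers * embedding_dim ** 2
    # Up and down projections, plus a gate projection in gated FFNs.
    ffn_matrices = 3 if gated_ffn else 2
    ffn = ffn_matrices * layers * embedding_dim * ff_dim
    # Assumes two norms per layer with one scale vector each.
    norms = 2 * layers * embedding_dim
    groups = {
        "embedding": embedding,
        "attention": attention,
        "ffn": ffn,
        "norms": norms,
    }
    groups["total"] = sum(groups.values())
    return groups


# Hypothetical 7B-class configuration, chosen only for illustration.
counts = estimate_params(
    vocab_size=32_000, embedding_dim=4096, layers=32, ff_dim=4 * 4096
)
for name, n in counts.items():
    share = n / counts["total"]
    print(f"{name:>10}: {n / 1e9:7.3f}B  ({share:6.1%})")
```

With these assumed settings the FFN group accounts for roughly two thirds of the total, which is in line with the 60-70% figure above.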

Why parameter count varies in meaning:

Two 7B models with different architectures can have different capability profiles despite sharing the same parameter count. MoE models (covered in Chapter 3) can match dense models while using far fewer active parameters per token.
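
The gap between total and active parameters can be sketched with simple arithmetic. The split into shared and per-expert parameters, the function name, and every number below are hypothetical; Chapter 3 covers how real MoE models are laid out.

```python
def moe_param_counts(shared_params, expert_params, num_experts, experts_per_token):
    """Return (total, active) parameter counts for a toy MoE layout.

    shared_params: parameters every token uses (embeddings, attention, ...).
    expert_params: parameters in one expert, summed over all layers.
    """
    total = shared_params + num_experts * expert_params
    active = shared_params + experts_per_token * expert_params
    return total, active


# Hypothetical numbers, chosen only to show the arithmetic.
total, active = moe_param_counts(
    shared_params=2_000_000_000,
    expert_params=5_000_000_000,
    num_experts=8,
    experts_per_token=2,
)
print(f"total:  {total / 1e9:.0f}B")
print(f"active: {active / 1e9:.0f}B per token")
```

In this toy layout the headline size is 42B, but each token only passes through 12B parameters, which is why a single parameter count can describe very different compute costs.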

Rough capability guide (as of early 2026):

Parameters	Approximate capability	Notes
1-3B	Simple tasks, fast inference	Good for classification, extraction
7B	General purpose, decent reasoning	Minimum for coherent long-form output
13B	Noticeably better reasoning	Often runs on consumer GPUs
33B+	Strong reasoning, long context	Typically needs 24 GB+ of VRAM, even with 4-bit quantization
70B+	Near-top reasoning	Often MoE, 4096-token context minimum
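
The hardware notes in the table follow from how many bytes each parameter occupies. The sketch below estimates memory for the weights alone at a few common precisions; the helper name and the precision table are illustrative assumptions, and real usage is higher because the KV cache, activations, and runtime overhead are not counted.

```python
# Bytes per parameter at common storage precisions (weights only).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params, precision):
    """Memory needed to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


for size_b in (7, 13, 33, 70):
    row = ", ".join(
        f"{precision}: {weight_memory_gb(size_b * 1e9, precision):6.1f} GB"
        for precision in BYTES_PER_PARAM
    )
    print(f"{size_b:>3}B  {row}")
```

For example, a 33B model needs about 66 GB for fp16 weights but about 16.5 GB at 4 bits per weight, which is how it can fit on a single 24 GB card.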

The catch: Two 7B models from different families or training runs can differ more in capability than a 7B and a 13B from the same lineage. Parameter count is a rough guide, not a capability guarantee.

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
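
One way to capture the environment details for the checkpoint is a small script like the sketch below. It uses only the Python standard library; the package names passed in (torch and transformers) are assumptions about what a reader might have installed, so substitute whatever the exercise actually uses. Packages that are not installed are recorded as null rather than raising an error.

```python
import json
import platform
from importlib import metadata


def environment_record(packages=("torch", "transformers")):
    """Collect runtime details worth noting beside an exercise."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            versions[name] = None
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


record = environment_record()
print(json.dumps(record, indent=2))
```

Saving this output next to the observed result makes it easier to tell later whether a difference came from the model or from the environment.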