CPU / no GPU
- Model class: 1B-3B instruct models
- Settings: Q4, short context, patience required
Try a tiny Qwen, Gemma, or Phi-class model before a 7B download.
Start with the machine in front of you: will the model fit, what quant should you use, how fast will it run, and when is local cheaper than cloud? This is the syllabus, router, and evidence loop we use to answer those questions without guessing.
This is a starter map, not a verdict. The real answer still depends on quantization, context length, runtime overhead, and what else is using memory. Use it to avoid bad first downloads, then verify the exact model in Will-It-Run.
Try a tiny Qwen, Gemma, or Phi-class model before a 7B download.
Start with a recent 7B/8B chat model, then raise context only if memory stays stable.
Use 7B/8B for speed, 14B when answer quality matters more than latency.
Benchmark both a fast 14B and a smaller-context 32B before choosing a daily driver.
Use 32B as the serious local baseline; treat 70B as a tradeoff experiment.
Pick by task: coding, multilingual chat, long context, or agent serving.
Install a runtime, pull a small model, confirm it answers locally, then know what to fix next.
Start the 10 min recipe →
Check model size, quantization, context, and VRAM before you waste a download.
Run a fit check →
Compare model families, sizes, prompts, and evidence instead of chasing the largest parameter count.
Browse models →
Use measured local-AI suitability, VRAM headroom, and cost tradeoffs before buying hardware.
Open the hardware leaderboard →
Move from demos into RAG, agents, serving, monitoring, and repeatable operator workflows.
Pick a course track →
The external syllabus explains the field. These are the RunLocalAI-native courses and task recipes that turn it into operator muscle memory.
The long form. Multi-chapter tracks that take you from first install to running open-weight models with intent — foundations, builder, operator.
Browse courses →
The short form. One job done end to end (steps, verification, and the failures that actually bite) for a specific local-AI task.
Browse how-to guides →
Don’t try to do the whole list. Do these three in order and you’ll know more about local AI than 95% of the people who use it daily.
Pick “Let’s build GPT from scratch.” Two hours. It skips the math you don’t need and gives you the intuition you do.
Use /will-it-run to find a model + GPU combo that actually fits your VRAM. Don’t guess — the math is the math.
Every model is pickier than people assume. /prompting has tested kits per model — copy one before your first real conversation.
The decision tree we wish someone had handed us. Identify the layer blocking you, follow three free resources, then land on a RunLocalAI surface that closes the loop.
A benchmark submission helps the next reader answer a concrete question: this model, on this GPU, with this runtime and quant, produced this many tokens per second. That is how the framework moves from fit estimates to evidence-backed verdicts.
Help fill missing model x GPU x runtime cells with reproducible numbers.
Understand confidence labels before citing a number.
Pull catalog and benchmark data into your own tools.
Translate learning into monthly cost and break-even math.
runlocalai is built by Fredoline Eruo — an operator maintaining a local-AI hardware catalog, model library, fit framework, and measured benchmark table without a PhD or a research-lab budget. The list below is the syllabus that closed the gap between “I can install Ollama” and “I can defend every line on a /models page.” Each entry is hand-written. None of the links pay us. None of them are affiliate links. If a resource is missing from this list, it’s because we don’t recommend it — not because nobody offered us a kickback.
Twelve free resources, grouped. Every note ends with “Where it lands for us” — the specific decision on runlocalai that this resource sharpened.
Builds a neural net, then a tiny GPT, in plain Python from scratch. The single best free intuition for why a transformer does what it does, and once you’ve watched the weights matter you understand why quantization is lossy at all. Where it lands for us: picking Q4_K_M vs Q6_K on /will-it-run/custom stops being a guess.
The follow-up that trains GPT-2 end to end. The systems detail — mixed precision, kernel fusion, FlashAttention — is what separates “I read the paper” from “I have shipped this.” Where it lands for us: explains why a 24GB card runs a 13B model at Q4 but chokes once context grows — the KV cache math gets concrete.
The visual companion. Watch when you want geometric intuition for attention or backprop’s chain rule without writing a line of code. Pair with Karpathy; don’t substitute. Where it lands for us: the attention visualisation explains why our /benchmarks numbers swing so much with context length.
The most comprehensive free written treatment of pre-training, scaling, RLHF, and evaluation we’ve found. Reads like a textbook chapter, not a blog post. Where it lands for us: when a vendor claims their 70B model “beats GPT-4,” CS324 gives you the eval-design vocabulary to tell whether the claim is honest or staged.
The course we’d recommend to someone who’d rather build than watch. Notebooks for tokenization, fine-tuning, evaluation — less theory, more code that runs on a free Colab GPU. Where it lands for us: recreate a tokenizer in the chapter on BPE and finally see why Qwen handles Chinese gracefully and Llama 3.x stumbles.
The top-down approach: ship a state-of-the-art image model in lesson 1, derive what makes it work over the rest. Slightly dated on language models specifically, but the engineering instincts transfer wholesale. Where it lands for us: fast.ai’s “always try the obvious thing first” mindset is why our /benchmarks report real tokens/sec rather than synthetic FLOPs.
Not a course. The README is short; the Discussions tab is where the actual engineering tradeoffs play out — GGUF format changes, quantization formats, kernel tuning per chip. Where it lands for us: when a quantization format changes, this is where it’s argued first, and we cite it on /tools/llama-cpp.
Skip the install section, read the architecture pages. Paged attention is the most consequential inference optimisation of the last three years; understanding it teaches why long contexts are expensive even at small batch sizes. Where it lands for us: maps directly to why /hardware pages surface both memory bandwidth and capacity — both matter, for different reasons.
Dettmers wrote bitsandbytes and did the original LLM.int8() work. The clearest posts on quantization tradeoffs anywhere for free. Start with his 4-bit piece. Where it lands for us: every variants table on a /models page shows Q4_K_M / Q6_K / Q8_0 — Dettmers explains why those specific grades exist and which to pick.
The cleanest from-scratch explanations of RLHF and DPO outside the original papers. Working-engineer level, not researcher. Where it lands for us: explains why DeepSeek R1 explicitly recommends no system prompt — it’s a quirk of its preference-tuning setup, and we surface it as a kit caveat on /models/deepseek-r1.
Two complementary surfaces. Arena is crowdsourced human preference — hardest to game. The Open LLM Leaderboard is benchmark-based — easier to game, useful for capability cuts (math, code, reasoning). Where it lands for us: our /benchmarks page measures inference speed; Arena/Leaderboard cover model quality. Use both. Neither alone is enough.
The paper that started the transformer era. Surprisingly readable. Worth a careful read after Karpathy’s GPT-from-scratch video — the paper makes more sense once you’ve built the thing from the ground up. Where it lands for us: every architecture decision since 2017 references this paper; reading it once means the rest of the field stops feeling like jargon.
Weights are stored as 32-bit or 16-bit floats during training. At inference you can re-encode them to 4-bit or 5-bit integers with surprisingly little quality loss — networks are over-parameterised, and most weights cluster around small values that compress well. Q4_K_M stores weights in groups of 256 with per-group fp16 scales, preserving the dynamic range that matters.
| Format | Bits/weight | File size | MMLU delta |
|---|---|---|---|
| FP16 | 16 | 14 GB | baseline |
| Q8_0 | 8.5 | 7.3 GB | −0.1% |
| Q6_K | 6.6 | 5.7 GB | −0.5% |
| Q5_K_M | 5.7 | 4.9 GB | −1.2% |
| Q4_K_M | 4.8 | 4.2 GB | −2.1% |
| Q4_0 | 4.5 | 3.9 GB | −3.5% |
| Q3_K_M | 3.9 | 3.2 GB | −5.8% |
| Q2_K | 3.0 | 2.5 GB | −12% |
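The file sizes in the table follow roughly from parameter count times bits per weight. A minimal sketch, assuming a 7B-parameter model (which is what the 14 GB FP16 row implies) and a hypothetical helper named `estimate_file_gb`; real files also carry metadata and may keep some tensors at higher precision, so the estimates land close to the table rather than exactly on it.

```python
def estimate_file_gb(num_params, bits_per_weight):
    """Weights-only size in decimal GB; ignores file metadata."""
    return num_params * bits_per_weight / 8 / 1e9

PARAMS = 7e9  # assumption: a 7B model, inferred from the 14 GB FP16 row

FORMATS = [
    ("FP16", 16), ("Q8_0", 8.5), ("Q6_K", 6.6), ("Q5_K_M", 5.7),
    ("Q4_K_M", 4.8), ("Q4_0", 4.5), ("Q3_K_M", 3.9), ("Q2_K", 3.0),
]

for name, bpw in FORMATS:
    print(f"{name:7s} {estimate_file_gb(PARAMS, bpw):5.2f} GB")
```

Swap `PARAMS` for the model you are sizing to get a first estimate before checking it in Will-It-Run.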
[ block scale ][ block min ][ 256 × 4-bit quantized weights ]
      fp16          fp16                128 bytes

Operator takeaway: Q4_K_M is our default recommendation on /models for chat. For code-heavy or math-heavy use we bump to Q5_K_M or Q6_K because the accuracy delta there grows roughly 2× faster than on general MMLU.
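The block layout above can be tried out in a few lines. This is a simplified scale-plus-minimum 4-bit scheme written for illustration, not llama.cpp's actual Q4_K kernel; `quantize_block` and `dequantize_block` are hypothetical names, and the random weights are stand-ins for real model weights.

```python
import random

BLOCK = 256   # weights per block, as in the layout above
LEVELS = 15   # 4 bits give integer codes 0..15

def quantize_block(weights):
    """Map one block of floats to 4-bit codes plus a scale and a minimum."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / LEVELS or 1.0  # fall back to 1.0 for a constant block
    codes = [round((w - lo) / scale) for w in weights]
    return scale, lo, codes

def dequantize_block(scale, lo, codes):
    """Rebuild approximate floats from the stored codes."""
    return [lo + scale * c for c in codes]

random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(BLOCK)]  # stand-in weights
scale, lo, codes = quantize_block(block)
restored = dequantize_block(scale, lo, codes)
max_err = max(abs(a - b) for a, b in zip(block, restored))

# 128 bytes of packed codes + fp16 scale + fp16 min
payload_bytes = BLOCK * 4 // 8 + 2 + 2
bits_per_weight = payload_bytes * 8 / BLOCK

print(f"max error {max_err:.6f} (bound: scale/2 = {scale / 2:.6f})")
print(f"{payload_bytes} bytes per block, {bits_per_weight:.3f} bits/weight")
```

The rounding error per weight is bounded by half the block's scale, which is why a block with one large outlier quantizes worse than a block of uniformly small values. This simplified layout comes to 4.125 bits per weight; real formats store extra per-block metadata, which is part of why the table lists higher figures.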
A common surprise: an 8GB GPU loads a 7B Q4 model (~4GB of weights) and then OOMs once the conversation grows. That’s the KV cache — every token in context stores key+value vectors for every attention layer, and it scales linearly with context length.
KV_bytes = 2 × num_layers × num_kv_heads × head_dim
× context_tokens × precision_bytes
Llama 3.1 8B (32 layers, 8 KV heads, head_dim 128, fp16):
2 × 32 × 8 × 128 × ctx × 2 = 131,072 bytes per token
At 8K ctx → 1.07 GB
At 32K ctx → 4.29 GB
At 128K ctx → 17.18 GB ← exceeds a 16GB GPU alone

| Component | 8K ctx | 32K ctx | 128K ctx |
|---|---|---|---|
| Weights | 4.7 GB | 4.7 GB | 4.7 GB |
| Activations | ~0.5 GB | ~0.5 GB | ~0.5 GB |
| KV cache | 1.07 GB | 4.29 GB | 17.18 GB |
| Framework overhead | ~0.5 GB | ~0.5 GB | ~0.5 GB |
| TOTAL | ≈ 6.8 GB | ≈ 10.0 GB | ≈ 22.9 GB |
Operator takeaway: when /will-it-run flags your rig as “fits at 8K but not 32K,” the KV cache is what changed, not the weights. vLLM’s paged attention allocates the KV cache in fixed-size pages instead of one contiguous buffer per request, which cuts wasted memory when many requests share a GPU; that is a large part of why multi-user serving tends to use vLLM while a single-user laptop is well served by llama.cpp.
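The KV cache arithmetic above can be reproduced with a short helper. A sketch, assuming the Llama 3.1 8B shape quoted above and the table's rough figures for weights, activations, and framework overhead; the function names are hypothetical, and sizes are in decimal GB.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   context_tokens, precision_bytes=2):
    """Keys + values (the leading 2) for every layer and every token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * context_tokens * precision_bytes)

# Shape quoted in the text for Llama 3.1 8B
LLAMA_31_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

def total_gb(context_tokens, weights_gb=4.7,
             activations_gb=0.5, overhead_gb=0.5):
    """Rough VRAM total; the non-KV terms are the table's estimates."""
    kv_gb = kv_cache_bytes(context_tokens=context_tokens, **LLAMA_31_8B) / 1e9
    return weights_gb + activations_gb + kv_gb + overhead_gb

for ctx in (8 * 1024, 32 * 1024, 128 * 1024):
    kv_gb = kv_cache_bytes(context_tokens=ctx, **LLAMA_31_8B) / 1e9
    print(f"{ctx:>7} tokens: KV {kv_gb:5.2f} GB, total ~{total_gb(ctx):4.1f} GB")
```

Passing `precision_bytes=1` shows what an 8-bit KV cache would save, on runtimes that support one.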
A tokenizer’s vocabulary determines how efficiently your model reads each language. BPE trains its merges on the corpus, so a tokenizer trained mostly on English splits Chinese into many more tokens than a multilingual tokenizer.
| Language | Llama 3.x | Qwen 3 | Gemma 3 |
|---|---|---|---|
| English | ~250 | ~245 | ~248 |
| Spanish | ~280 | ~265 | ~270 |
| Code (Python) | ~210 | ~200 | ~205 |
| Chinese | ~750 | ~250 | ~265 |
| Yoruba | ~620 | ~340 | ~310 |
Start:      [r] [u] [n] [n] [i] [n] [g]
Merge 1:    [r] [u] [n] [n] [in] [g]      'in' is frequent
Merge 2:    [r] [u] [n] [n] [ing]         'ing' is frequent
Merge 3:    [r] [u] [n] [ning]
Merges 4-6: [running]                     one token = seven characters
Operator takeaway: more tokens per word = slower inference, more VRAM, and worse quality (the model spends attention budget on the same idea). That’s why Qwen 3 (119 training languages) handles many low-resource languages more efficiently than Llama 3.x-family tokenizers, and why our /prompting hub flags chat-template differences per family.
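The merge walk-through above can be reproduced with a toy BPE trainer. This is a minimal sketch on a made-up three-word corpus, not any production tokenizer; the function names and the tie-breaking rule are assumptions for illustration, so the order of later merges may differ from the walk-through even though the word still ends up as a single token.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(word_counts, num_merges):
    """Learn merges by repeatedly joining the most frequent adjacent pair."""
    words = {word: list(word) for word in word_counts}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, symbols in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += word_counts[word]
        if not pairs:
            break
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = {w: merge_pair(s, best) for w, s in words.items()}
    return merges

def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

corpus = {"running": 5, "sing": 3, "in": 4}  # hypothetical word counts
merges = train_bpe(corpus, num_merges=6)
print(merges[:2])
print(encode("running", merges), encode("sing", merges))
```

A word the corpus saw often collapses to one token, while a rarer word stays split. The same effect, at scale, is why a tokenizer trained mostly on English needs many more tokens for Chinese or Yoruba.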
The Chinchilla paper (DeepMind, 2022) showed that for a fixed compute budget, smaller-model-more-data beats bigger-model-less-data. Rule of thumb: ~20 tokens of training data per parameter for compute-optimal training. Modern releases blow past that ratio.
| Model | Date | Params | Train tokens | Tokens/param |
|---|---|---|---|---|
| GPT-3 | 2020-06 | 175B | 300B | 1.7 |
| Chinchilla | 2022-03 | 70B | 1.4T | 20.0 |
| Llama 2 70B | 2023-07 | 70B | 2.0T | 28.6 |
| Llama 3 8B | 2024-04 | 8B | 15T | 1,875 |
| Llama 3.3 70B | 2024-12 | 70B | 15T | 214 |
Operator takeaway: at a fixed size, each generation of small models gets more capable because it sees far more training data per parameter. A modern 8B can now beat much larger older releases on the practical tasks people actually run locally. If you’re VRAM-constrained, this is the best news local AI has. We track new releases on /models/new with each row’s tokens-per-param surfaced so you can spot the over-trained underdogs.
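The tokens-per-parameter column is a single division, which makes it easy to compute for any new release. A small sketch using the figures from the table above; the helper name and the `CHINCHILLA_RATIO` constant are labels chosen here for illustration.

```python
CHINCHILLA_RATIO = 20  # rule-of-thumb tokens per parameter from the text

# (parameters, training tokens) as listed in the table above
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "Llama 2 70B": (70e9, 2.0e12),
    "Llama 3 8B": (8e9, 15e12),
    "Llama 3.3 70B": (70e9, 15e12),
}

def tokens_per_param(params, train_tokens):
    """Training tokens seen per model parameter."""
    return train_tokens / params

for name, (params, tokens) in MODELS.items():
    ratio = tokens_per_param(params, tokens)
    print(f"{name:14s} {ratio:8.1f} tokens/param "
          f"({ratio / CHINCHILLA_RATIO:5.1f}x the rule of thumb)")
```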
A base model predicts the next token, period. An instruct model has been fine-tuned in two stages: supervised fine-tuning on curated examples, then preference tuning via RLHF (reward model + RL) or DPO (a direct optimisation shortcut).
Prompt: "What is 12 × 17?"

Llama 3 8B (base):
  "What is 12 × 17? What is 13 × 17? What is 14 × 17? ..."
  (continues the apparent pattern: pure next-token prediction)

Llama 3 8B Instruct:
  "12 × 17 = 204. Computing: 12 × 17 = 12 × (10 + 7) = 120 + 84 = 204."
  (follows the implicit instruction to answer)
L_DPO = -log σ( β · log π(y_w|x)/π_ref(y_w|x)
- β · log π(y_l|x)/π_ref(y_l|x) )
where:
π = the model being optimized
π_ref = a frozen reference (usually the SFT model)
y_w = preferred response (winner)
y_l = dispreferred response (loser)
β = a tuning constant, typically 0.1–0.5
In English: push the model toward chosen responses and away
from rejected ones, while staying close to the reference.

Operator takeaway: the preference-tuning recipe shapes a model’s quirks more than its size does. DeepSeek R1 was tuned to reason between <think> tags, so a system prompt actually degrades its performance. Our /prompting kits surface these quirks per model so you don’t fight the training.
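The DPO loss above is small enough to evaluate by hand for a single preference pair. A sketch, assuming you already have summed log-probabilities of each response under the policy and the frozen reference; the function name and the example numbers are made up for illustration.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (winner, loser) pair, given summed log-probs."""
    # beta * [log pi(y_w|x)/pi_ref(y_w|x) - log pi(y_l|x)/pi_ref(y_l|x)]
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: no preference learned yet.
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))

# Policy has moved toward the winner and away from the loser.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss starts at log 2 when the policy equals the reference and falls as the policy assigns relatively more probability to the preferred response. A larger `beta` sharpens that response to the same log-probability gap.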
Perplexity (how well a model predicts the next token) was the original metric. It’s still useful for pre-training research; it’s nearly useless for choosing a model to use. Modern evaluation splits two ways: benchmark suites and human-preference ranking. Neither alone is enough.
| Lens | What it catches | What it misses | Use it for |
|---|---|---|---|
| Perplexity | Next-token fit | Instruction quality | Pre-training sanity checks |
| Task benchmarks | Math, coding, knowledge | Your exact workflow | Capability screening |
| Human preference | Conversation quality | Verbosity and style bias | Chat model shortlists |
| Local tok/s | Runtime reality | Model intelligence | Will-It-Run verdicts |
Operator takeaway: the same model can rank differently, and lead to a different conclusion, depending on which lens you pick. Use several. Our /benchmarks measures local inference speed (the last row of the table) because that’s our lane; we link to Arena and the Leaderboard for quality so you have both.
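Perplexity, the first lens in the table, is the exponential of the average negative log-probability per token. A minimal sketch, assuming you already have per-token natural-log probabilities from some runtime; the function name and the sample values are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-probability over the tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that gives every token probability 0.25 is as uncertain
# as a uniform choice among four options.
uniform_four = [math.log(0.25)] * 4
print(perplexity(uniform_four))

# More confident predictions give lower perplexity.
confident = [math.log(0.9), math.log(0.8), math.log(0.95)]
print(perplexity(confident))
```

The number says how well the model predicts text, which is why it suits pre-training sanity checks and says little about instruction quality.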
Use the framework page when you need a compact citation for local-AI fit: model quality is not enough; the useful answer combines VRAM fit, quantization, context, speed evidence, and cost.
Open framework →
Suggested citation: RunLocalAI, "The RunLocalAI Will-It-Run Framework," reviewed 2026-05-29, https://www.runlocalai.co/resources/will-it-run-framework.
Next link-health check: 2026-08-29 · flag bad links to /contact