BLK · LEARN · will-it-run learning hub

Learn local AI that will actually run

Start with the machine in front of you: will the model fit, what quant should you use, how fast will it run, and when is local cheaper than cloud? This is the syllabus, router, and evidence loop we use to answer those questions without guessing.

Last reviewed: 2026-05-29 · Next review by: 2026-08-29
Curated resources: 12 · Topical bridges: 6 · External cost: $0 · Affiliate links: 0
FIT · start from your memory budget

What should I try first?

This is a starter map, not a verdict. The real answer still depends on quantization, context length, runtime overhead, and what else is using memory. Use it to avoid bad first downloads, then verify the exact model in Will-It-Run.

Verify exact fit →
System RAM · CPU / no GPU · starter
Model class: 1B-3B instruct models
Settings: Q4, short context, patience required
Try a tiny Qwen, Gemma, or Phi-class model before a 7B download.

Entry GPU · 8GB VRAM · starter
Model class: 7B-8B at Q4
Settings: 4K-8K context; watch KV cache
Start with a recent 7B/8B chat model, then raise context only if memory stays stable.

Comfortable 7B · 12GB VRAM · starter
Model class: 7B-14B at Q4/Q5
Settings: 8K context is realistic; 32K needs care
Use 7B/8B for speed, 14B when answer quality matters more than latency.

Strong local rig · 16GB VRAM · starter
Model class: 14B at Q5, or a 32B only at around 3-bit or with partial CPU offload (32B Q4 weights alone are roughly 18-20 GB)
Settings: Keep headroom for context and apps
Benchmark both a fast 14B and a smaller-context 32B before choosing a daily driver.

Enthusiast GPU · 24GB VRAM · starter
Model class: 32B at Q4/Q5
Settings: Good balance of quality and usable speed
Use 32B as the serious local baseline; treat 70B as a tradeoff experiment.

Large local models · 32GB+ / unified · starter
Model class: 32B comfortably; 70B with tradeoffs
Settings: Leave OS and KV-cache headroom
Pick by task: coding, multilingual chat, long context, or agent serving.

ROUTER · pick the job

What are you here to do?

Choose path → learn → act
BLK · CURRICULUM · our own, written end to end

The written curriculum

The external syllabus explains the field. These are the RunLocalAI-native courses and task recipes that turn it into operator muscle memory.

Browse all recipes →

The shortest path, if you’re new

Don’t try to do the whole list. Do these three in order and you’ll know more about local AI than 95% of the people who use it daily.

STEP 01

Watch one Karpathy video

Pick “Let’s build GPT from scratch.” It runs about two hours, skips the math you don’t need, and gives you the intuition you do.

STEP 02

Pick something that fits

Use /will-it-run to find a model + GPU combo that actually fits your VRAM. Don’t guess — the math is the math.

STEP 03

Use a tested system prompt

Every model is pickier than people assume. /prompting has tested kits per model — copy one before your first real conversation.

TREE · original router

The Local-Inference Learning Tree

The decision tree we wish someone had handed us. Identify the layer blocking you, follow three free resources, then land on a RunLocalAI surface that closes the loop.

EVIDENCE · turn learning into data

Learn, run, measure, cite

A benchmark submission helps the next reader answer a concrete question: this model, on this GPU, with this runtime and quant, produced this many tokens per second. That is how the framework moves from fit estimates to evidence-backed verdicts.
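One way to picture such a submission is a small record plus a timing helper. The sketch below is illustrative only: `BenchmarkRecord`, its field names, `measure_tokens_per_second`, and `fake_generate` are assumptions invented for this example, not the RunLocalAI submission schema or any runtime's API.

```python
import time
from dataclasses import dataclass


@dataclass
class BenchmarkRecord:
    # Hypothetical fields mirroring "this model, on this GPU, with this runtime and quant".
    model: str
    gpu: str
    runtime: str
    quant: str
    tokens_per_second: float


def measure_tokens_per_second(generate, prompt, max_tokens):
    """Time one generation call; `generate` must return the number of tokens produced."""
    start = time.perf_counter()
    produced = generate(prompt, max_tokens)
    elapsed = time.perf_counter() - start
    return produced / elapsed


def fake_generate(prompt, max_tokens):
    # Hypothetical stand-in: replace with a call into your real runtime.
    time.sleep(0.01)
    return max_tokens


rate = measure_tokens_per_second(fake_generate, "hello", 64)
record = BenchmarkRecord("example-8b", "example-gpu", "example-runtime", "Q4_K_M", rate)
print(record)
```

In practice you would repeat the measurement several times and report prompt length and context size alongside the rate, since both move the number.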

OP · PATH · why this list exists

Operator path

RunLocalAI is built by Fredoline Eruo, an operator maintaining a local-AI hardware catalog, model library, fit framework, and measured benchmark table without a PhD or a research-lab budget. The list below is the syllabus that closed the gap between “I can install Ollama” and “I can defend every line on a /models page.” Each entry is hand-written. None of the links pay us, and none are affiliate links. If a resource is missing from this list, it’s because we don’t recommend it, not because nobody offered us a kickback.

Theory we trust

Twelve free resources, grouped. Every note ends with “Where it lands for us”: the specific decision on RunLocalAI that this resource sharpened.

01 · FOUNDATIONS
Neural Networks: Zero to Hero
Andrej Karpathy · 8 YouTube videos · ~15 hours total

Builds a neural net, then a tiny GPT, in plain Python from scratch. The single best free intuition for why a transformer does what it does, and once you’ve watched the weights matter you understand why quantization is lossy at all. Where it lands for us: picking Q4_K_M vs Q6_K on /will-it-run/custom stops being a guess.

karpathy.ai/zero-to-hero.html
Let's reproduce GPT-2 (124M)
Andrej Karpathy · single YouTube video · ~4 hours

The follow-up that trains GPT-2 end to end. The systems detail — mixed precision, kernel fusion, FlashAttention — is what separates “I read the paper” from “I have shipped this.” Where it lands for us: explains why a 24GB card runs a 13B model at Q4 but chokes once context grows — the KV cache math gets concrete.

www.youtube.com/watch?v=l8pRSuU81PU
Neural Networks + Attention
3Blue1Brown (Grant Sanderson) · YouTube playlist + attention deep-dive · ~3 hours

The visual companion. Watch when you want geometric intuition for attention or backprop’s chain rule without writing a line of code. Pair with Karpathy — don’t substitute. Where it lands for us: the attention visualisation is why our /benchmarks numbers swing so much with context length.

www.3blue1brown.com/lessons/neural-networks
CS324 — Large Language Models
Tatsu Hashimoto, Percy Liang · Stanford · free lecture notes · self-paced

The most comprehensive free written treatment of pre-training, scaling, RLHF, and evaluation we’ve found. Reads like a textbook chapter, not a blog post. Where it lands for us: when a vendor claims their 70B model “beats GPT-4,” CS324 gives you the eval-design vocabulary to tell whether the claim is honest or staged.

stanford-cs324.github.io/winter2022
02 · PRACTICAL
Hugging Face LLM Course
Hugging Face · free, hands-on notebooks · ~20 hours

The course we’d recommend to someone who’d rather build than watch. Notebooks for tokenization, fine-tuning, evaluation — less theory, more code that runs on a free Colab GPU. Where it lands for us: recreate a tokenizer in the chapter on BPE and finally see why Qwen handles Chinese gracefully and Llama 3.x stumbles.

huggingface.co/learn/llm-course
Practical Deep Learning for Coders
Jeremy Howard · fast.ai · free course · ~30 hours

The top-down approach: ship a state-of-the-art image model in lesson 1, derive what makes it work over the rest. Slightly dated on language models specifically, but the engineering instincts transfer wholesale. Where it lands for us: fast.ai’s “always try the obvious thing first” mindset is why our /benchmarks report real tokens/sec rather than synthetic FLOPs.

course.fast.ai
03 · SYSTEMS & INFERENCE
llama.cpp — README + Discussions
ggml-org team · README + GitHub Discussions · skim-as-needed

Not a course. The README is short; the Discussions tab is where the actual engineering tradeoffs play out — GGUF format changes, quantization formats, kernel tuning per chip. Where it lands for us: when a quantization format changes, this is where it’s argued first, and we cite it on /tools/llama-cpp.

github.com/ggml-org/llama.cpp
vLLM documentation — paged attention
vLLM project · Sphinx docs · 2-3 hours focused

Skip the install section, read the architecture pages. Paged attention is the most consequential inference optimisation of the last three years; understanding it teaches why long contexts are expensive even at small batch sizes. Where it lands for us: maps directly to why /hardware pages surface both memory bandwidth and capacity — both matter, for different reasons.

docs.vllm.ai
Quantization deep-dives
Tim Dettmers · blog · ~2 hours per post

Dettmers wrote bitsandbytes and did the original LLM.int8() work. The clearest posts on quantization tradeoffs anywhere for free. Start with his 4-bit piece. Where it lands for us: every variants table on a /models page shows Q4_K_M / Q6_K / Q8_0 — Dettmers explains why those specific grades exist and which to pick.

timdettmers.com
04 · ALIGNMENT & EVALUATION
RLHF & DPO from scratch
Sebastian Raschka · blog (Ahead of AI) · ~1 hour per article

The cleanest from-scratch explanations of RLHF and DPO outside the original papers. Working-engineer level, not researcher. Where it lands for us: explains why DeepSeek R1 explicitly recommends no system prompt — it’s a quirk of its preference-tuning setup, and we surface it as a kit caveat on /models/deepseek-r1.

magazine.sebastianraschka.com
Chatbot Arena + Open LLM Leaderboard
LMSYS · Hugging Face · live leaderboards · ~10 min orientation

Two complementary surfaces. Arena is crowdsourced human preference — hardest to game. The Open LLM Leaderboard is benchmark-based — easier to game, useful for capability cuts (math, code, reasoning). Where it lands for us: our /benchmarks page measures inference speed; Arena/Leaderboard cover model quality. Use both. Neither alone is enough.

arena.ai
Attention Is All You Need
Vaswani et al. (2017) · arXiv paper, 11 pages · 1 careful pass

The paper that started the transformer era. Surprisingly readable. Worth a careful read after Karpathy’s GPT-from-scratch video — the paper makes more sense once you’ve built the thing from the ground up. Where it lands for us: every architecture decision since 2017 references this paper; reading it once means the rest of the field stops feeling like jargon.

arxiv.org/abs/1706.03762
BRIDGES · where theory meets ops

Six topical deep-dives

BRIDGE · A

Why quantization (mostly) works

Weights are stored as 32-bit or 16-bit floats during training. At inference you can re-encode them to 4-bit or 5-bit integers with surprisingly little quality loss, because networks are over-parameterised and most weights cluster around small values that compress well. Q4_K_M stores most weights in super-blocks of 256, split into eight sub-blocks of 32 that each carry their own scale and minimum, which preserves the dynamic range that matters.

Quantization formats vs FP16 baseline (7B model, approximate)
Format    Bits/weight   File size   MMLU delta
FP16      16            14 GB       baseline
Q8_0      8.5           7.3 GB      −0.1%
Q6_K      6.6           5.7 GB      −0.5%
Q5_K_M    5.7           4.9 GB      −1.2%
Q4_K_M    4.8           4.2 GB      −2.1%
Q4_0      4.5           3.9 GB      −3.5%
Q3_K_M    3.9           3.2 GB      −5.8%
Q2_K      3.0           2.5 GB      −12%
Source: llama.cpp benchmarks · Dettmers analyses
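The file-size column follows from bits per weight: parameters × bits ÷ 8. A minimal sketch, assuming exactly 7.0 billion parameters (real "7B" models differ slightly, which is one reason the table values can be off by a tenth of a GB); `gguf_size_gb` is a name made up for this example.

```python
def gguf_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight-file size in decimal GB: parameters x bits / 8."""
    return params_billions * bits_per_weight / 8


# Bits-per-weight figures taken from the table; 7.0B parameters is an assumption.
for name, bpw in [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q4_0", 4.5)]:
    print(f"{name:7s} ~{gguf_size_gb(7, bpw):.1f} GB")
```

The same function gives a first estimate for other sizes, for example `gguf_size_gb(32, 4.8)` for a 32B model at Q4_K_M.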
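Q4_K block layout used by Q4_K_M (per 256 weights, 144 bytes = 4.5 bits/weight)
[ super scale ][ super min ][ 8 × (6-bit scale, 6-bit min) ][ 256 × 4-bit quantized weights ]
      fp16          fp16               12 bytes                        128 bytes
Q4_K_M averages about 4.8 bits/weight because selected tensors are kept at a higher-precision format.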
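To see why a scale and a minimum per block preserve dynamic range, here is a toy block quantizer in pure Python. It is a simplified sketch of the general idea, not the llama.cpp Q4_K code: the block size of 32, the float scale and minimum, and the function names are all assumptions for illustration.

```python
import random


def quantize_block(weights, bits=4):
    """Asymmetric quantization of one block: keep a scale, a minimum, and small integers."""
    levels = (1 << bits) - 1               # 15 steps for 4-bit codes 0..15
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0      # fall back to 1.0 for a constant block
    codes = [round((w - lo) / scale) for w in weights]
    return scale, lo, codes


def dequantize_block(scale, lo, codes):
    """Rebuild approximate weights from the stored scale, minimum, and integer codes."""
    return [lo + scale * c for c in codes]


random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(32)]   # small values, like typical weights
scale, lo, codes = quantize_block(block)
restored = dequantize_block(scale, lo, codes)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.5f}  max_abs_error={max_err:.5f}")
```

The worst-case error per weight is half a quantization step, so a block with a narrow value range gets a small step and a small error. That is the reason for scaling per block rather than per tensor.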

Operator takeaway: Q4_K_M is our default recommendation on /models for chat. For code-heavy or math-heavy use we bump to Q5_K_M or Q6_K because the accuracy delta there grows roughly 2× faster than on general MMLU.

BRIDGE · B

What VRAM actually holds

A common surprise: an 8GB GPU loads a 7B Q4 model (~4GB of weights) and then OOMs once the conversation grows. That’s the KV cache — every token in context stores key+value vectors for every attention layer, and it scales linearly with context length.

KV cache formula (per request)
KV_bytes = 2 × num_layers × num_kv_heads × head_dim
         × context_tokens × precision_bytes

Llama 3.1 8B (32 layers, 8 KV heads, head_dim 128, fp16):
  2 × 32 × 8 × 128 × 2  =  131,072 bytes per token (multiply by ctx for the full cache)

  At  8K ctx →  1.07 GB
  At 32K ctx →  4.29 GB
  At 128K ctx → 17.18 GB  ← exceeds a 16GB GPU alone
VRAM accounting · Llama 3.1 8B · Q4_K_M
Component            8K ctx      32K ctx     128K ctx
Weights              4.7 GB      4.7 GB      4.7 GB
Activations          ~0.5 GB     ~0.5 GB     ~0.5 GB
KV cache             1.07 GB     4.29 GB     17.18 GB
Framework overhead   ~0.5 GB     ~0.5 GB     ~0.5 GB
TOTAL                ≈ 6.8 GB    ≈ 10.0 GB   ≈ 22.9 GB
Source: derived from Llama 3.1 8B config + llama.cpp defaults
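The formula and the accounting table can be reproduced in a few lines. This is a sketch under the page's own assumptions (fp16 cache, ~0.5 GB each for activations and framework overhead, decimal GB); the function names are made up for the example.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens, precision_bytes=2):
    """Keys + values (the leading 2) for every layer, KV head, and token in context."""
    return 2 * num_layers * num_kv_heads * head_dim * context_tokens * precision_bytes


def total_vram_gb(weights_gb, context_tokens, activations_gb=0.5, overhead_gb=0.5):
    """Weights + activations + KV cache + overhead for the Llama 3.1 8B shape used above."""
    kv_gb = kv_cache_bytes(32, 8, 128, context_tokens) / 1e9
    return weights_gb + activations_gb + kv_gb + overhead_gb


for ctx in (8 * 1024, 32 * 1024, 128 * 1024):
    kv_gb = kv_cache_bytes(32, 8, 128, ctx) / 1e9
    print(f"{ctx:>7} tokens: KV {kv_gb:.2f} GB, total ~{total_vram_gb(4.7, ctx):.1f} GB")
```

Swapping in another model's layer count, KV-head count, and head dimension gives the same estimate for that model. Runtimes that quantize the KV cache would use a smaller `precision_bytes`.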

Operator takeaway: when /will-it-run flags your rig as “fits at 8K but not 32K,” the KV cache is what changed, not the weights. vLLM’s paged attention is the engineering trick that makes long-context, multi-request serving viable; it is a large part of why production serving often runs on vLLM while a laptop typically runs llama.cpp.

BRIDGE · C

Tokenizers → multilingual quality

A tokenizer’s vocabulary determines how efficiently your model reads each language. BPE trains its merges on the corpus, so a tokenizer trained mostly on English splits Chinese into many more tokens than a multilingual tokenizer.

Tokens per 1,000 characters (approximate)
Language        Llama 3.x   Qwen 3   Gemma 3
English         ~250        ~245     ~248
Spanish         ~280        ~265     ~270
Code (Python)   ~210        ~200     ~205
Chinese         ~750        ~250     ~265
Yoruba          ~620        ~340     ~310
Source: approximate; varies with content. Lower = better
BPE merge example: building one token from seven characters
Start:    [r] [u] [n] [n] [i] [n] [g]
Merge 1:  [r] [u] [n] [n] [in] [g]     'in' is frequent
Merge 2:  [r] [u] [n] [n] [ing]        'ing' is frequent
Merge 3:  [r] [u] [n] [ning]
Merge 4:  [r] [un] [ning]
Merge 5:  [run] [ning]
Merge 6:  [running]                    one token = seven characters
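Each BPE merge joins exactly one adjacent pair, so collapsing seven characters into a single token takes six merges. The sketch below applies a merge list in priority order; the specific merge order is hypothetical and chosen to illustrate the mechanism, not taken from any real tokenizer's vocabulary.

```python
def apply_merges(symbols, merges):
    """Apply BPE merge rules in priority order; each rule joins one adjacent pair."""
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)   # join the pair into one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical learned merges, highest priority first.
merges = [("i", "n"), ("in", "g"), ("n", "ing"), ("u", "n"), ("r", "un"), ("run", "ning")]
tokens = apply_merges(list("running"), merges)
print(tokens)
```

A tokenizer whose training corpus rarely contained a script never learns the later, longer merges for it, so text in that script stays split into many short tokens.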

Operator takeaway: more tokens per word means slower inference, more VRAM for the KV cache, and often worse quality (the model spends more of its context and attention budget expressing the same idea). That’s why Qwen 3 (119 training languages) handles many low-resource languages more efficiently than Llama 3.x-family tokenizers, and why our /prompting hub flags chat-template differences per family.

BRIDGE · D

Chinchilla → “should I wait for a smaller model?”

The Chinchilla paper (DeepMind, 2022) showed that for a fixed compute budget, smaller-model-more-data beats bigger-model-less-data. Rule of thumb: ~20 tokens of training data per parameter for compute-optimal training. Modern releases blow past that ratio.

Tokens per parameter over time (trend matters more than exact values)
Model           Date      Params   Train tokens   Tokens/param
GPT-3           2020-06   175B     300B           1.7
Chinchilla      2022-03   70B      1.4T           20.0
Llama 2 70B     2023-07   70B      2.0T           28.6
Llama 3 8B      2024-04   8B       15T            1,875
Llama 3.3 70B   2024-12   70B      15T            214
Source: vendor announcements + model cards
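The last column is plain division, and comparing it to the ~20 tokens-per-parameter rule of thumb shows how far past compute-optimal recent releases go. A small sketch using the table's own figures; the helper name and the `CHINCHILLA_RATIO` constant are labels chosen for this example.

```python
CHINCHILLA_RATIO = 20  # rule-of-thumb tokens per parameter quoted above


def tokens_per_param(train_tokens: float, params: float) -> float:
    """Training tokens divided by parameter count."""
    return train_tokens / params


# (training tokens, parameters) copied from the table.
models = {
    "GPT-3": (300e9, 175e9),
    "Chinchilla": (1.4e12, 70e9),
    "Llama 2 70B": (2.0e12, 70e9),
    "Llama 3 8B": (15e12, 8e9),
    "Llama 3.3 70B": (15e12, 70e9),
}

for name, (tokens, params) in models.items():
    ratio = tokens_per_param(tokens, params)
    print(f"{name:14s} {ratio:8.1f} tokens/param  ({ratio / CHINCHILLA_RATIO:.1f}x the rule of thumb)")
```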

Operator takeaway: small models keep improving because each release trains them on far more tokens per parameter. A modern 8B can now beat much larger older releases on the practical tasks people actually run locally. If you’re VRAM-constrained, this is the best news local AI has. We track new releases on /models/new with each row’s tokens-per-param surfaced so you can spot the over-trained underdogs.

BRIDGE · E

Why instruct ≠ base

A base model predicts the next token, period. An instruct model has been fine-tuned in two stages: supervised fine-tuning on curated examples, then preference tuning via RLHF (reward model + RL) or DPO (a direct optimisation shortcut).

Same prompt, base vs instruct
Prompt: "What is 12 × 17?"

Llama 3 8B (base):
  "What is 12 × 17? What is 13 × 17? What is 14 × 17? ..."
  (continues the apparent pattern — pure next-token prediction)

Llama 3 8B Instruct:
  "12 × 17 = 204. Computing: 12 × 17 = 12 × (10 + 7)
   = 120 + 84 = 204."
  (follows the implicit instruction to answer)
The DPO loss (a single elegant equation)
L_DPO = -log σ( β · log π(y_w|x)/π_ref(y_w|x)
                 - β · log π(y_l|x)/π_ref(y_l|x) )

where:
  π     = the model being optimized
  π_ref = a frozen reference (usually the SFT model)
  y_w   = preferred response  (winner)
  y_l   = dispreferred response (loser)
  β     = a tuning constant, typically 0.1–0.5

In English: push the model toward chosen responses and away
from rejected ones — while staying close to the reference.
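For a single preference pair the loss is a few lines of arithmetic. This is a minimal sketch assuming you already have the summed log-probability of each response under the model and under the frozen reference; the function and argument names are made up for the example, and a real trainer would average over a batch.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (winner, loser) pair from sequence log-probabilities."""
    # beta * [log pi(y_w|x)/pi_ref(y_w|x) - log pi(y_l|x)/pi_ref(y_l|x)]
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed in a numerically stable form
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


# Model identical to the reference: margin is 0, loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Model has moved toward the winner and away from the loser: loss drops below log(2).
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

Because only the difference from the reference enters the margin, the model is rewarded for changing its preferences relative to the SFT starting point, not for raw likelihood.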

Operator takeaway: the preference-tuning recipe shapes a model’s quirks more than its size does. DeepSeek R1 was tuned to reason between <think> tags, and its usage guidance recommends avoiding a system prompt and putting all instructions in the user turn. Our /prompting kits surface these quirks per model so you don’t fight the training.

BRIDGE · F

Why “vibes-check” beats perplexity

Perplexity (how well a model predicts the next token) was the original metric. It’s still useful for pre-training research; it’s nearly useless for choosing a model to use. Modern evaluation splits two ways: benchmark suites and human-preference ranking. Neither alone is enough.
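Concretely, perplexity is the exponential of the average negative log-likelihood the model assigns to the actual next tokens. A minimal sketch, assuming you already have per-token natural-log probabilities from some runtime; `perplexity` is a helper written for this example.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that gives every correct token probability 0.25 is as uncertain as a
# uniform four-way guess, so its perplexity comes out at approximately 4.
uniform_guess = [math.log(0.25)] * 8
print(perplexity(uniform_guess))
```

Nothing in this number looks at whether the model followed an instruction or solved the task, which is why it says little about which chat model to pick.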

Four eval lenses (they answer different questions)
Lens               What it catches           What it misses             Use it for
Perplexity         Next-token fit            Instruction quality        Pre-training sanity checks
Task benchmarks    Math, coding, knowledge   Your exact workflow        Capability screening
Human preference   Conversation quality      Verbosity and style bias   Chat model shortlists
Local tok/s        Runtime reality           Model intelligence         Will-It-Run verdicts
Source: methodology summary; cite source leaderboards for current numeric scores

Operator takeaway: the same model can rank differently, and lead you to a different conclusion, depending on which lens you pick. Use more than one. Our /benchmarks measures local inference speed (the last row of the table) because that’s our lane; we link to Arena and the Leaderboard for quality so you have both.

PATHS · by goal

Reading paths

“I want to understand what’s happening when I prompt Llama”
  1. Karpathy: build GPT
  2. 3Blue1Brown: attention
  3. Raschka: RLHF/DPO
  4. Use a local LLM prompt kit
“I want to actually build or fine-tune something”
  1. HF LLM Course
  2. fast.ai
  3. vLLM docs
  4. Choose a local AI runtime
NEXT · where to go after this page

Next tracks we recommend

  • Multimodal vision / audio: the Hugging Face diffusion course is the cleanest free start; then use our tools directory to choose a local runtime.
  • Production serving at scale: vLLM docs above are the floor; the SkyPilot blog and Anyscale’s docs cover the ceiling.
  • Agents and tool calling: still chaotic; the OpenAI cookbook and Anthropic’s docs are the most stable references. We surface per-model tool-call formats in /prompting kits.
  • The math of training: Karpathy covers it well enough that we never had to leave his videos. If you do, the back half of CS324 has the references; then come back to /benchmarks and test whether the local model is fast enough.
CITE · reference asset

Cite the Will-It-Run learning framework

Use the framework page when you need a compact citation for local-AI fit: model quality is not enough; the useful answer combines VRAM fit, quantization, context, speed evidence, and cost.

Open framework →
Title: The RunLocalAI Will-It-Run Framework
Author: Fredoline Eruo
Reviewed: 2026-05-29

Suggested citation: RunLocalAI, "The RunLocalAI Will-It-Run Framework," reviewed 2026-05-29, https://www.runlocalai.co/resources/will-it-run-framework.

LOG · page history

Changelog

  1. 2026-05-29 · v1.2: promoted Learn from reading list to learning hub: added memory-first fit paths, intent router, responsive learning tree, evidence flywheel, framework citation, stronger Will-It-Run SEO, and accessibility hardening.
  2. 2026-05-23 · v1.1: added structured data (Course + ItemList + BreadcrumbList), expanded all 6 topical bridges with tables and worked examples, added the Local-Inference Learning Tree, added changelog + Last Reviewed.
  3. 2026-05-23 · v1.0: page launched with 12 curated resources, 6 topical bridges, 3-step shortest path, operator-path callout, reading paths by goal.

Next link-health check: 2026-08-29 · flag bad links to /contact