BLK · LEARN · will-it-run learning hub

Learn local AI that will actually run

Start with the machine in front of you: will the model fit, what quant should you use, how fast will it run, and when is local cheaper than cloud? This is the syllabus, router, and evidence loop we use to answer those questions without guessing.

Check what runs · Pick by memory · First 10 minutes

Last reviewed: 2026-05-29 · Next review by: 2026-08-29

FIT · start from your memory budget

What should I try first?

This is a starter map, not a verdict. The real answer still depends on quantization, context length, runtime overhead, and what else is using memory. Use it to avoid bad first downloads, then verify the exact model in Will-It-Run.

Verify exact fit →

System RAM

CPU / no GPU

starter

Model class: 1B-3B instruct models
Settings: Q4, short context, patience required

Try a tiny Qwen, Gemma, or Phi-class model before a 7B download.

Entry GPU

8GB VRAM

starter

Model class: 7B-8B at Q4
Settings: 4K-8K context; watch KV cache

Start with a recent 7B/8B chat model, then raise context only if memory stays stable.

Comfortable 7B

12GB VRAM

starter

Model class: 7B-14B at Q4/Q5
Settings: 8K context is realistic; 32K needs care

Use 7B/8B for speed, 14B when answer quality matters more than latency.

Strong local rig

16GB VRAM

starter

Model class: 14B at Q5 or some 32B at Q4
Settings: Keep headroom for context and apps

Benchmark both a fast 14B and a smaller-context 32B before choosing a daily driver.

Enthusiast GPU

24GB VRAM

starter

Model class: 32B at Q4/Q5
Settings: Good balance of quality and usable speed

Use 32B as the serious local baseline; treat 70B as a tradeoff experiment.

Large local models

32GB+ / unified

starter

Model class: 32B comfortably; 70B with tradeoffs
Settings: Leave OS and KV-cache headroom

Pick by task: coding, multilingual chat, long context, or agent serving.
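The starter map above can be read as a simple lookup. The sketch below copies the tiers from this page into a table and returns the largest tier whose floor fits a given VRAM figure. Two things are assumptions of this sketch, not rules from the page: anything under 8 GB falls back to the CPU tier, and the "32GB+ / unified" tier starts at exactly 32 GB. The function name `starter_tier` is hypothetical.

```python
from bisect import bisect_right

# (minimum VRAM in GB, model class, settings), copied from the starter map above.
TIERS = [
    (8, "7B-8B at Q4", "4K-8K context; watch KV cache"),
    (12, "7B-14B at Q4/Q5", "8K context is realistic; 32K needs care"),
    (16, "14B at Q5 or some 32B at Q4", "Keep headroom for context and apps"),
    (24, "32B at Q4/Q5", "Good balance of quality and usable speed"),
    (32, "32B comfortably; 70B with tradeoffs", "Leave OS and KV-cache headroom"),
]
# Assumption of this sketch: below 8 GB we fall back to the CPU / no GPU tier.
CPU_TIER = ("1B-3B instruct models", "Q4, short context, patience required")


def starter_tier(vram_gb: float) -> tuple[str, str]:
    """Return (model class, settings) for the largest tier that fits vram_gb."""
    floors = [floor for floor, _, _ in TIERS]
    index = bisect_right(floors, vram_gb)  # number of tiers whose floor <= vram_gb
    if index == 0:
        return CPU_TIER
    _, model_class, settings = TIERS[index - 1]
    return model_class, settings


if __name__ == "__main__":
    for gb in (6, 8, 12, 24, 48):
        print(gb, "GB ->", starter_tier(gb))
```

Like the map itself, this is a first guess to verify, not a verdict: it ignores context length and whatever else is using memory.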

ROUTER · pick the job

What are you here to do?

Choose path → learn → act

FIRST RUN

Run my first local model

Install a runtime, pull a small model, confirm it answers locally, then know what to fix next.

Start the 10 min recipe →

WILL IT RUN

Find what fits my computer

Check model size, quantization, context, and VRAM before you waste a download.

Run a fit check →

MODEL PICK

Choose the right model

Compare model families, sizes, prompts, and evidence instead of chasing the largest parameter count.

Browse models →

HARDWARE

Pick or upgrade a GPU

Use measured local-AI suitability, VRAM headroom, and cost tradeoffs before buying hardware.

Open the hardware leaderboard →

BUILD

Build or serve an app

Move from demos into RAG, agents, serving, monitoring, and repeatable operator workflows.

Pick a course track →

BLK · CURRICULUM · our own, written end to end

The written curriculum

The external syllabus explains the field. These are the RunLocalAI-native courses and task recipes that turn it into operator muscle memory.

Browse all recipes →

COURSES · 60

Step-by-step courses

The long form. Multi-chapter tracks that take you from first install to running open-weight models with intent — foundations, builder, operator.

Browse courses →

HOW-TO · 352

Task recipes

The short form. One job done end to end — steps, verification, and the failures that actually bite — for a specific local-AI task.

Browse how-to guides →

The shortest path, if you’re new

Don’t try to do the whole list. Do these three in order and you’ll understand local AI better than most people who use it daily.

STEP 01

Watch one Karpathy video

Pick “Let’s build GPT from scratch.” About two hours. It skips the math you don’t need and gives you the intuition you do.

STEP 02

Pick something that fits

Use /will-it-run to find a model + GPU combo that actually fits your VRAM. Don’t guess — the math is the math.

STEP 03

Use a tested system prompt

Every model is pickier than people assume. /prompting has tested kits per model — copy one before your first real conversation.

TREE · original router

The Local-Inference Learning Tree

The decision tree we wish someone had handed us. Identify the layer blocking you, follow three free resources, then land on a RunLocalAI surface that closes the loop.

Branch A · 3+1

I do not understand how it thinks

1. Karpathy
2. 3Blue1Brown
3. Raschka

Use a prompt kit →

Branch B · 3+1

I do not know what hardware to buy

1. Dettmers
2. vLLM docs
3. CS324 systems

Check model fit →

Branch C · 3+1

I want to deploy or serve this

1. HF LLM Course
2. fast.ai
3. vLLM docs

Compare runtimes →

Branch D · 3+1

I want to evaluate models myself

1. LMArena
2. Open LLM Leaderboard
3. CS324 evals

Read benchmark evidence →

EVIDENCE · turn learning into data

Learn, run, measure, cite

A benchmark submission helps the next reader answer a concrete question: this model, on this GPU, with this runtime and quant, produced this many tokens per second. That is how the framework moves from fit estimates to evidence-backed verdicts.

Submit your benchmark

Help fill missing model x GPU x runtime cells with reproducible numbers.

Open →

Read the methodology

Understand confidence labels before citing a number.

Open →

Use the public API

Pull catalog and benchmark data into your own tools.

Open →

Compare local vs cloud

Translate learning into monthly cost and break-even math.

Open →
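The break-even math mentioned above can be sketched in a few lines. This is a minimal illustration, not the site's calculator: the function name, its parameters, and every number in the example are hypothetical, and it only counts electricity as the running cost of local hardware.

```python
def break_even_months(hardware_cost: float,
                      cloud_cost_per_month: float,
                      watts: float,
                      hours_per_day: float,
                      price_per_kwh: float) -> float:
    """Months until buying hardware costs less than continuing to pay for cloud."""
    kwh_per_month = watts / 1000 * hours_per_day * 30
    local_cost_per_month = kwh_per_month * price_per_kwh
    monthly_saving = cloud_cost_per_month - local_cost_per_month
    if monthly_saving <= 0:
        return float("inf")  # local never pays for itself at these inputs
    return hardware_cost / monthly_saving


if __name__ == "__main__":
    # Illustrative inputs only: a 900 (any currency) GPU, 60/month of cloud spend,
    # 300 W under load for 4 hours a day, electricity at 0.15 per kWh.
    months = break_even_months(900, 60, 300, 4, 0.15)
    print(f"break-even after about {months:.1f} months")
```

If the cloud bill is smaller than the electricity bill, the function returns infinity, which is the honest answer for light usage.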

OP · PATH · why this list exists

Operator path

runlocalai is built by Eruo Fredoline — an operator maintaining a local-AI hardware catalog, model library, fit framework, and measured benchmark table without a PhD or a research-lab budget. The list below is the syllabus that closed the gap between “I can install Ollama” and “I can defend every line on a /models page.” Each entry is hand-written. None of the links pay us. None of them are affiliate links. If a resource is missing from this list, it’s because we don’t recommend it — not because nobody offered us a kickback.

Theory we trust

Twelve free resources, grouped. Every note ends with “Where it lands for us” — the specific decision on runlocalai that this resource sharpened.

01 · FOUNDATIONS

Neural Networks: Zero to Hero

Andrej Karpathy · 8 YouTube videos · ~15 hours total

Builds a neural net, then a tiny GPT, in plain Python from scratch. The single best free intuition for why a transformer does what it does, and once you’ve watched the weights matter you understand why quantization is lossy at all. Where it lands for us: picking Q4_K_M vs Q6_K on /will-it-run/custom stops being a guess.

karpathy.ai/zero-to-hero.html

Let's reproduce GPT-2 (124M)

Andrej Karpathy · single YouTube video · ~4 hours

The follow-up that trains GPT-2 end to end. The systems detail — mixed precision, kernel fusion, FlashAttention — is what separates “I read the paper” from “I have shipped this.” Where it lands for us: explains why a 24GB card runs a 13B model at Q4 but chokes once context grows — the KV cache math gets concrete.

www.youtube.com/watch?v=l8pRSuU81PU

Neural Networks + Attention

3Blue1Brown (Grant Sanderson) · YouTube playlist + attention deep-dive · ~3 hours

The visual companion. Watch when you want geometric intuition for attention or backprop’s chain rule without writing a line of code. Pair with Karpathy; don’t substitute. Where it lands for us: the attention visualisation explains why our /benchmarks numbers swing so much with context length.

www.3blue1brown.com/lessons/neural-networks

CS324 — Large Language Models

Tatsu Hashimoto, Percy Liang · Stanford · free lecture notes · self-paced

The most comprehensive free written treatment of pre-training, scaling, RLHF, and evaluation we’ve found. Reads like a textbook chapter, not a blog post. Where it lands for us: when a vendor claims their 70B model “beats GPT-4,” CS324 gives you the eval-design vocabulary to tell whether the claim is honest or staged.

stanford-cs324.github.io/winter2022

02 · PRACTICAL

Hugging Face LLM Course

Hugging Face · free, hands-on notebooks · ~20 hours

The course we’d recommend to someone who’d rather build than watch. Notebooks for tokenization, fine-tuning, evaluation — less theory, more code that runs on a free Colab GPU. Where it lands for us: recreate a tokenizer in the chapter on BPE and finally see why Qwen handles Chinese gracefully and Llama 3.x stumbles.

huggingface.co/learn/llm-course

Practical Deep Learning for Coders

Jeremy Howard · fast.ai · free course · ~30 hours

The top-down approach: ship a state-of-the-art image model in lesson 1, derive what makes it work over the rest. Slightly dated on language models specifically, but the engineering instincts transfer wholesale. Where it lands for us: fast.ai’s “always try the obvious thing first” mindset is why our /benchmarks report real tokens/sec rather than synthetic FLOPs.

course.fast.ai

03 · SYSTEMS & INFERENCE

llama.cpp — README + Discussions

ggml-org team · README + GitHub Discussions · skim-as-needed

Not a course. The README is short; the Discussions tab is where the actual engineering tradeoffs play out — GGUF format changes, quantization formats, kernel tuning per chip. Where it lands for us: when a quantization format changes, this is where it’s argued first, and we cite it on /tools/llama-cpp.

github.com/ggml-org/llama.cpp

vLLM documentation — paged attention

vLLM project · Sphinx docs · 2-3 hours focused

Skip the install section, read the architecture pages. Paged attention is the most consequential inference optimisation of the last three years; understanding it teaches why long contexts are expensive even at small batch sizes. Where it lands for us: maps directly to why /hardware pages surface both memory bandwidth and capacity — both matter, for different reasons.

docs.vllm.ai

Quantization deep-dives

Tim Dettmers · blog · ~2 hours per post

Dettmers wrote bitsandbytes and did the original LLM.int8() work. The clearest posts on quantization tradeoffs anywhere for free. Start with his 4-bit piece. Where it lands for us: every variants table on a /models page shows Q4_K_M / Q6_K / Q8_0 — Dettmers explains why those specific grades exist and which to pick.

timdettmers.com

04 · ALIGNMENT & EVALUATION

RLHF & DPO from scratch

Sebastian Raschka · blog (Ahead of AI) · ~1 hour per article

The cleanest from-scratch explanations of RLHF and DPO outside the original papers. Working-engineer level, not researcher. Where it lands for us: explains why DeepSeek R1 explicitly recommends no system prompt — it’s a quirk of its preference-tuning setup, and we surface it as a kit caveat on /models/deepseek-r1.

magazine.sebastianraschka.com

Chatbot Arena + Open LLM Leaderboard

LMSYS · Hugging Face · live leaderboards · ~10 min orientation

Two complementary surfaces. Arena is crowdsourced human preference — hardest to game. The Open LLM Leaderboard is benchmark-based — easier to game, useful for capability cuts (math, code, reasoning). Where it lands for us: our /benchmarks page measures inference speed; Arena/Leaderboard cover model quality. Use both. Neither alone is enough.

lmarena.ai

Attention Is All You Need

Vaswani et al. (2017) · arXiv paper, 11 pages · 1 careful pass

The paper that started the transformer era. Surprisingly readable. Worth a careful read after Karpathy’s GPT-from-scratch video — the paper makes more sense once you’ve built the thing from the ground up. Where it lands for us: every architecture decision since 2017 references this paper; reading it once means the rest of the field stops feeling like jargon.

arxiv.org/abs/1706.03762

BRIDGES · where theory meets ops

Six topical deep-dives

~120 words each

BRIDGE · A

Why quantization (mostly) works

Weights are stored as 32-bit or 16-bit floats during training. At inference you can re-encode them to 4-bit or 5-bit integers with surprisingly little quality loss: networks are over-parameterised, and most weights cluster around small values that compress well. Q4_K_M stores most weights in super-blocks of 256, split into eight sub-blocks of 32 that each carry their own 6-bit scale and minimum, with fp16 super-block scales on top, preserving the dynamic range that matters.

Quantization formats vs FP16 baseline (7B model, approximate)
Format	Bits/weight	File size	MMLU delta
FP16	16	14 GB	baseline
Q8_0	8.5	7.3 GB	−0.1%
Q6_K	6.6	5.7 GB	−0.5%
Q5_K_M	5.7	4.9 GB	−1.2%
Q4_K_M	4.8	4.2 GB	−2.1%
Q4_0	4.5	3.9 GB	−3.5%
Q3_K_M	3.9	3.2 GB	−5.8%
Q2_K	3.0	2.5 GB	−12%

Source: llama.cpp benchmarks · Dettmers analyses
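The file-size column follows from the bits-per-weight column. A minimal sketch, assuming a round 7 billion parameters and decimal gigabytes; the helper name `quantized_size_gb` is ours, and real files differ slightly because actual parameter counts are not round numbers and not every tensor uses the same format.

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate file size in GB (10^9 bytes): parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    for name, bpw in [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q4_0", 4.5)]:
        print(f"{name:7s} {quantized_size_gb(7, bpw):.1f} GB")
```

For these four formats the estimate lands within about 0.1 GB of the table, which is close enough to decide whether a download is worth starting.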

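Q4_K super-block layout (per 256 weights; Q4_K_M uses this type for most tensors)

[ scale d ][ min dmin ][ 8 sub-block scales + mins ][ 256 × 4-bit quantized weights ]
    fp16       fp16        6 bits each, 12 bytes                128 bytes

Total: 144 bytes per 256 weights = 4.5 bits/weight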

Operator takeaway: Q4_K_M is our default recommendation on /models for chat. For code-heavy or math-heavy use we bump to Q5_K_M or Q6_K because the accuracy delta there grows roughly 2× faster than on general MMLU.
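To make the "scale plus minimum per group" idea concrete, here is a simplified round trip in plain Python. It is a sketch of generic group-wise 4-bit quantization, not llama.cpp's Q4_K code: the group size of 32, the synthetic Gaussian weights, and the function names are all assumptions chosen for illustration.

```python
import random


def quantize_group(weights, bits=4):
    """Asymmetric quantization of one group: a scale, a minimum, and small ints."""
    levels = (1 << bits) - 1                 # 15 steps for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0        # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return scale, lo, codes


def dequantize_group(scale, lo, codes):
    """Rebuild approximate weights from the stored scale, minimum, and codes."""
    return [lo + scale * q for q in codes]


if __name__ == "__main__":
    random.seed(0)
    weights = [random.gauss(0.0, 0.02) for _ in range(256)]  # synthetic weights
    group_size = 32
    worst_error = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale, lo, codes = quantize_group(group)
        restored = dequantize_group(scale, lo, codes)
        error = max(abs(a - b) for a, b in zip(group, restored))
        worst_error = max(worst_error, error)
    print(f"worst absolute error: {worst_error:.5f}")
```

Because each group gets its own scale, the rounding error is bounded by half of that group's step size, which is why small groups preserve more detail at the cost of more metadata.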

BRIDGE · B

What VRAM actually holds

A common surprise: an 8GB GPU loads a 7B Q4 model (~4GB of weights) and then OOMs once the conversation grows. That’s the KV cache — every token in context stores key+value vectors for every attention layer, and it scales linearly with context length.

KV cache formula (per request)

KV_bytes = 2 × num_layers × num_kv_heads × head_dim
         × context_tokens × precision_bytes

Llama 3.1 8B (32 layers, 8 KV heads, head_dim 128, fp16):
  2 × 32 × 8 × 128 × 2  =  131,072 bytes per token

  At  8K ctx →  1.07 GB
  At 32K ctx →  4.29 GB
  At 128K ctx → 17.18 GB  ← exceeds a 16GB GPU alone
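The formula above translates directly into code. A minimal sketch using the Llama 3.1 8B figures quoted in this section; the function name is ours, and gigabytes are decimal (10^9 bytes) to match the numbers shown.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens, precision_bytes=2):
    """Keys and values (the factor of 2) for every layer, KV head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * context_tokens * precision_bytes


if __name__ == "__main__":
    # Llama 3.1 8B: 32 layers, 8 KV heads, head_dim 128, fp16 cache.
    for ctx in (8_192, 32_768, 131_072):
        gb = kv_cache_bytes(32, 8, 128, ctx) / 1e9
        print(f"{ctx:>7} tokens -> {gb:.2f} GB")
```

Swapping in another model's layer count, KV-head count, and head dimension from its config gives the same estimate for that model.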

VRAM accounting · Llama 3.1 8B · Q4_K_M
Component	8K ctx	32K ctx	128K ctx
Weights	4.7 GB	4.7 GB	4.7 GB
Activations	~0.5 GB	~0.5 GB	~0.5 GB
KV cache	1.07 GB	4.29 GB	17.18 GB
Framework overhead	~0.5 GB	~0.5 GB	~0.5 GB
TOTAL	≈ 6.8 GB	≈ 10.0 GB	≈ 22.9 GB

Source: derived from Llama 3.1 8B config + llama.cpp defaults
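The table can also be turned around: given a VRAM budget, how much context fits? The sketch below reuses the table's approximate figures (4.7 GB weights, about 0.5 GB each for activations and framework overhead, 131,072 KV bytes per token) as defaults. Those defaults are estimates from this page, and the function name is hypothetical.

```python
def max_context_tokens(vram_gb, weights_gb=4.7, activations_gb=0.5,
                       overhead_gb=0.5, kv_bytes_per_token=131_072):
    """Largest context whose KV cache fits in what is left after fixed costs."""
    free_bytes = (vram_gb - weights_gb - activations_gb - overhead_gb) * 1e9
    return max(0, int(free_bytes // kv_bytes_per_token))


if __name__ == "__main__":
    for vram in (8, 12, 16, 24):
        print(f"{vram:>2} GB -> about {max_context_tokens(vram):,} tokens of context")
```

With these defaults an 8 GB card comes out between 8K and 32K tokens, which matches the "fits at 8K but not 32K" pattern described in this section. Treat the result as an upper bound, since other applications also use VRAM.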

Operator takeaway: when /will-it-run flags your rig as “fits at 8K but not 32K,” the KV cache is what changed, not the weights. vLLM’s paged attention is the engineering trick that makes long-context, many-request serving viable; it’s one reason production serving stacks often run vLLM while laptops typically run llama.cpp.

BRIDGE · C

Tokenizers → multilingual quality

A tokenizer’s vocabulary determines how efficiently your model reads each language. BPE trains its merges on the corpus, so a tokenizer trained mostly on English splits Chinese into many more tokens than a multilingual tokenizer.

Tokens per 1,000 characters (approximate)
Language	Llama 3.x	Qwen 3	Gemma 3
English	~250	~245	~248
Spanish	~280	~265	~270
Code (Python)	~210	~200	~205
Chinese	~750	~250	~265
Yoruba	~620	~340	~310

Source: approximate; varies with content. Lower = better

BPE merge example: building one token from seven

Start:    [r] [u] [n] [n] [i] [n] [g]
Merge 1:  [r] [u] [n] [n] [in] [g]     'in' is frequent
Merge 2:  [r] [u] [n] [n] [ing]        'ing' is frequent
Merge 3:  [r] [u] [n] [ning]
Merge 4:  [ru] [n] [ning]              each merge joins one adjacent pair
Merge 5:  [run] [ning]
Merge 6:  [running]                    one token = seven characters

Operator takeaway: more tokens per word = slower inference, more VRAM, and worse quality (the model spends attention budget on the same idea). That’s why Qwen 3 (119 training languages) handles many low-resource languages more efficiently than Llama 3.x-family tokenizers, and why our /prompting hub flags chat-template differences per family.
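The merge procedure in the BPE example above can be sketched as a toy trainer: count adjacent symbol pairs across a word-frequency table, merge the most frequent pair, and repeat. This is a teaching sketch under simplifying assumptions (whole words as units, no byte-level fallback, ties broken by whichever pair Python's Counter returns first); it is not any model's production tokenizer, and the tiny corpus is made up.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most common adjacent symbol pair, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """corpus maps word -> count; returns (list of merges, final segmentation)."""
    words = {tuple(word): count for word, count in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:          # every word is already a single token
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


if __name__ == "__main__":
    merges, words = train_bpe({"running": 5, "singing": 3, "king": 2}, num_merges=4)
    print("merges:", merges)
    print("segmentation:", list(words))
```

A corpus dominated by one language yields merges that favour that language, which is the mechanism behind the token-count gaps in the table above.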

BRIDGE · D

Chinchilla → “should I wait for a smaller model?”

The Chinchilla paper (DeepMind, 2022) showed that for a fixed compute budget, smaller-model-more-data beats bigger-model-less-data. Rule of thumb: ~20 tokens of training data per parameter for compute-optimal training. Modern releases blow past that ratio.

Tokens per parameter over time (trend matters more than exact values)
Model	Date	Params	Train tokens	Tokens/param
GPT-3	2020-06	175B	300B	1.7
Chinchilla	2022-03	70B	1.4T	20.0
Llama 2 70B	2023-07	70B	2.0T	28.6
Llama 3 8B	2024-04	8B	15T	1,875
Llama 3.3 70B	2024-12	70B	15T	214

Source: vendor announcements + model cards
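The last column is simple division, and comparing it with the roughly 20 tokens-per-parameter rule of thumb shows how far past compute-optimal a release was trained. A small sketch using three rows from the table; the constant and function names are ours.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule of thumb quoted above


def tokens_per_param(train_tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return train_tokens / params


if __name__ == "__main__":
    models = {
        "GPT-3": (300e9, 175e9),
        "Chinchilla": (1.4e12, 70e9),
        "Llama 3 8B": (15e12, 8e9),
    }
    for name, (tokens, params) in models.items():
        ratio = tokens_per_param(tokens, params)
        multiple = ratio / CHINCHILLA_TOKENS_PER_PARAM
        print(f"{name:12s} {ratio:8.1f} tokens/param ({multiple:.1f}x the rule of thumb)")
```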

Operator takeaway: models of a given size keep improving because they are trained on far more tokens. A modern 8B can now beat much larger older releases on the practical tasks people actually run locally. If you’re VRAM-constrained, this is the best news local AI has. We track new releases on /models/new with each row’s tokens-per-param surfaced so you can spot the over-trained underdogs.

BRIDGE · E

Why instruct ≠ base

A base model predicts the next token, period. An instruct model has been fine-tuned in two stages: supervised fine-tuning on curated examples, then preference tuning via RLHF (reward model + RL) or DPO (a direct optimisation shortcut).

Same prompt, base vs instruct

Prompt: "What is 12 × 17?"

Llama 3 8B (base):
  "What is 12 × 17? What is 13 × 17? What is 14 × 17? ..."
  (continues the apparent pattern — pure next-token prediction)

Llama 3 8B Instruct:
  "12 × 17 = 204. Computing: 12 × 17 = 12 × (10 + 7)
   = 120 + 84 = 204."
  (follows the implicit instruction to answer)

The DPO loss (a single elegant equation)

L_DPO = -log σ( β · log π(y_w|x)/π_ref(y_w|x)
                 - β · log π(y_l|x)/π_ref(y_l|x) )

where:
  π     = the model being optimized
  π_ref = a frozen reference (usually the SFT model)
  y_w   = preferred response  (winner)
  y_l   = dispreferred response (loser)
  β     = a tuning constant, typically 0.1–0.5

In English: push the model toward chosen responses and away
from rejected ones — while staying close to the reference.
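The loss above is short enough to compute by hand for one preference pair. The sketch below takes summed sequence log-probabilities as plain floats; in real training these come from the policy and the frozen reference model, and the loss is averaged over a batch. The function name and the example numbers are illustrative assumptions.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities."""
    # beta * (log-ratio of the winner minus log-ratio of the loser)
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has moved toward the preferred answer: the loss drops.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The reference terms are what keep the model "close to the reference": only changes relative to the frozen model move the margin.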

Operator takeaway: the preference-tuning recipe shapes a model’s quirks more than its size does. DeepSeek R1 was tuned to reason between <think> tags, and its developers recommend putting all instructions in the user prompt rather than a system prompt. Our /prompting kits surface these quirks per model so you don’t fight the training.

BRIDGE · F

Why “vibes-check” beats perplexity

Perplexity (how well a model predicts the next token) was the original metric. It’s still useful for pre-training research; it’s nearly useless for choosing a model to use. Modern evaluation splits two ways: benchmark suites and human-preference ranking. Neither alone is enough.
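For reference, perplexity is the exponential of the average negative log-probability the model assigned to the tokens that actually came next. A minimal sketch with made-up probabilities; the function name is ours.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the true tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


if __name__ == "__main__":
    # A model that always gives the right token a 1-in-4 chance has perplexity 4.
    print(perplexity([0.25] * 8))
    # More confident correct predictions give lower perplexity.
    print(perplexity([0.9, 0.8, 0.95]))
```

Nothing in this number says whether the model follows instructions, which is why it works as a pre-training sanity check and not as a way to choose a chat model.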

Four eval lenses — they answer different questions
Lens	What it catches	What it misses	Use it for
Perplexity	Next-token fit	Instruction quality	Pre-training sanity checks
Task benchmarks	Math, coding, knowledge	Your exact workflow	Capability screening
Human preference	Conversation quality	Verbosity and style bias	Chat model shortlists
Local tok/s	Runtime reality	Model intelligence	Will-It-Run verdicts

Source: methodology summary; cite source leaderboards for current numeric scores

Operator takeaway: same model, different rank, different conclusion depending on which lens you pick. Use multiple. Our /benchmarks measures local inference speed (the last row) because that’s our lane; we link to Arena and the Leaderboard for quality so you have both.

PATHS · by goal

Reading paths

“I want to understand what’s happening when I prompt Llama”

Karpathy: build GPT →
3Blue1Brown: attention →
Raschka: RLHF/DPO →
Use a local LLM prompt kit

“I want to actually build or fine-tune something”

“I want confident hardware decisions”

Dettmers on quant →
vLLM paged attention →
CS324 systems lecture →
Rank GPUs for local AI

“I want to benchmark my own machine”

NEXT · where to go after this page

Next tracks we recommend

Multimodal vision / audio: the Hugging Face diffusion course is the cleanest free start; then use our tools directory to choose a local runtime.
Production serving at scale: vLLM docs above are the floor; the SkyPilot blog and Anyscale’s docs cover the ceiling.
Agents and tool calling: still chaotic; the OpenAI cookbook and Anthropic’s docs are the most stable references. We surface per-model tool-call formats in /prompting kits.
The math of training: Karpathy covers it well enough that we never had to leave his videos. If you do, the back half of CS324 has the references; then come back to /benchmarks and test whether the local model is fast enough.

CITE · reference asset

Cite the Will-It-Run learning framework

Use the framework page when you need a compact citation for local-AI fit: model quality is not enough; the useful answer combines VRAM fit, quantization, context, speed evidence, and cost.

Open framework →

Title: The RunLocalAI Will-It-Run Framework
Author: Eruo Fredoline
Reviewed: 2026-05-29
Canonical: https://www.runlocalai.co/resources/will-it-run-framework

Suggested citation: RunLocalAI, "The RunLocalAI Will-It-Run Framework," reviewed 2026-05-29, https://www.runlocalai.co/resources/will-it-run-framework.

LOG · page history

Changelog

2026-05-29 · v1.2 · promoted Learn from reading list to learning hub: added memory-first fit paths, intent router, responsive learning tree, evidence flywheel, framework citation, stronger Will-It-Run SEO, and accessibility hardening.
2026-05-23 · v1.1 · added structured data (Course + ItemList + BreadcrumbList), expanded all 6 topical bridges with tables and worked examples, added the Local-Inference Learning Tree, added changelog + Last Reviewed.
2026-05-23 · v1.0 · page launched: 12 curated resources, 6 topical bridges, 3-step shortest path, operator-path callout, reading paths by goal.

Next link-health check: 2026-08-29 · flag bad links to /contact