
Beginner: run your first local model

For: Anyone with a Mac or Windows laptop and zero local-AI experience.
By the end: A 7B-class model running locally, an OpenAI-compatible endpoint pointed at it, and the vocabulary to read benchmark pages.

By Fredoline Eruo · 7 milestones · Last reviewed 2026-05-07

You're on a Windows or Mac laptop, you've heard about local AI, and you don't have a 4090 sitting in your closet. That's fine — the goal of this path is to get a small, real model running on what you already own, then teach you the vocabulary you need to decide whether to invest in more hardware. By milestone seven you will have a 7B-class model running through an OpenAI-compatible endpoint, the ability to read a benchmark page without panic, and a working sense of which quantization to pick.

Install Ollama and load a 3B model

Ollama is the easiest first install. One binary, one command to pull a model, and a working REPL. On a 16GB laptop you can comfortably fit a 3B-class model like Llama 3.2 3B or Qwen 2.5 3B; these are not toys, and they handle simple summarization, classification, and small Q&A reliably.

Don't reach for the 70B model. Don't try Mixtral. The point of this milestone is to get a successful end-to-end run and confirm your machine is capable of running anything at all. If Ollama itself can't load a 3B model in Q4, the rest of the path will not save you.
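
A minimal first run, as a sketch (the model tag here is an assumption; check the Ollama library for the current names):

```bash
# Install Ollama from ollama.com, then pull and chat with a 3B model.
ollama pull llama3.2:3b   # ~2GB download, Q4-quantized by default
ollama run llama3.2:3b    # opens an interactive chat REPL
```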

When this is done you should have
Ollama running locally with at least one model pulled, responding to a chat prompt in under 2 seconds per token.

Read your first benchmark page honestly

You will spend the rest of your local-AI life reading benchmark pages, and you should learn to read them now. The two things to internalize: tokens-per-second is workload-dependent (prompt size, batch size, quantization, runtime), and a "B" in the model name is parameter count, not memory. A 7B model in FP16 needs ~14GB; in Q4 it needs ~4GB. Same model, totally different deployment story.
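
The arithmetic behind those numbers is worth doing once yourself. A back-of-envelope sketch (weights only; KV cache and runtime overhead come on top, and the Q4 bytes-per-parameter figure is an approximation):

```bash
# Weight memory = parameter count x bytes per parameter.
awk 'BEGIN {
  printf "7B FP16: %.1f GB\n", 7e9 * 2.00 / 1e9   # 16 bits per parameter
  printf "7B Q4:   %.1f GB\n", 7e9 * 0.56 / 1e9   # ~4.5 bits per parameter
}'
```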

Before continuing, browse three benchmark pages on /benchmarks and identify the runtime, the quantization, and the hardware on each. If you can't, re-read this milestone. Don't skip — every later milestone assumes you can.

When this is done you should have
The ability to look at a /run/[model]/on/[hardware] page and explain to a friend why a 7B Q4 number isn't comparable to a 7B FP16 number.

Move up to a 7B model and feel the cost

Now pull a 7B-class model: Llama 3.1 8B Instruct or Qwen 2.5 7B. Watch what happens when you ask it for a 500-word answer. On most laptops without a discrete GPU, you will see something between 5 and 15 tokens per second. That's slow enough to feel, fast enough to use for small focused tasks.
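
To get a number rather than a feeling, Ollama's --verbose flag prints timing stats after each reply (the model tag is again an assumption; check the library for current names):

```bash
ollama pull llama3.1:8b
ollama run llama3.1:8b --verbose   # prints eval rate in tokens/s after each response
```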

This is the milestone where most people decide whether local AI is for them. If 8 tok/s is fine for what you want to do (drafting, classification, code review on small files), you have your answer. If it isn't, you have learned a real fact about your hardware.

When this is done you should have
A 7B-class model loaded and answering. Tokens-per-second number written down. Memory pressure observed in your task manager / Activity Monitor.

Understand quantization without the math

Every model on Hugging Face ships in many sizes — Q4_K_M, Q5_K_M, Q6_K, Q8_0, FP16. These are quantization formats and they trade memory for accuracy. Don't memorize the formulas; memorize the heuristic: start at Q4_K_M, move up to Q5_K_M if outputs feel dumb, move up to Q8 if you have the memory and accuracy matters.

The VRAM calculator gives you the math when you need it. For now, the rule is: pick the largest quant that leaves you 2-4GB of headroom on top of the model itself.
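
As a sketch of what the heuristic means in numbers, here are approximate weight sizes for a 7B model; the bytes-per-parameter figures are rough estimates, not exact, which is why the calculator exists:

```bash
# Approximate weight sizes for a 7B model at common quant formats.
awk 'BEGIN {
  printf "Q4_K_M: %4.1f GB\n", 7e9 * 0.60 / 1e9   # ~4.8 bits per parameter
  printf "Q5_K_M: %4.1f GB\n", 7e9 * 0.71 / 1e9   # ~5.7 bits per parameter
  printf "Q8_0:   %4.1f GB\n", 7e9 * 1.06 / 1e9   # ~8.5 bits per parameter
  printf "FP16:   %4.1f GB\n", 7e9 * 2.00 / 1e9   # 16 bits per parameter
}'
```

On a 16GB machine, that puts Q8_0 near the practical ceiling for a 7B once you subtract the OS and the 2-4GB headroom.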

When this is done you should have
A working mental model: Q4_K_M is the default sweet spot, Q5_K_M is sharper at a cost, Q8 is for when accuracy matters more than memory.

Switch from Ollama to llama.cpp directly

Ollama is a wrapper around llama.cpp. Once you've gotten comfortable, learning to run llama.cpp directly buys you real control: you can pick exact build flags, point at quants Ollama doesn't ship, and read meaningful error messages when things break. This is the upgrade from "operator who can run a model" to "operator who can debug their stack."

The OpenAI-compatible HTTP server on port 8080 is the interface. Every editor extension, every agent, every framework speaks that protocol. You are now plug-compatible with the rest of the local-AI ecosystem.
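
A minimal sketch, assuming a CMake build and a GGUF file you've already downloaded (the model filename is illustrative; build flags vary by platform, so check the llama.cpp README for yours):

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Start the OpenAI-compatible server (8080 is llama-server's default port).
./build/bin/llama-server -m ./models/your-7b-instruct-Q4_K_M.gguf --port 8080

# Smoke-test from another terminal:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello in five words."}]}'
```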

When this is done you should have
A llama.cpp server running, OpenAI-compatible endpoint at port 8080, your old Ollama models still working.

Connect a real client to your local endpoint

Pick one front end. Open WebUI gives you a ChatGPT-style web app pointed at your local server. LM Studio gives you a polished desktop app with a model browser. Both work; pick the one that fits your habits. The point isn't the front end — it's that you've now run an end-to-end local AI stack and used it for something real.
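
If you pick Open WebUI, one common setup is to run it in Docker and point it at your llama.cpp server; the image name and environment variable below follow the Open WebUI docs, but verify them against the version you install:

```bash
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8080/v1 \
  -v open-webui:/app/backend/data \
  --name open-webui ghcr.io/open-webui/open-webui:main
# Then browse to http://localhost:3000 and chat through your local model.
```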

When this is done you should have
Open WebUI or LM Studio (or your editor) successfully talking to your local server. The model responds. Your stack works end-to-end.

Decide what's next, with eyes open

You've now run small models on what you have. You know what 8 tok/s feels like, you know which use cases your hardware can serve, and you know what the upgrade options are. If you want bigger models or higher throughput, the next path is hardware-shaped: a discrete GPU (the GPU chooser is the right next stop) or a higher-memory Mac.

Or: stop here. The 7B model on your laptop is a real tool for plenty of tasks. Knowing when to stop is its own operator skill.

When this is done you should have
A clear answer to the question 'do I need more hardware?' grounded in actual experience, not marketing copy.

Next recommended step

Now that you've felt your laptop's ceiling, the chooser walks you through picking a GPU based on the workload you've actually validated.