Running AI locally
~12 minute read · written for someone who has never installed a model
ChatGPT and Claude run on someone else's computer. Every message you send is a network round-trip to a datacenter, where the model reads your prompt, the company logs the conversation, and the response comes back. You pay a subscription or per-token API fee for this.
Local AI means the model file lives on your laptop or desktop, and the conversation never leaves your machine. The models you can run this way aren't as large as the ones behind the frontier APIs, but for everyday work — chat, code, summarization, translation — an open-weight model from 2025 onward gets you most of the way for free, in private, with no internet required after you download it.
This page explains how. The vocabulary is simple. The first working setup takes about five minutes once your computer is ready.
The minimum working setup
Install Ollama from ollama.com/download. It runs on Windows, macOS, and Linux and takes under a minute. Then open a terminal — "Terminal" on macOS and Linux, or "PowerShell" on Windows — and run:
```
$ ollama run llama3.1:8b
pulling manifest
pulling 8eeb52dfb3bb... 100% ▕████████████████▏ 4.7 GB
verifying sha256 digest
writing manifest
success
>>> Hello
Hello! I'm an AI assistant running on your computer. What can I help you with today?
>>>
```
The first run downloads the model file (~4.7 GB), which takes a few minutes on typical home internet. Subsequent runs start immediately. Type /bye to exit, /help for built-in commands. Run the same command again later to come back.
That is the entire installation. No accounts, no API keys, no configuration files. The model is a file on disk and Ollama is the program that reads it.
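A few other Ollama subcommands are worth knowing for housekeeping; this is a minimal sketch, and the small model tag below is just an example:

```
# List every model you have downloaded and how much disk each uses
$ ollama list

# Show which models are currently loaded in memory
$ ollama ps

# Download a model without starting a chat (useful before going offline)
$ ollama pull llama3.2:3b

# Delete a model you no longer want
$ ollama rm llama3.2:3b
```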
What is actually happening
An AI model is a single file containing several billion numbers, called weights. Those numbers were computed once during training, on a cluster of expensive datacenter GPUs, by reading a substantial fraction of the public internet. After training, the file is static. Running a model means feeding your prompt through those numbers in a particular sequence to predict the next word, then the next, until a complete response has been built.
The program that actually does that prediction work — read the file, do the matrix math, output text — is the runtime. Ollama is one of several. Others include llama.cpp (the engine inside Ollama, exposed directly), LM Studio (a graphical app), vLLM (production-grade serving), and MLX (Apple Silicon). Different runtimes trade off ease, speed, and hardware support, but the model file itself is portable across most of them.
The model is named by its parameter count. Llama 3.1 8B means the file contains roughly eight billion numbers. More parameters generally means better answers, larger file size, and more memory required at runtime. The 8B class is a useful default: it fits on consumer laptops and is competent at everyday work. The 70B class is closer to GPT-4 in quality but requires a serious GPU or a Mac Studio. Below 8B you reach a tier suitable for embedded and edge use but not general chat.
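In Ollama, the parameter count is part of the model tag, so moving between tiers is just a different tag. The tags below exist in the Ollama library at the time of writing, but check ollama.com/library for current names:

```
# Same family, different sizes — the part after the colon selects the tier
$ ollama run llama3.2:3b     # ~2 GB download, fits in 8 GB of RAM
$ ollama run llama3.1:8b     # ~4.7 GB, the default recommendation above
$ ollama run llama3.1:70b    # ~40 GB, needs a serious GPU or a big-memory Mac
```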
What hardware you need
The constraint is memory. The model file has to fit, with room for the conversation context on top. Quantization — storing the weights at lower precision — shrinks the file by 2–4× with a small quality cost.
| MODEL TIER | Q4 SIZE | MIN RAM | USABLE WITHOUT GPU? | EXAMPLE |
|---|---|---|---|---|
| 3B | ~2 GB | 8 GB | yes | Llama 3.2 3B |
| 7–8B | ~5 GB | 16 GB | tolerable | Llama 3.1 8B |
| 14B | ~9 GB | 16 GB | slow | Qwen 3 14B |
| 32B | ~20 GB | 32 GB | no | Qwen 3 32B |
| 70B | ~40 GB | 64 GB | no | Llama 3.3 70B |
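The Q4 SIZE column reflects the default 4-bit quantization Ollama ships. You can also request a specific quantization by tag; the tag names below follow the Ollama library's convention for Llama 3.1 and may differ for other models:

```
# Default pull — a 4-bit (Q4) build: ~8 billion weights × ~0.6 bytes each ≈ 5 GB on disk
$ ollama pull llama3.1:8b

# Slightly smaller 4-bit variant for tight RAM budgets
$ ollama pull llama3.1:8b-instruct-q4_K_S

# 8-bit variant — roughly double the file size, marginally better quality
$ ollama pull llama3.1:8b-instruct-q8_0
```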
A modern laptop with 16 GB of RAM is the practical floor. The 8B tier runs on CPU at roughly 5–15 tokens per second, which is slow but usable for non-interactive tasks. Adding a graphics card lifts that to 30–60 tokens per second on a used RTX 3060 12GB for around 200 USD. Apple Silicon Macs do not need a separate GPU because unified memory lets the on-chip GPU use system RAM as VRAM directly. An M4 Max with 64 GB runs 70B models at usable speed without any external accelerator.
Beyond 70B, you are in workstation or datacenter territory: an RTX PRO 6000 Blackwell with 96 GB, or rented cloud GPUs. For a precise answer about your specific machine, see Will it run?
What to do with it
Once the prompt is open, treat it as a private ChatGPT. The model answers questions, drafts text, explains concepts, summarizes, and translates. Beyond the chat prompt, Ollama exposes an HTTP API on port 11434 that any program on your machine can talk to. This is what makes local AI useful for real work, not just experiments.
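As a sketch of what that API looks like, here is a one-shot request to the local endpoint using Ollama's documented /api/generate route; the model tag and prompt are just examples:

```
# Ask the local server for a completion — nothing leaves your machine
$ curl http://localhost:11434/api/generate -d '{
    "model": "llama3.1:8b",
    "prompt": "Summarize why local AI is private, in one sentence.",
    "stream": false
  }'
# The reply is a JSON object whose "response" field holds the generated text
```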
For coding, Continue is a VS Code extension that uses your local model the way GitHub Copilot uses OpenAI. For agentic, multi-file code edits, Aider runs in your terminal. For document Q&A, RAG pipelines built on BGE-M3 embeddings let you ask questions of your own files. For image generation, ComfyUI with Flux produces results comparable to commercial tools.
For a complete index of capabilities and the recommended models, hardware, and runtimes for each, see /tasks — 94 entries spanning text, vision, image, video, audio, 3D, coding, RAG, agents, mobile, and scientific workloads.
What breaks
Out-of-memory errors are the most common failure. The model is too large for the available RAM, the runtime swaps to disk, and either the response generation slows to a crawl or the process is killed. Fix by choosing a smaller model (llama3.1:8b instead of llama3.1:70b) or a more aggressive quantization (llama3.1:8b-instruct-q4_K_S instead of the default).
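Before switching models, it helps to confirm what is actually loaded; ollama ps reports each loaded model's memory footprint and whether it is running on the GPU, the CPU, or split across both:

```
# A model shown as partly CPU / partly GPU did not fit in VRAM and will be slow
$ ollama ps
```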
Slow first response is normal. Loading the model into memory and warming up takes several seconds; subsequent messages in the same session are much faster. Run Ollama as a persistent server (it does this by default) so the model stays warm.
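By default the server unloads an idle model after a few minutes, so the next message pays the load cost again. The keep-alive window can be widened; OLLAMA_KEEP_ALIVE is the environment variable documented for this, though the exact syntax and default may vary by version:

```
# Keep an idle model resident for an hour instead of the default few minutes.
# Set this in the environment of the Ollama server process, then restart it.
$ export OLLAMA_KEEP_ALIVE=1h
$ ollama serve
```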
The model invents facts. All language models hallucinate, including the API ones. Treat outputs as drafts to verify, not authoritative answers. For factual work, attach a retrieval pipeline so the model cites real sources from your own documents — see private document analysis.
It does not remember anything between sessions. Each new prompt starts with a fresh context window. Long-running memory is a separate problem — see agent memory systems for the tooling.
The output is censored or refuses tasks. Open models ship with their own safety alignment. If the stock model refuses to help with something legitimate, swap to a less-aligned variant (Hermes, Dolphin, Wizard finetunes). Most refusals are over-cautious calibration, not a hard block.
When local is the wrong answer
For frontier reasoning — the hardest math problems, complex multi-step planning, novel research-level questions — the closed flagship APIs (GPT-5, Claude 3.7 Sonnet, Gemini 2.5) still outperform anything you can run locally. The gap on these tasks is real, not marketing. If your work is frontier-tier reasoning, mix a local model for everyday work with API access for the 5% of tasks that demand the frontier.
For interactive multi-user serving, local is the wrong shape. A consumer GPU serves one user well; serving 50 concurrent users requires a different runtime (vLLM or SGLang) and a datacenter card.
For hardware you do not own — a borrowed laptop, a corporate machine with locked admin, a Chromebook — the path is to rent a GPU. Cloud rental at 0.50–4.00 USD per hour gets you the same models on real hardware, billed by the minute.
References
Each of the following is a standalone reference. They assume you've read this primer.
- /tasks — what local AI can do, organised by capability
- /models — which model to use, with verdicts
- /families — the families and how they evolved
- /hardware — what to buy, by tier and use case
- /tools — runtimes, GUIs, agents, IDE plugins
- /will-it-run — calculator: enter your hardware, get a sized answer
- /pulse — what changed in the ecosystem this week
- /glossary — every term defined