Can I run AI locally on my computer?
The literal answer to the literal question. What counts as local AI, the minimum hardware floor, the tier-by-tier capability ladder from 4 GB CPU laptops up to 24 GB GPU desktops, the 5-minute Ollama path, realistic tok/s ranges, and what you should not try on weak hardware.
The short answer
Almost certainly yes. If your computer was made in the last six years and has at least 8 GB of RAM, you can run a small language model locally today. If you have a discrete GPU with 6 GB of VRAM or more, you can run a model that genuinely competes with mid-tier cloud chat for most everyday tasks. The only computers that truly cannot run any local AI are sub-4 GB Chromebooks, ten-year-old netbooks, and locked-down corporate laptops where you cannot install software. Everything else is on a sliding scale of which models run, not whether any of them do.
The honest goal of this page is to tell you, in 5 minutes, what your specific machine can run and what to install first. If you already know your CPU, RAM, and GPU, jump straight to /will-it-run/custom and enter them — you will get a per-model verdict in seconds.
What counts as “running AI locally”
“Local AI” means the model weights live on your disk and the inference happens on your CPU or GPU. No data leaves the machine. There is no API key, no monthly bill, no rate limit, no internet requirement after the initial download. The model file (typically 2-40 GB) sits in a folder, a runtime program loads it, and you talk to it through a chat window or a command-line prompt.
Three categories you should keep distinct so you don't get confused by marketing:
- Pure local LLMs. Llama 3, Qwen 2.5, Mistral, Phi-4, Gemma — the open-weight models you download and run with Ollama, LM Studio, or llama.cpp. This is what most people mean.
- Local image / audio / video models. Stable Diffusion, Whisper, AudioGen, Wan2.2. Same principle, different modality. Heavier hardware floor — image models typically want 8 GB+ VRAM.
- Hybrid “local-feeling” products. Microsoft Copilot+, Apple Intelligence, Google Pixel AI features. These run small models on-device but escalate to a cloud server for harder queries. They are not what people on r/LocalLLaMA mean by “local” — but they prove the consumer-hardware path works.
The rest of this guide is about the first category — running open-weight LLMs you fully control.
The minimum hardware floor
The single most useful number to know is your RAM (if you have no discrete GPU) or your VRAM (the memory built into your GPU, if you have one). Models live in memory while they run. A model that doesn't fit in memory either won't load or will swap to disk and slow to a crawl.
- 4 GB total RAM, no GPU. Edge of viable. You can run Phi-3 Mini Q4 (about 2.3 GB on disk) at 3-8 tokens per second on a modern laptop CPU. Useful for short questions. Anything larger will swap. Honest verdict: try it for fun, don't expect production work.
- 8 GB total RAM, no GPU. Mainstream floor. Llama 3.2 3B and Qwen 2.5 7B Q4 both fit. Expect 5-15 tok/s on recent Intel/AMD CPUs, 10-25 tok/s on Apple Silicon (the unified-memory advantage is real here). This is where most people start.
- 16 GB total RAM, no GPU. Comfortable for 7-8B class models, possible for 13-14B. CPU inference still tops out around 10-15 tok/s on x86; Apple M-series can hit 20-40 tok/s on the same models because of memory bandwidth.
- 6-8 GB VRAM (RTX 3050 / 4060 / RX 7600). The first tier where using a local model feels comparable to a hosted chat experience. 7-8B Q4 models run at 30-80 tok/s. This is where most people stop noticing the latency.
- 12-16 GB VRAM (RTX 3060 12GB / 4060 Ti 16GB / 4070). The sweet spot. 14B Q4 models comfortably; 32B Q4 with tight context. This is the tier where local AI replaces a paid chat subscription for most everyday tasks.
- 24 GB VRAM (RTX 3090 / 4090 / 7900 XTX). 32B Q4 models run comfortably with generous context, and 70B-class models become usable through aggressive ~3-bit quants or partial CPU offload. You stop apologizing for the model.
Apple Silicon deserves a separate mention because its unified-memory architecture skips the VRAM/RAM split entirely. An M2 MacBook Air with 16 GB unified memory runs the same 8B models as a desktop with a 3060, and a Mac Studio with 64-128 GB can run models that no consumer NVIDIA card can hold.
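If you want to sanity-check a specific model against these tiers, the arithmetic is simple enough to script. The sketch below is a rough estimate only, assuming a Q4_K_M-style quantization averages about 4.5 bits per weight and that the runtime needs roughly 15% extra for buffers and a modest context window; real downloads vary by a gigabyte or two.

```python
# Rough check: will a quantized model fit in my RAM or VRAM?
# Assumptions (approximate): Q4_K_M averages ~4.5 bits per weight,
# plus ~15% overhead for runtime buffers and a small (~4K) context.

def model_footprint_gb(params_billion: float,
                       bits_per_weight: float = 4.5,
                       overhead: float = 1.15) -> float:
    weights_gb = params_billion * bits_per_weight / 8   # GB per billion params
    return weights_gb * overhead

def fits(params_billion: float, memory_gb: float) -> bool:
    # Leave ~2 GB of headroom for the OS and other apps.
    return model_footprint_gb(params_billion) <= memory_gb - 2

for size_b in (3, 8, 14, 32, 70):
    print(f"{size_b:>3}B ~ {model_footprint_gb(size_b):5.1f} GB   "
          f"fits in 16 GB: {fits(size_b, 16)}")
```

Run it and the tier boundaries above fall out of the numbers: an 8B model needs about 5 GB, a 14B about 9 GB, and a 70B about 45 GB, which is why the last one belongs on 24 GB cards only with heavier quantization or offloading.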
The capability ladder by tier
Once you know your tier, here is approximately what the experience feels like. Numbers are honest ranges across recent reports; your mileage will vary by 20-30% depending on quantization, context length, and thermal throttling. A short sketch that turns these tiers into a concrete Ollama starting model follows the list.
- Tier 1 — CPU only, 8 GB RAM. Best model: Phi-4 Mini or Qwen 2.5 3B Q4. Use case: drafting emails, simple summaries, quick lookups. Tok/s: 8-20. Don't try: code generation past a few hundred lines, reasoning chains, anything with a 20K+ context.
- Tier 2 — CPU only, 16 GB RAM, or 6-8 GB VRAM. Best model: Llama 3.1 8B or Qwen 2.5 7B Q4. Use case: general chat, document Q&A on short documents, simple code completion. Tok/s: 15-50. Don't try: long-context book-length analysis.
- Tier 3 — 12-16 GB VRAM. Best model: Qwen 2.5 14B Q4 or Mistral Small. Use case: this is where local AI feels “real.” Code that compiles, multi-step reasoning, structured output. Tok/s: 30-80. Don't try: 100K-token context windows on the same hardware.
- Tier 4 — 24 GB VRAM or M-series 32-64 GB unified. Best model: Qwen 2.5 32B (Q4 or AWQ), a DeepSeek R1 distillation, or Llama 3.3 70B Q4 at the top of the tier (64 GB unified memory, or 24 GB VRAM with partial CPU offload). Use case: replaces ChatGPT Plus for most non-frontier tasks. Tok/s: 15-60 depending on model size.
- Tier 5 — multi-GPU or M3 Ultra 192-512 GB. Best model: full-precision 70B+ or 100B+ MoE. Use case: research, fine-tuning, agent workflows. This guide does not address this tier — see /guides/running-local-ai-on-multiple-gpus-2026.
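If you would rather not eyeball the ladder, the mapping is mechanical enough to write down. A minimal sketch, assuming the tier boundaries above; the `suggest_model` helper is illustrative, and the Ollama tag names follow current library conventions that may change over time.

```python
# Hypothetical helper: map your memory situation to a starting Ollama tag,
# following the tier ladder above. Tag names may drift as the library evolves.

def suggest_model(vram_gb: float = 0.0, ram_gb: float = 8.0,
                  unified: bool = False) -> str:
    # On Apple Silicon, unified memory plays the role of VRAM.
    mem = ram_gb if (unified or vram_gb == 0) else vram_gb
    has_gpu = unified or vram_gb > 0
    if has_gpu and mem >= 24:
        return "qwen2.5:32b"    # Tier 4
    if has_gpu and mem >= 12:
        return "qwen2.5:14b"    # Tier 3
    if has_gpu and mem >= 6:
        return "llama3.1:8b"    # Tier 2
    if mem >= 16:
        return "llama3.1:8b"    # Tier 2, CPU only and slower
    if mem >= 8:
        return "llama3.2:3b"    # Tier 1
    return "phi3:mini"          # edge of viable

print(suggest_model(vram_gb=12))                 # qwen2.5:14b
print(suggest_model(ram_gb=16, unified=True))    # qwen2.5:14b (tight but possible)
```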
The 5-minute Ollama install path
The fastest way from “is this possible?” to “I have a model running” is Ollama. It abstracts away the hard parts (quantization choice, runtime flags, kernel selection) and ships sensible defaults. Five concrete steps, any platform:
- Download Ollama from ollama.com. macOS gets a drag-to-Applications installer; Windows gets a .exe; Linux is a one-line shell install.
- Open a terminal and run `ollama pull llama3.2:3b`. This downloads about 2 GB. Substitute `llama3.1:8b` if you have 8 GB+ VRAM, or `qwen2.5:14b` if you have 12 GB+.
- Run `ollama run llama3.2:3b`. You are now in a chat session with a local model. No internet required from here (a scripted version of the same call, through Ollama's local API, appears after this list).
- Optional: install LM Studio if you want a GUI instead of a terminal. It plugs into the same model files and is friendlier for non-developers.
- Optional: install Open WebUI if you want a ChatGPT-style web interface that talks to the Ollama server you just installed.
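Once `ollama run` works interactively, the same model is also reachable from code: Ollama keeps a small HTTP server running locally (port 11434 by default). A minimal sketch of that call, assuming the default port and a model you have already pulled; the prompt text is just an example.

```python
# Ask the locally running Ollama server a question over its HTTP API.
# Assumes the default endpoint (http://localhost:11434) and that
# `ollama pull llama3.2:3b` has already been run.
import json
import urllib.request

payload = json.dumps({
    "model": "llama3.2:3b",
    "prompt": "Explain what a quantized model is in two sentences.",
    "stream": False,   # return a single JSON object instead of a stream
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

print(body["response"])   # the generated text, produced fully offline
```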
If anything fails, the fix is almost always one of three things — see /errors for the full taxonomy. The most common: not enough VRAM (the runtime falls back to CPU and feels slow), GPU drivers out of date, or — on Windows AMD — ROCm not properly installed.
What tok/s actually means and what to expect
“Tokens per second” is the unit you will see everywhere. A token is roughly 0.75 of an English word, so 10 tok/s is about fast reading speed, 30-60 tok/s comfortably outpaces reading, and anything above 100 tok/s feels effectively instantaneous. Below 10 tok/s feels painful for chat but is fine for batch jobs (summarize 100 documents overnight).
Two numbers actually matter: time-to-first-token (latency before output starts) and generation tok/s (sustained throughput once it does). On CPU-only setups, time-to-first-token grows linearly with the prompt length and can take 5-30 seconds for long inputs. On GPU setups it is typically under a second.
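You do not have to take the ranges above on faith: Ollama reports both numbers in the metadata of every non-streamed response. A rough measurement sketch, assuming the duration fields the server returns (`load_duration`, `prompt_eval_duration`, `eval_count`, `eval_duration`, durations in nanoseconds); treat the result as a ballpark, since the first call also includes model-loading time.

```python
# Rough benchmark of a local Ollama model: approximate time-to-first-token
# and sustained generation speed, read from the response metadata.
import json
import urllib.request

def benchmark(model: str, prompt: str) -> None:
    payload = json.dumps({"model": model, "prompt": prompt,
                          "stream": False}).encode("utf-8")
    req = urllib.request.Request("http://localhost:11434/api/generate",
                                 data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        r = json.load(resp)

    # Durations are in nanoseconds; model load + prompt processing
    # approximates the wait before the first token appears.
    ttft_s = (r.get("load_duration", 0) + r.get("prompt_eval_duration", 0)) / 1e9
    gen_tps = r["eval_count"] / (r["eval_duration"] / 1e9)
    print(f"~time to first token: {ttft_s:.1f} s   generation: {gen_tps:.1f} tok/s")

benchmark("llama3.2:3b", "Summarize the plot of Hamlet in one paragraph.")
```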
What NOT to run on weak hardware
Avoiding frustration is half the battle. Specific anti-recommendations:
- Do not run 70B models on 16 GB or less. They will swap to disk. You will get 0.5-2 tok/s. The model will be technically running. The experience will be miserable.
- Do not use Q2 quantization to fit a model that almost fits. The quality cliff at Q2 is severe — output becomes incoherent on hard prompts. Stay at Q4_K_M or above. See /systems/quantization-formats.
- Do not try image generation on a CPU. Stable Diffusion on a CPU is 30-60 seconds per image at low quality. The same model on a 12 GB GPU is 1-3 seconds. The hardware floor is real.
- Do not run two models simultaneously on a single 8 GB GPU. They will fight for memory; one will swap; both will be slow. Pick one model and let it own the GPU.
- Do not enable 128K-context loading on a 12 GB card. The KV cache for long context is enormous, sometimes larger than the model itself. Stay under 16K context unless you have actually done the math; a rough estimator follows this list.
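That last point is easy to verify with arithmetic, because KV-cache size is just a product of the model's attention geometry and the context length. A back-of-the-envelope sketch, assuming an fp16 cache and using Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dimension 128) as the example; other architectures will differ.

```python
# Back-of-the-envelope KV-cache size:
#   2 (keys + values) x layers x KV heads x head dim x context x bytes/element
# Example figures are Llama 3.1 8B's architecture with an fp16 cache.

def kv_cache_gb(n_layers: int = 32, n_kv_heads: int = 8, head_dim: int = 128,
                context_len: int = 16_384, bytes_per_elem: int = 2) -> float:
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_elem) / 1e9

for ctx in (4_096, 16_384, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gb(context_len=ctx):5.1f} GB of KV cache")
# 131,072 tokens works out to ~17 GB: more than the ~5 GB Q4 model itself,
# which is exactly the trap the bullet above warns about.
```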
Where to go next
Three concrete next steps depending on where you landed:
- If you don't know your hardware specs: open Task Manager (Windows) or About This Mac (macOS), find your RAM and GPU, then run /will-it-run/custom.
- If you know your specs and want a tool roundup: read /guides/free-ai-tools-that-run-on-your-computer.
- If you're thinking about upgrading: /guides/best-hardware-for-running-local-ai-models walks through the buying ladder from $0 to $4000.
The bigger picture: in 2026 the answer to “can I run AI locally?” is essentially always yes. The interesting question is which model on which hardware for which task — and that is what the rest of this site exists to answer. Browse /models for the model catalog, /hardware for the hardware catalog, or /setup for the path-finder that combines them.
If you want one number to start from, it is VRAM. A dedicated GPU with 12 GB opens the door to quantized 13-14B models at reading speed, and even integrated graphics on a modern laptop can handle 3B models for summarization and drafting. For the buying side, start with the best budget GPU for local AI guide and the hardware compatibility lookup; both remove the guesswork before you spend a cent.