RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
What is Local AI — And Why It Matters

03. What Can You Actually Run Locally?

Chapter 3 of 20 · 18 min
KEY INSIGHT

Local AI spans a wide capability range, from tiny models (around 1B parameters) to large ones (70B+). The right choice depends on your hardware: most users with 16GB of RAM can run a capable 7B model using 4-bit quantization.

The Capability Range

Local AI isn't one thing: it's a spectrum of capability that depends on your hardware. The same model runs at different speeds on different machines, and models of different sizes need different amounts of memory.

Small models (0.5B - 3B parameters):

  • Run on CPUs with 8GB+ RAM
  • Accessible to almost any modern machine
  • Good for simple tasks: summarization, classification, basic generation
  • Example models: Phi-2, TinyLlama, Qwen2-0.5B

Medium models (7B - 13B parameters):

  • Run on modern laptops with 16GB+ RAM (using quantization)
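  • Run well on gaming GPUs with enough VRAM for the quantized weights (RTX 3060 12GB and better; a 6GB GTX 1060 can hold a 4-bit 7B model, but with little room to spare)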
  • Capable of coherent conversation, code generation, writing assistance
  • Example models: Llama 3.1 8B, Mistral 7B, Phi-3 Small (7B)

Large models (30B - 70B parameters):

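  • Require a dedicated GPU, and usually more than 12GB of VRAM: at 4-bit, the weights alone are roughly 15GB for a 30B model and 35GB for a 70B model, so a 12GB card works only with part of the model offloaded to system RAM, which is slower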
  • Top performance for local: approaches cloud quality on many tasks
  • Example models: Llama 3.1 70B, Command R (35B), Yi-34B (Mistral Large 2 at 123B and Command R+ at 104B belong in the next tier)

Very large models (100B+ parameters):

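  • Require multiple high-end GPUs or data-center cards (several RTX 4090s, or an 80GB A100): at 4-bit, a 100B model needs about 50GB for the weights alone, more than a single 24GB RTX 4090 holds
  • Quantization is essential, and even then hardware is a serious constraint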
  • For most users, this isn't practical yet
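
The tiers above follow from simple arithmetic: weight memory is roughly parameter count times bytes per weight. Here is a minimal sketch of that estimate. The function name is illustrative, and the result covers weights only: it ignores the context cache and runtime overhead, and real 4-bit formats store some extra scaling data, so treat the figures as lower bounds.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes).

    Assumption: every weight is stored at exactly `bits_per_weight` bits.
    """
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billion * bytes_per_weight


if __name__ == "__main__":
    # Compare 16-bit and 4-bit storage across the size tiers in this chapter.
    for size in (1, 3, 7, 13, 30, 70, 100):
        fp16 = weight_memory_gb(size, 16)
        q4 = weight_memory_gb(size, 4)
        print(f"{size:>4}B  16-bit: {fp16:6.1f} GB   4-bit: {q4:5.1f} GB")
```

For a 7B model this gives about 14GB at 16-bit and 3.5GB at 4-bit, which is why quantized 7B models fit on ordinary laptops while 70B models do not.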

Real-World Performance Expectations

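Here are approximate throughput ranges for typical hardware. Actual speeds vary with the quantization format, context length, and the runtime you use: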

On a laptop with 16GB RAM and integrated graphics (no GPU):

  • TinyLlama (1.1B): 5-10 tokens/second
  • Phi-2 (2.7B): 3-6 tokens/second
  • Usable for simple tasks with patience

On a mid-range gaming PC with RTX 3060 (12GB VRAM):

  • Llama 3.1 8B (Q4 quantization): 20-30 tokens/second
  • Mistral 7B (Q4): 25-35 tokens/second
  • Real-time conversation feel

On a high-end GPU (RTX 4090, 24GB VRAM):

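  • Llama 2 13B (Q4): 40-60 tokens/second
  • Llama 3.1 70B (Q4): 15-25 tokens/second only when the whole model sits in VRAM; at 4-bit the weights alone are roughly 35GB, more than a single 24GB card holds, so expect much lower speeds once layers are offloaded to system RAM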
  • Capable for serious work

Token speed guide:

  • <5 tok/s: noticeably slow, but usable for non-interactive tasks
  • 5-15 tok/s: conversational with slight delay
  • 15-30 tok/s: feels responsive
  • 30+ tok/s: feels snappy
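
To make the guide concrete, the sketch below maps a measured speed to one of the tiers above and estimates how long a reply takes. The function names are illustrative. The guide's ranges share their endpoints (15 and 30), so this sketch assumes each boundary value belongs to the faster tier.

```python
def speed_tier(tokens_per_second: float) -> str:
    """Map a measured generation speed to the tiers in the token speed guide.

    Assumption: boundary values (5, 15, 30) count toward the faster tier.
    """
    if tokens_per_second < 5:
        return "slow, but usable for non-interactive tasks"
    if tokens_per_second < 15:
        return "conversational with slight delay"
    if tokens_per_second < 30:
        return "responsive"
    return "snappy"


def seconds_for_reply(reply_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a reply of `reply_tokens` tokens at a steady speed."""
    return reply_tokens / tokens_per_second


if __name__ == "__main__":
    for speed in (3, 10, 25, 50):
        wait = seconds_for_reply(300, speed)
        print(f"{speed:>3} tok/s: {speed_tier(speed)} (300-token reply in {wait:.0f} s)")
```

A 300-token reply is a few paragraphs of text, so the gap between 3 and 25 tokens/second is the gap between waiting 100 seconds and waiting 12.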

Practical Recommendations

For a first local AI setup:

  1. Check your hardware first (we'll cover this in Chapter 7)
  2. Start with a 7B model - good capability-to-requirements ratio
  3. Use 4-bit quantization - it keeps most of the model's quality while cutting weight memory to roughly a quarter of the 16-bit size
  4. Adjust expectations - a local 7B model will not match a cloud-hosted 70B model, but it is close enough for many tasks

The question isn't "can I run local AI" but "which model can I run that will be useful for my needs." For most users, the answer is a 7B model with 4-bit quantization.
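
One way to turn "which model can I run" into a quick check is to pick the largest size whose quantized weights fit in the memory you can spare. The sketch below is a rough heuristic, not a sizing tool. The function name, the candidate sizes, and the 1.2 headroom factor (a guess at space for the context cache and runtime) are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def largest_model_that_fits(
    free_memory_gb: float,
    candidate_sizes_billion: Sequence[float] = (1, 3, 7, 13, 30, 70),
    bits_per_weight: float = 4,
    headroom: float = 1.2,  # assumed allowance for context cache and runtime
) -> Optional[float]:
    """Return the largest candidate size (in billions of parameters) whose
    estimated footprint fits in `free_memory_gb`, or None if none fit."""
    fitting = [
        size
        for size in candidate_sizes_billion
        if size * bits_per_weight / 8 * headroom <= free_memory_gb
    ]
    return max(fitting) if fitting else None


if __name__ == "__main__":
    for memory in (6, 24, 48):
        print(f"{memory} GB free -> up to {largest_model_that_fits(memory)}B at 4-bit")
```

Pass the memory that is actually free (VRAM for GPU inference, or RAM left over after the operating system and other programs), not the machine's advertised total.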

EXERCISE

Run a benchmark on your current machine. Use a tool like Ollama (we'll install it in Chapter 8), run a simple prompt with a small model, and measure how many tokens per second you get. This gives you real numbers for what your hardware can handle.
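
A minimal sketch of this measurement, assuming Ollama is running on its default local port and a small model has already been pulled (the model name in the usage comment is only an example). With streaming disabled, Ollama's generate endpoint returns a JSON object that includes `eval_count` (tokens generated) and `eval_duration` (in nanoseconds), and the speed is their ratio.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response: dict) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def run_prompt(model: str, prompt: str) -> dict:
    """Send one non-streaming prompt to the local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


# Usage (needs a running Ollama server and a pulled model):
#   result = run_prompt("tinyllama", "Explain what a token is in one paragraph.")
#   print(f"{tokens_per_second(result):.1f} tokens/second")
```

Run the prompt two or three times and ignore the first result, since the first call also includes the time spent loading the model into memory.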
