What Can You Actually Run Locally? — What is Local AI — And Why It Matters (Chapter 3)

The Capability Range

Local AI isn't one thing—it's a spectrum of capability depending on your hardware. The same model runs differently on different machines, and different models have different requirements.

Small models (0.5B - 3B parameters):

Run on CPUs with 8GB+ RAM
Accessible to almost any modern machine
Good for simple tasks: summarization, classification, basic generation
Example models: Phi-2, TinyLlama, Qwen2-0.5B

Medium models (7B - 13B parameters):

Run on modern laptops with 16GB+ RAM (using quantization)
Run well on gaming GPUs (GTX 1060, RTX 3060, and better)
Capable of coherent conversation, code generation, writing assistance
Example models: Llama 3.1 8B, Mistral 7B, Phi-3-small (7B)

Large models (30B - 70B parameters):

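Need roughly 15GB (30B) to 35GB (70B) for the weights alone at 4-bit, so plan on a 24GB GPU at the low end and multiple GPUs or partial CPU offloading for 70B
Top performance for local: approaches cloud quality on many tasks
Example models: Llama 3.1 70B, Qwen2.5 32B, Command R (35B)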

Very large models (100B+ parameters):

Require multiple high-end GPUs (several RTX 4090s or A100s): at 4-bit, a 100B model needs about 50GB for the weights alone
Quantization essential—even then, hardware is a serious constraint
For most users, this isn't practical yet
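
As a rough rule of thumb, the memory a model's weights need is the parameter count multiplied by the bytes stored per parameter. The sketch below works through that arithmetic. The function name weight_memory_gb is illustrative, and it counts weights only: real usage is higher once the context cache and runtime overhead are included.

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: ignores the KV cache, activations, and runtime overhead,
    all of which add to the total in practice.
    """
    bytes_per_param = bits_per_param / 8
    # billions of parameters x bytes per parameter = gigabytes
    return params_billions * bytes_per_param


if __name__ == "__main__":
    for name, size in [("1.1B", 1.1), ("7B", 7), ("13B", 13), ("70B", 70)]:
        fp16 = weight_memory_gb(size, 16)
        q4 = weight_memory_gb(size, 4)
        print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{q4:.1f} GB at 4-bit")
```

By this estimate a 7B model shrinks from about 14GB at 16-bit to about 3.5GB at 4-bit, which is why quantization brings it within reach of a 16GB laptop.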

Real-World Performance Expectations

Here are approximate figures for typical hardware; actual speeds vary with the runtime, quantization level, and context length:

On a laptop with 16GB RAM and integrated graphics (no GPU):

TinyLlama (1.1B): 5-10 tokens/second
Phi-2 (2.7B): 3-6 tokens/second
Usable for simple tasks with patience

On a mid-range gaming PC with RTX 3060 (12GB VRAM):

Llama 3.1 8B (Q4 quantization): 20-30 tokens/second
Mistral 7B (Q4): 25-35 tokens/second
Real-time conversation feel

On a high-end GPU (RTX 4090, 24GB VRAM):

Llama 2 13B (Q4): 40-60 tokens/second
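Llama 3.1 70B (Q4): about 35GB of weights, which does not fit in 24GB of VRAM; layers spill to system RAM, and speed falls well below the roughly 15-25 tokens/second possible when the model is fully GPU-resident (for example, across two 24GB cards)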
Capable for serious work

Token speed guide:

<5 tok/s: noticeably slow, but usable for non-interactive tasks
5-15 tok/s: conversational with slight delay
15-30 tok/s: feels responsive
30+ tok/s: feels snappy
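
To make the guide concrete, the sketch below maps a measured speed to the labels above and estimates how long a reply takes to generate. The function names and the 300-token reply length are illustrative assumptions. Because the guide's ranges share their endpoints, the sketch assigns each boundary value to the faster tier.

```python
def describe_speed(tokens_per_second: float) -> str:
    """Return the label from the token speed guide for a given speed.

    Assumption: boundary values (5, 15, 30) belong to the faster tier.
    """
    if tokens_per_second < 5:
        return "noticeably slow, but usable for non-interactive tasks"
    if tokens_per_second < 15:
        return "conversational with slight delay"
    if tokens_per_second < 30:
        return "feels responsive"
    return "feels snappy"


def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a steady generation speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    reply_tokens = 300  # hypothetical length of a few paragraphs
    for speed in (4, 10, 25, 50):
        wait = seconds_to_generate(reply_tokens, speed)
        print(f"{speed} tok/s: {describe_speed(speed)} ({wait:.0f}s per reply)")
```

At 4 tokens/second that 300-token reply takes 75 seconds; at 25 tokens/second it takes 12.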

Practical Recommendations

For a first local AI setup:

Check your hardware first (we'll cover this in Chapter 7)
Start with a 7B model - good capability-to-requirements ratio
Use 4-bit quantization - keeps most of the quality while cutting weight memory to about a quarter of 16-bit
Adjust expectations - local 7B != cloud 70B, but it's close enough for many tasks
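
The first step, checking your hardware, can be reduced to a simple lookup. The sketch below is one way to express it. The function name and the exact thresholds are assumptions chosen for illustration, loosely following the tiers in this chapter, and should be read as starting points only.

```python
def suggest_starting_tier(ram_gb: float, vram_gb: float = 0.0) -> str:
    """Suggest a model size tier to try first.

    Assumption: the thresholds below are illustrative, not measured limits.
    ram_gb is system memory; vram_gb is dedicated GPU memory (0 if none).
    """
    if vram_gb >= 24:
        return "large (around 30B, 4-bit)"
    if vram_gb >= 8 or ram_gb >= 16:
        return "medium (7B-13B, 4-bit)"
    if ram_gb >= 8:
        return "small (0.5B-3B)"
    return "none of the tiers listed here"


if __name__ == "__main__":
    machines = [
        ("older laptop", 8, 0),
        ("modern laptop", 16, 0),
        ("gaming PC, RTX 3060", 32, 12),
        ("workstation, RTX 4090", 64, 24),
    ]
    for name, ram, vram in machines:
        print(f"{name}: start with {suggest_starting_tier(ram, vram)}")
```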

The question isn't "can I run local AI" but "which model can I run that will be useful for my needs." For most users, the answer is a 7B model with 4-bit quantization.