03. What Can You Actually Run Locally?
The Capability Range
Local AI isn't one thing. It's a spectrum of capability that depends on your hardware: the same model runs faster or slower on different machines, and larger models need more memory and compute.
Small models (0.5B - 3B parameters):
- Run on CPUs with 8GB+ RAM
- Accessible to almost any modern machine
- Good for simple tasks: summarization, classification, basic generation
- Example models: Phi-2, TinyLlama, Qwen2-0.5B
Medium models (7B - 13B parameters):
- Run on modern laptops with 16GB+ RAM (using quantization)
- Run well on gaming GPUs: a 4-bit 7B model fits in about 6GB of VRAM (GTX 1060 6GB, RTX 3060, and better), while 13B is more comfortable with 12GB
- Capable of coherent conversation, code generation, writing assistance
- Example models: Llama 3.1 8B, Mistral 7B, Llama 2 13B
Large models (30B - 70B parameters):
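- Require a dedicated GPU with plenty of VRAM: a 24GB card handles a ~30B model at 4-bit, while 70B at 4-bit needs about 40GB, which means multiple GPUs or offloading part of the model to system RAM (much slower)
- Best quality available locally: approaches cloud quality on many tasks
- Example models: Llama 3.1 70B, Command R (35B), Yi 34B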
Very large models (100B+ parameters):
- Typically require several high-end GPUs (multiple RTX 4090s or A100s): even at 4-bit, a 100B model needs roughly 50GB for the weights alone
- Quantization is essential, and even then hardware is a serious constraint
- For most users, this isn't practical yet
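The memory figures behind these tiers follow from simple arithmetic: parameter count times bytes per weight, plus some headroom. The sketch below is a rough estimator, not a measurement. The function names and the 20% overhead allowance (for the KV cache and runtime buffers) are assumptions made for illustration, and real quantization formats often use slightly more than 4 bits per weight.

```python
def estimate_weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # Billions of parameters times bytes per parameter gives gigabytes directly.
    return params_billions * bytes_per_weight


def estimate_total_memory_gb(
    params_billions: float,
    bits_per_weight: float,
    overhead_fraction: float = 0.2,  # assumed allowance, not a measured value
) -> float:
    """Weights plus an assumed allowance for KV cache and runtime buffers."""
    weights = estimate_weight_memory_gb(params_billions, bits_per_weight)
    return weights * (1 + overhead_fraction)


if __name__ == "__main__":
    for name, size in [("1.1B", 1.1), ("7B", 7), ("13B", 13), ("70B", 70)]:
        fp16 = estimate_total_memory_gb(size, 16)
        q4 = estimate_total_memory_gb(size, 4)
        print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{q4:.1f} GB at 4-bit")
```

Under these assumptions a 7B model drops from about 17GB at 16-bit to about 4GB at 4-bit, which is why quantization moves it within reach of ordinary laptops and mid-range GPUs.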
Real-World Performance Expectations
The figures below are approximate ranges; actual speed depends on the runtime, the quantization format, and the context length.
On a laptop with 16GB RAM and integrated graphics (no dedicated GPU):
- TinyLlama (1.1B): 5-10 tokens/second
- Phi-2 (2.7B): 3-6 tokens/second
- Usable for simple tasks with patience
On a mid-range gaming PC with RTX 3060 (12GB VRAM):
- Llama 3.1 8B (Q4 quantization): 20-30 tokens/second
- Mistral 7B (Q4): 25-35 tokens/second
- Real-time conversation feel
On a high-end GPU (RTX 4090, 24GB VRAM):
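- Llama 2 13B (Q4): 40-60 tokens/second
- Llama 3.1 70B (Q4): about 40GB of weights, so it does not fit in 24GB of VRAM; with part of the model offloaded to system RAM, expect low single-digit tokens/second
- Capable of serious work with models that fit entirely in VRAM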
Token speed guide:
- <5 tok/s: noticeably slow, but usable for non-interactive tasks
- 5-15 tok/s: conversational with slight delay
- 15-30 tok/s: feels responsive
- 30+ tok/s: feels snappy
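To make these bands concrete, it helps to convert a token rate into the wait for a whole reply. The sketch below encodes the guide above as a small lookup and computes generation time; the function names are hypothetical, and treating each band's lower bound as inclusive is an assumption, since the guide does not say which band the boundary values belong to.

```python
def describe_speed(tokens_per_second: float) -> str:
    """Map a measured token rate onto the bands from the speed guide."""
    if tokens_per_second < 5:
        return "noticeably slow, but usable for non-interactive tasks"
    if tokens_per_second < 15:
        return "conversational with slight delay"
    if tokens_per_second < 30:
        return "feels responsive"
    return "feels snappy"


def seconds_for_reply(reply_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a reply of the given length, ignoring prompt processing."""
    return reply_tokens / tokens_per_second


if __name__ == "__main__":
    # A 300-token reply is a few paragraphs of text.
    for rate in (3, 10, 25, 50):
        wait = seconds_for_reply(300, rate)
        print(f"{rate} tok/s: {describe_speed(rate)} ({wait:.0f}s for 300 tokens)")
```

A 300-token answer takes a full minute at 5 tok/s but only 10 seconds at 30 tok/s, which is the practical difference between the bottom and top of the guide.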
Practical Recommendations
For a first local AI setup:
- Check your hardware first (we'll cover this in Chapter 7)
- Start with a 7B model - good capability-to-requirements ratio
- Use 4-bit quantization - it keeps most of the model's quality at roughly a quarter of the 16-bit memory footprint
- Adjust expectations - local 7B != cloud 70B, but it's close enough for many tasks
The question isn't "can I run local AI?" but "which model can I run that will be useful for my needs?" For most users, the answer is a 7B model with 4-bit quantization.
Run a benchmark on your current machine. Use a tool like Ollama (we'll install it in Chapter 8), run a simple prompt with a small model, and measure how many tokens per second you get. This gives you real numbers for what your hardware can handle.
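One way to take that measurement is through Ollama's local HTTP API, which reports how many tokens were generated (`eval_count`) and how long generation took in nanoseconds (`eval_duration`). The sketch below is a minimal example, assuming Ollama is running on its default port 11434 and that the model has already been pulled. The function names and the `tinyllama` model tag in the usage comment are illustrative choices, not part of this book's setup.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumed unchanged).
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's reported token count and duration."""
    seconds = response["eval_duration"] / 1e9  # eval_duration is in nanoseconds
    return response["eval_count"] / seconds


def run_benchmark(model: str, prompt: str) -> float:
    """Send one non-streaming prompt and return the measured tokens/second."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))


# Example usage (needs a running Ollama server and a pulled model):
# rate = run_benchmark("tinyllama", "Explain what a token is in one sentence.")
# print(f"{rate:.1f} tokens/second")
```

Run the prompt a few times and ignore the first result, since the first call also includes the time to load the model into memory. If you prefer the command line, `ollama run <model> --verbose` prints a similar eval rate after each reply.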