I want my AI conversations to stay private — what's the realistic local-first setup?
The answer
One paragraph. No hedging beyond what the data actually warrants.
Local AI for privacy is real and shipping today — but you have to know which workloads it actually covers and which it doesn't.
The framing matters: "local AI" doesn't just mean "running a model"; it means the full inference loop happens on your hardware, with no network calls after setup. That distinction is what carries the privacy guarantee.
What stays local (genuinely):
- The model weights (downloaded once, then loaded into VRAM)
- Your prompts (never leave the box)
- The model's output (rendered locally)
- Your chat history (your filesystem, not a vendor account)
- Embeddings of your documents (local vector store like ChromaDB or LanceDB)
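To make the last point concrete, here is a minimal sketch of a local vector store, assuming the chromadb Python package; the path, collection name, and sample documents are placeholders. Chroma's default embedder is a small model it downloads once and then runs locally, so indexing and querying never leave your filesystem.

```python
import chromadb

# A persistent store on your own filesystem; nothing is uploaded anywhere.
client = chromadb.PersistentClient(path="./private_index")  # example path

collection = client.get_or_create_collection("my_docs")

# Chroma's default embedder is a small sentence-embedding model it downloads
# once and then runs locally; documents and vectors live under ./private_index.
collection.add(
    ids=["note-1", "note-2"],
    documents=[
        "Quarterly plan draft: ship the local RAG prototype by March.",
        "Meeting notes: no cloud inference for anything in the HR folder.",
    ],
)

hits = collection.query(query_texts=["what did we decide about HR documents?"], n_results=1)
print(hits["documents"][0][0])
```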
What can still leak (without you noticing):
- Telemetry from the runtime (Ollama sends nothing; some other apps do)
- Update checks (most apps phone home for version checks)
- Cloud-backed features (Open WebUI's web search needs a cloud search API)
- Browser-extension AI assistants that route your text to a vendor
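You don't have to take "sends nothing" claims on faith; you can spot-check what is actually talking to the network. A rough sketch using the psutil package (an assumption, not part of any of these apps); the process names in WATCHED are examples, and a real audit would pair this with a packet capture or an outbound firewall.

```python
import psutil

# Process names to watch; adjust to the AI apps you actually run.
WATCHED = {"ollama", "jan", "anythingllm", "aider"}

# Listing other processes' sockets may need elevated privileges on macOS/Windows.
for conn in psutil.net_connections(kind="inet"):
    if not (conn.raddr and conn.pid):
        continue  # skip listening sockets and entries without a PID
    try:
        name = psutil.Process(conn.pid).name().lower()
    except psutil.NoSuchProcess:
        continue
    if any(w in name for w in WATCHED):
        # Anything here that isn't 127.0.0.1 / ::1 deserves an explanation.
        print(f"{name} (pid {conn.pid}) -> {conn.raddr.ip}:{conn.raddr.port} [{conn.status}]")
```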
The realistic privacy-first stack:
| Layer | Pick | Why |
|---|---|---|
| Runtime | Ollama or llama.cpp | Zero telemetry, source-buildable |
| Chat UI | Jan, AnythingLLM, or PrivateGPT | All open-source, all air-gappable |
| Coding agent | Aider | Terminal-native, no cloud dependency |
| Document RAG | Khoj or PrivateGPT | Local embedder + local vector index |
| Voice | MacWhisper / Buzz | Local Whisper, no cloud STT |
| Mobile | Enchanted (iOS, talks to your home server) | No vendor account, no chat upload |
The migration playbook (3 steps):
Step 1: Set up Ollama + Jan on your daily workstation. 30 minutes. Pull Llama 3.1 8B Q4_K_M as your default model (5GB). You now have a private chat with no vendor account.
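If you want to verify the loop from code rather than the Jan UI, here is a minimal sketch assuming the official ollama Python client (pip install ollama); the model tag is an example, so swap in whatever `ollama list` reports. After the one-time pull, the chat call only talks to the local server.

```python
import ollama

MODEL = "llama3.1:8b"  # swap for the exact tag you pulled (e.g. a Q4_K_M build)

# The pull is the one network call (weights download); everything after runs locally.
ollama.pull(MODEL)

reply = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "In two sentences: why does local inference help privacy?"}],
)
print(reply["message"]["content"])
```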
Step 2: Move your "ask AI about my docs" workflow to PrivateGPT or Khoj. Point it at the folders you'd otherwise paste into Claude/ChatGPT. Local embeddings + local generation = nothing leaves your machine.
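Under the hood, PrivateGPT and Khoj both implement roughly this loop: embed locally, retrieve locally, generate locally. A hedged sketch using chromadb plus Ollama for both embeddings and generation; the nomic-embed-text embedder, the chunk placeholders, and the paths are assumptions, so swap in whatever models you have actually pulled.

```python
import chromadb
import ollama

EMBED_MODEL = "nomic-embed-text"   # a local embedding model served by Ollama
CHAT_MODEL = "llama3.1:8b"

client = chromadb.PersistentClient(path="./docs_index")
docs = client.get_or_create_collection("my_private_docs")

def embed(text: str) -> list[float]:
    # Embedding runs against the local Ollama server, not a cloud API.
    return ollama.embeddings(model=EMBED_MODEL, prompt=text)["embedding"]

# Indexing: a real setup would walk your folders and chunk each file.
for i, chunk in enumerate(["example chunk of a local document", "another example chunk"]):
    docs.add(ids=[f"chunk-{i}"], embeddings=[embed(chunk)], documents=[chunk])

# Query: retrieve the closest chunks, then let the local model answer from them.
question = "What does the contract say about data retention?"
hits = docs.query(query_embeddings=[embed(question)], n_results=3)
context = "\n\n".join(hits["documents"][0])

answer = ollama.chat(
    model=CHAT_MODEL,
    messages=[{"role": "user", "content": f"Answer from this context only:\n{context}\n\nQ: {question}"}],
)
print(answer["message"]["content"])
```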
Step 3: Move coding agent work to Aider + Qwen 2.5 Coder, if you have a 24GB GPU. If you don't, Aider against Qwen 2.5 Coder 7B is still meaningfully better than nothing, and it's all local.
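Pointing Aider at a local Ollama model is mostly a matter of its --model flag. The sketch below (Python for consistency with the other examples) assumes the "ollama/" model-prefix convention and the OLLAMA_API_BASE variable, both worth double-checking against Aider's current docs for your version; the repo path is a placeholder.

```python
import os
import subprocess

# Assumption: Aider reaches the local Ollama server through these settings.
# Verify the exact model prefix ("ollama/" vs "ollama_chat/") in Aider's docs.
env = {**os.environ, "OLLAMA_API_BASE": "http://127.0.0.1:11434"}

subprocess.run(
    ["aider", "--model", "ollama/qwen2.5-coder:7b"],  # use a larger Qwen tag if you have the VRAM
    cwd="/path/to/your/repo",  # placeholder: run from the project you want Aider to edit
    env=env,
)
```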
What stays cloud (be honest):
- One-shot complex reasoning that genuinely needs a frontier model (Claude Opus, GPT-5)
- Vision tasks at high quality (open-weight vision models are catching up but not there yet)
- 405B+ model capabilities (you can't run those locally without a datacenter)
The privacy reality check: if your threat model is "I don't want Anthropic / OpenAI / Google reading my prompts," local AI solves it completely for ~80% of typical use. If your threat model is "I want zero electronic trace of my AI use," that's a different problem (network isolation, file-system encryption, etc.) — local AI is necessary but not sufficient.
Explore the numbers for your specific stack
Where we got the numbers
Ollama telemetry policy: ollama.com docs + source code review. Jan license + air-gap design: janhq/jan README. The 'AI conversations never leave the box' threat model framing maps to what r/privacy threads typically ask for in 2026.
Also see
- The cost-driven migration path (same destination as the privacy-driven one, different reason for going).
- The OG offline-RAG project. The Khoj alternative when you want maximum auditable open-source.
- Single-binary desktop AI. Curated model catalog. AGPL-3.0.
- Hardware + runtime + model picks tuned to fully-offline operation.
Other questions in this thread
Other /q/ landings on the same topic — same editorial discipline.
- Why doesn't my local LLM have web search — and what are the actual offline alternatives?
- Is fine-tuning dead in 2026? RAG vs distillation vs prompting — when does fine-tuning actually win?
- Persistent KV cache vs RAG — which one should I use for 'chat with my docs'?
- Should I fine-tune, or just use a better prompt?
Found this via a forum search? Bookmark the URL — we update these pages as new data lands. Have a question that should live here? Open a GitHub issue.