How do I build a fully-local voice-to-voice pipeline?
The answer
Three components, all local, all real-time-capable on a 12GB+ GPU.
The pipeline:
[microphone] → [STT: whisper.cpp] → [LLM: Ollama] → [TTS: Piper] → [speaker]
Component 1: Speech-to-text (Whisper)
- whisper.cpp with Whisper Large v3 Q5_0: ~1.5 GB VRAM. ggerganov's published benchmarks show consumer mid-range GPUs running it comfortably faster than real time; measure on your hardware before sizing.
- whisper.cpp with Whisper Medium Q5_0: ~770 MB VRAM, materially faster than Large at a quality cost that matters less for clean-mic input than for noisy audio.
- Buzz (cross-platform Qt app) or MacWhisper (macOS native) for one-click setup
- Latency budget: under a second for short utterances on a 12GB GPU is a realistic target; your numbers will depend on chunking strategy and audio length. A minimal invocation sketch follows this list.
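If you want to script it rather than use a GUI app, here is a minimal Python sketch, assuming a local whisper.cpp build; the binary path, model path, and WAV filename are placeholders to adjust for your setup:

```python
import subprocess

# Placeholders: point these at your whisper.cpp build and downloaded model.
WHISPER_BIN = "./whisper-cli"                  # named "main" in older whisper.cpp builds
WHISPER_MODEL = "models/ggml-medium-q5_0.bin"  # or ggml-large-v3-q5_0.bin

def transcribe(wav_path: str) -> str:
    """Run whisper.cpp on a 16 kHz mono WAV and return the plain-text transcript."""
    out = subprocess.run(
        [WHISPER_BIN, "-m", WHISPER_MODEL, "-f", wav_path, "--no-timestamps"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

if __name__ == "__main__":
    print(transcribe("utterance.wav"))
```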
Component 2: LLM (Ollama recommended)
- Llama 3.1 8B Q4_K_M: ~5 GB VRAM, fast enough on 8GB+ GPUs that streaming response keeps up with conversational pacing
- Latency budget: hundreds of ms for first token on consumer cards, then streaming
- Tip: prompt the model to keep responses short (1-2 sentences) for snappier voice UX; the sketch below bakes this into the prompt
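A minimal streaming call against Ollama's local API. The /api/generate endpoint and its stream of JSON lines are standard Ollama behavior; the model tag and the brevity instruction are just examples:

```python
import json
import requests

def stream_reply(prompt: str, model: str = "llama3.1:8b"):
    """Yield response tokens from a local Ollama server as they stream in."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            # Brevity instruction keeps spoken replies snappy (see the tip above).
            "prompt": f"Answer in 1-2 sentences. {prompt}",
            "stream": True,
        },
        stream=True,
    )
    for line in resp.iter_lines():
        if line:
            yield json.loads(line).get("response", "")

if __name__ == "__main__":
    for token in stream_reply("What causes tides?"):
        print(token, end="", flush=True)
```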
Component 3: Text-to-speech (Piper or XTTS)
- Piper: C++ binary, low first-audio latency (the project advertises sub-100ms; actual figures depend on voice + hardware); voices are clear and usable but audibly synthetic
- Coqui XTTS-v2: ~2 GB VRAM, voice cloning + natural prosody but slower first-audio than Piper (project documentation reports several hundred ms)
- Picking: Piper for low-latency assistants, XTTS for "this should sound like a real person". A minimal Piper invocation follows this list.
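A minimal Piper sketch from Python, mirroring the stdin-to-WAV usage shown in the project README; the voice filename is an example and any downloaded .onnx voice works:

```python
import subprocess

# Example voice; its .onnx.json config must sit alongside the model file.
PIPER_VOICE = "en_US-lessac-medium.onnx"

def speak(text: str, out_wav: str = "reply.wav") -> None:
    """Pipe text into Piper's stdin and write a WAV file."""
    subprocess.run(
        ["piper", "--model", PIPER_VOICE, "--output_file", out_wav],
        input=text, text=True, check=True,
    )

speak("Local text to speech, no cloud required.")
```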
Total latency budget (rule of thumb):
- Piper + 8B + Whisper Medium on a 12GB GPU is the entry-level end-to-end target: community reports cluster around ~1-2 seconds, roughly the threshold below which conversational UX still feels responsive.
- XTTS + larger LLM + Whisper Large pushes toward the 2-3 second range, which is fine for assistant-style turn-taking but starts to feel laggy for true conversational voice.
- Your numbers depend on chunking strategy, prompt length, and how aggressively you stream; measure with your actual prompts before committing to a hardware tier. A tiny timing harness follows this list.
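A small sketch for that measurement, reusing the transcribe() and stream_reply() functions from the sketches above (both are assumptions from this page, not library APIs). First-token latency usually matters more for perceived snappiness than total generation time, so it is timed separately:

```python
import time

def timed(name, fn, *args, **kwargs):
    """Run one pipeline stage and print its wall-clock latency."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{name}: {time.perf_counter() - t0:.2f}s")
    return result

text = timed("stt", transcribe, "utterance.wav")  # transcribe() from the Whisper sketch

t0 = time.perf_counter()
tokens = stream_reply(text)                       # stream_reply() from the Ollama sketch
first = next(tokens)                              # time to first token
print(f"llm first token: {time.perf_counter() - t0:.2f}s")
reply = first + "".join(tokens)
```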
Hardware tiers:
- 8GB VRAM: Whisper Medium + Llama 3.2 3B + Piper. Works, low quality.
- 12GB VRAM: Whisper Large v3 + Llama 3.1 8B + Piper. The sweet spot.
- 24GB VRAM: Whisper Large v3 + Qwen 3 14B + XTTS. Premium quality.
Wiring it up:
The cleanest path is whisper.cpp's bundled stream example (the whisper-stream binary, real-time mic transcription) → HTTP POST to Ollama at localhost:11434 → pipe the response into Piper's stdin. About 50 lines of Python or Node; a sketch follows below. The OpenedAI-Speech project on GitHub does most of the TTS-side wiring out of the box.
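A sketch of that glue script in Python, stitching the pieces above together. Binary names, model files, and the aplay playback call are assumptions for a Linux box; it transcribes a pre-recorded WAV rather than a live mic, so swap in the stream binary for real-time capture:

```python
import json
import subprocess
import requests

# All paths and names below are placeholders; adjust for your setup.
WHISPER_BIN = "./whisper-cli"
WHISPER_MODEL = "models/ggml-medium-q5_0.bin"
PIPER_VOICE = "en_US-lessac-medium.onnx"
OLLAMA_URL = "http://localhost:11434/api/generate"

def transcribe(wav_path: str) -> str:
    """STT: whisper.cpp on a 16 kHz mono WAV."""
    out = subprocess.run(
        [WHISPER_BIN, "-m", WHISPER_MODEL, "-f", wav_path, "--no-timestamps"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

def ask(prompt: str) -> str:
    """LLM: stream a short answer from local Ollama, return the full text."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "llama3.1:8b",
              "prompt": f"Answer in 1-2 sentences. {prompt}",
              "stream": True},
        stream=True,
    )
    return "".join(
        json.loads(line).get("response", "") for line in resp.iter_lines() if line
    )

def speak(text: str, out_wav: str = "reply.wav") -> None:
    """TTS: pipe text into Piper's stdin, then play the WAV."""
    subprocess.run(["piper", "--model", PIPER_VOICE, "--output_file", out_wav],
                   input=text, text=True, check=True)
    subprocess.run(["aplay", out_wav], check=True)  # Linux; use afplay on macOS

if __name__ == "__main__":
    question = transcribe("utterance.wav")
    print("You said:", question)
    answer = ask(question)
    print("Assistant:", answer)
    speak(answer)
```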
Where we got the numbers
- whisper.cpp real-time multipliers: ggerganov/whisper.cpp benchmarks repo.
- Piper sub-100ms first-audio: rhasspy/piper README.
- XTTS-v2 specs: coqui-ai/TTS documentation.
- Latency budgets: community pipeline reports, r/LocalLLaMA and r/LocalLLM, 2026.
Also see
- OpenedAI-Speech: drop-in OpenAI-TTS-API server backed by local Piper/XTTS voices. The cleanest TTS wiring.
- MacWhisper: native macOS app for Whisper transcription. Real-time mic mode.
- Pre-filled rig recipe for an under-$1000 voice-to-voice workstation.
- Editorial verdict on the headline ASR model (Whisper Large v3): what to expect, and when to drop to Medium.