How do I build a fully-local voice-to-voice pipeline?

Reviewed May 15, 2026 · 2 min read
voice · whisper · piper · tts · stt · real-time

The answer


Three components, all local, all real-time-capable on a 12GB+ GPU.

The pipeline:

[microphone] → [STT: whisper.cpp] → [LLM: Ollama] → [TTS: Piper] → [speaker]

Component 1: Speech-to-text (Whisper)

  • whisper.cpp with Whisper Large v3 Q5_0: ~1.5 GB VRAM. ggerganov's published benchmarks show consumer mid-range GPUs running this comfortably faster than real time; measure on your hardware before sizing.
  • whisper.cpp with Whisper Medium Q5_0: ~770 MB VRAM, materially faster than Large at a quality cost that matters less for clean-mic input than for noisy audio.
  • Buzz (cross-platform Qt app) or MacWhisper (macOS native) for one-click setup
  • Latency budget: under a second for short utterances on a 12GB GPU is a realistic target; your numbers will depend on chunking strategy and audio length. A minimal invocation sketch follows this list.
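
A minimal invocation sketch, shelling out to the whisper.cpp CLI from Python. The binary name (main in older builds, whisper-cli in newer ones) and the model path are assumptions that vary by install; -m selects the model, -f the input WAV, and -nt suppresses timestamps.

    import subprocess

    WHISPER_BIN = "./main"                         # "./build/bin/whisper-cli" on newer builds
    WHISPER_MODEL = "models/ggml-medium-q5_0.bin"  # illustrative local path

    def transcribe(wav_path: str) -> str:
        # whisper.cpp expects 16-bit 16 kHz mono WAV input
        result = subprocess.run(
            [WHISPER_BIN, "-m", WHISPER_MODEL, "-f", wav_path, "-nt"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    print(transcribe("utterance.wav"))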

Component 2: LLM (Ollama recommended)

  • Llama 3.1 8B Q4_K_M: ~5 GB VRAM, fast enough on 8GB+ GPUs that streaming response keeps up with conversational pacing
  • Latency budget: hundreds of ms for first token on consumer cards, then streaming
  • Tip: prompt the model to keep responses short (1-2 sentences) for snappier voice UX; the streaming sketch after this list shows one way to wire that in
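
To make the streaming concrete: a minimal sketch against Ollama's /api/generate endpoint, which emits one JSON object per line when stream is true. The model tag and the keep-it-short system prompt are assumptions; swap in whatever you have pulled.

    import json
    import requests

    def stream_reply(prompt: str, model: str = "llama3.1:8b"):
        with requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": model,
                "prompt": prompt,
                # Short replies keep the voice UX snappy (see the tip above)
                "system": "Answer in one or two short sentences.",
                "stream": True,
            },
            stream=True,
        ) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines():
                if line:
                    # Each line is a JSON object carrying a "response" token field
                    yield json.loads(line).get("response", "")

    for token in stream_reply("Remind me what day it is."):
        print(token, end="", flush=True)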

Component 3: Text-to-speech (Piper or XTTS)

  • Piper: C++ binary with low first-audio latency (the project advertises sub-100 ms, though this varies with voice and hardware); voices are serviceable but noticeably synthetic
  • Coqui XTTS-v2: ~2 GB VRAM, voice cloning + natural prosody but slower first-audio than Piper (project documentation reports several hundred ms)
  • Picking: Piper for low-latency assistants, XTTS for "this should sound like a real person"; a stdin-piping sketch follows this list
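
Feeding Piper is just a stdin pipe. A minimal sketch, assuming the piper binary is on your PATH and a voice model is downloaded (the path below is illustrative); --output-raw streams 16-bit mono PCM to stdout at a sample rate fixed by the voice.

    import subprocess

    PIPER_VOICE = "en_US-lessac-medium.onnx"  # illustrative voice path

    def speak(text: str) -> bytes:
        # Piper reads text on stdin and writes raw PCM with --output-raw
        result = subprocess.run(
            ["piper", "--model", PIPER_VOICE, "--output-raw"],
            input=text.encode("utf-8"),
            capture_output=True, check=True,
        )
        return result.stdout

    pcm = speak("Hello from a fully local pipeline.")

To hear it, pipe the PCM into a player whose rate matches the voice, e.g. aplay -r 22050 -f S16_LE -t raw - on Linux.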

Total latency budget (rule of thumb):

  • Piper + 8B + Whisper Medium on a 12GB GPU is the realistic entry point: operators routinely report ~1-2 seconds end-to-end, roughly the threshold at which conversational turn-taking still feels responsive.
  • XTTS + larger LLM + Whisper Large pushes toward the 2-3 second range, which is fine for assistant-style turn-taking but starts to feel laggy for true conversational voice.
  • Your numbers depend on chunking strategy, prompt length, and how aggressively you stream; measure with your actual prompts before committing to a hardware tier. A minimal timing harness follows this list.
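
A minimal timing harness, reusing the transcribe, stream_reply, and speak sketches from the component sections above, so you can see where your budget actually goes per turn:

    import time

    def timed_turn(wav_path: str) -> None:
        t0 = time.perf_counter()
        text = transcribe(wav_path)            # STT sketch above
        t1 = time.perf_counter()

        first_token, parts = None, []
        for token in stream_reply(text):       # LLM sketch above
            if first_token is None:
                first_token = time.perf_counter()
            parts.append(token)
        t2 = time.perf_counter()

        speak("".join(parts))                  # TTS sketch above
        t3 = time.perf_counter()

        print(f"STT {t1 - t0:.2f}s | first token {(first_token or t2) - t1:.2f}s | "
              f"LLM total {t2 - t1:.2f}s | TTS {t3 - t2:.2f}s")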

Hardware tiers:

  • 8GB VRAM: Whisper Medium + Llama 3.2 3B + Piper. Works, but with audible quality compromises.
  • 12GB VRAM: Whisper Large v3 + Llama 3.1 8B + Piper. The sweet spot.
  • 24GB VRAM: Whisper Large v3 + Qwen 3 14B + XTTS. Premium quality.

Wiring it up: The cleanest path is whisper.cpp's streaming example (built as stream, or whisper-stream in recent builds) → HTTP POST to Ollama at localhost:11434 → pipe the response into Piper's stdin. About 50 lines of Python or Node; a sketch follows below. The OpenedAI-Speech project on GitHub does most of the TTS-side wiring out of the box.
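
A sketch of that glue, under loudly-stated assumptions: Linux with ALSA (arecord/aplay) for capture and playback, whisper.cpp built locally, Ollama serving llama3.1:8b, and a Piper voice on disk. Binary names, model paths, and sample rates are illustrative, and it records fixed-length turns rather than doing true streaming STT.

    import json
    import subprocess
    import requests

    WHISPER_BIN = "./main"                         # or "./build/bin/whisper-cli"
    WHISPER_MODEL = "models/ggml-medium-q5_0.bin"  # illustrative paths throughout
    PIPER_VOICE = "en_US-lessac-medium.onnx"

    def record(seconds: int = 5, path: str = "turn.wav") -> str:
        # 16-bit 16 kHz mono WAV is what whisper.cpp expects
        subprocess.run(["arecord", "-f", "S16_LE", "-r", "16000", "-c", "1",
                        "-d", str(seconds), path], check=True)
        return path

    def transcribe(wav_path: str) -> str:
        out = subprocess.run(
            [WHISPER_BIN, "-m", WHISPER_MODEL, "-f", wav_path, "-nt"],
            capture_output=True, text=True, check=True)
        return out.stdout.strip()

    def llm_sentences(prompt: str):
        # Stream from Ollama, yielding sentence-sized chunks so TTS can
        # start speaking before the full reply has finished generating
        buf = ""
        with requests.post("http://localhost:11434/api/generate",
                           json={"model": "llama3.1:8b", "prompt": prompt,
                                 "system": "Answer in one or two short sentences.",
                                 "stream": True},
                           stream=True) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines():
                if not line:
                    continue
                buf += json.loads(line).get("response", "")
                while any(p in buf for p in ".!?"):
                    idx = min(i for i in (buf.find(p) for p in ".!?") if i != -1)
                    yield buf[:idx + 1]
                    buf = buf[idx + 1:]
        if buf.strip():
            yield buf

    def speak(text: str) -> None:
        piper = subprocess.run(["piper", "--model", PIPER_VOICE, "--output-raw"],
                               input=text.encode("utf-8"), capture_output=True,
                               check=True)
        # 22050 Hz matches the lessac-medium voice; adjust for other voices
        subprocess.run(["aplay", "-r", "22050", "-f", "S16_LE", "-t", "raw", "-"],
                       input=piper.stdout, check=True)

    if __name__ == "__main__":
        heard = transcribe(record())
        print(f"heard: {heard}")
        for sentence in llm_sentences(heard):
            speak(sentence)

Yielding sentence-sized chunks to Piper instead of waiting for the full reply is what keeps perceived latency near the first-token number rather than the full-generation number.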

Where we got the numbers

whisper.cpp realtime multipliers: ggerganov/whisper.cpp benchmarks. Piper sub-100 ms first-audio: rhasspy/piper README. XTTS-v2 specs: coqui-ai/TTS documentation. Latency budgets: community pipeline reports on r/LocalLLaMA and r/LocalLLM, 2026.

Found this via a forum search? Bookmark the URL — we update these pages as new data lands. Have a question that should live here? Open a GitHub issue.