How do I build a fully-local voice-to-voice pipeline?

Reviewed May 15, 2026 · 2 min read
voice · whisper · piper · tts · stt · real-time

The answer


Three components, all local, all real-time-capable on a 12GB+ GPU.

The pipeline:

[microphone] → [STT: whisper.cpp] → [LLM: Ollama] → [TTS: Piper] → [speaker]

Component 1: Speech-to-text (Whisper)

  • whisper.cpp with Whisper Large v3 Q5_0: ~1.5 GB VRAM. ggerganov's published benchmarks show consumer mid-range GPUs running this comfortably faster than real time; measure on your hardware before sizing.
  • whisper.cpp with Whisper Medium Q5_0: ~770 MB VRAM, materially faster than Large at a quality cost that matters less for clean-mic input than for noisy audio.
  • Buzz (cross-platform Qt app) or MacWhisper (macOS native) for one-click setup
  • Latency budget: under a second for short utterances on a 12GB GPU is a realistic target; your numbers will depend on chunking strategy and audio length. A minimal invocation sketch follows this list.
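
A minimal invocation sketch, shelling out to the whisper.cpp CLI from Python. The binary name (main in older builds, whisper-cli in newer ones) and the model path are assumptions that vary by install; -m selects the model, -f the input WAV, and -nt suppresses timestamps.

    import subprocess

    WHISPER_BIN = "./main"                         # "./build/bin/whisper-cli" on newer builds
    WHISPER_MODEL = "models/ggml-medium-q5_0.bin"  # illustrative local path

    def transcribe(wav_path: str) -> str:
        # whisper.cpp expects 16-bit 16 kHz mono WAV input
        result = subprocess.run(
            [WHISPER_BIN, "-m", WHISPER_MODEL, "-f", wav_path, "-nt"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    print(transcribe("utterance.wav"))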

Component 2: LLM (Ollama recommended)

  • Llama 3.1 8B Q4_K_M: ~5 GB VRAM, fast enough on 8GB+ GPUs that streaming response keeps up with conversational pacing
  • Latency budget: hundreds of ms for first token on consumer cards, then streaming
  • Tip: prompt the model to keep responses short (1-2 sentences) for snappier voice UX; the streaming sketch after this list shows one way to wire that in
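
To make the streaming concrete: a minimal sketch against Ollama's /api/generate endpoint, which emits one JSON object per line when stream is true. The model tag and the keep-it-short system prompt are assumptions; swap in whatever you have pulled.

    import json
    import requests

    def stream_reply(prompt: str, model: str = "llama3.1:8b"):
        with requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": model,
                "prompt": prompt,
                # Short replies keep the voice UX snappy (see the tip above)
                "system": "Answer in one or two short sentences.",
                "stream": True,
            },
            stream=True,
        ) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines():
                if line:
                    # Each line is a JSON object carrying a "response" token field
                    yield json.loads(line).get("response", "")

    for token in stream_reply("Remind me what day it is."):
        print(token, end="", flush=True)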

Component 3: Text-to-speech (Piper or XTTS)

  • Piper: C++ binary with low first-audio latency (the project advertises sub-100 ms, though this varies with voice and hardware); voices are serviceable but noticeably synthetic
  • Coqui XTTS-v2: ~2 GB VRAM, voice cloning + natural prosody but slower first-audio than Piper (project documentation reports several hundred ms)
  • Picking: Piper for low-latency assistants, XTTS for "this should sound like a real person"; a stdin-piping sketch follows this list
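
Feeding Piper is just a stdin pipe. A minimal sketch, assuming the piper binary is on your PATH and a voice model is downloaded (the path below is illustrative); --output-raw streams 16-bit mono PCM to stdout at a sample rate fixed by the voice.

    import subprocess

    PIPER_VOICE = "en_US-lessac-medium.onnx"  # illustrative voice path

    def speak(text: str) -> bytes:
        # Piper reads text on stdin and writes raw PCM with --output-raw
        result = subprocess.run(
            ["piper", "--model", PIPER_VOICE, "--output-raw"],
            input=text.encode("utf-8"),
            capture_output=True, check=True,
        )
        return result.stdout

    pcm = speak("Hello from a fully local pipeline.")

To hear it, pipe the PCM into a player whose rate matches the voice, e.g. aplay -r 22050 -f S16_LE -t raw - on Linux.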

Total latency budget (rule of thumb):

  • Piper + 8B + Whisper Medium on a 12GB GPU is the realistic entry point: operators routinely report ~1-2 seconds end-to-end, roughly the threshold at which conversational turn-taking still feels responsive.
  • XTTS + larger LLM + Whisper Large pushes toward the 2-3 second range, which is fine for assistant-style turn-taking but starts to feel laggy for true conversational voice.
  • Your numbers depend on chunking strategy, prompt length, and how aggressively you stream; measure with your actual prompts before committing to a hardware tier. A minimal timing harness follows this list.
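
A minimal timing harness, reusing the transcribe, stream_reply, and speak sketches from the component sections above, so you can see where your budget actually goes per turn:

    import time

    def timed_turn(wav_path: str) -> None:
        t0 = time.perf_counter()
        text = transcribe(wav_path)            # STT sketch above
        t1 = time.perf_counter()

        first_token, parts = None, []
        for token in stream_reply(text):       # LLM sketch above
            if first_token is None:
                first_token = time.perf_counter()
            parts.append(token)
        t2 = time.perf_counter()

        speak("".join(parts))                  # TTS sketch above
        t3 = time.perf_counter()

        print(f"STT {t1 - t0:.2f}s | first token {(first_token or t2) - t1:.2f}s | "
              f"LLM total {t2 - t1:.2f}s | TTS {t3 - t2:.2f}s")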

Hardware tiers:

  • 8GB VRAM: Whisper Medium + Llama 3.2 3B + Piper. Works, but with audible quality compromises.
  • 12GB VRAM: Whisper Large v3 + Llama 3.1 8B + Piper. The sweet spot.
  • 24GB VRAM: Whisper Large v3 + Qwen 3 14B + XTTS. Premium quality.

Wiring it up: The cleanest path is whisper.cpp's streaming example (built as stream, or whisper-stream in recent builds) → HTTP POST to Ollama at localhost:11434 → pipe the response into Piper's stdin. About 50 lines of Python or Node; a sketch follows below. The OpenedAI-Speech project on GitHub does most of the TTS-side wiring out of the box.
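
A sketch of that glue, under loudly-stated assumptions: Linux with ALSA (arecord/aplay) for capture and playback, whisper.cpp built locally, Ollama serving llama3.1:8b, and a Piper voice on disk. Binary names, model paths, and sample rates are illustrative, and it records fixed-length turns rather than doing true streaming STT.

    import json
    import subprocess
    import requests

    WHISPER_BIN = "./main"                         # or "./build/bin/whisper-cli"
    WHISPER_MODEL = "models/ggml-medium-q5_0.bin"  # illustrative paths throughout
    PIPER_VOICE = "en_US-lessac-medium.onnx"

    def record(seconds: int = 5, path: str = "turn.wav") -> str:
        # 16-bit 16 kHz mono WAV is what whisper.cpp expects
        subprocess.run(["arecord", "-f", "S16_LE", "-r", "16000", "-c", "1",
                        "-d", str(seconds), path], check=True)
        return path

    def transcribe(wav_path: str) -> str:
        out = subprocess.run(
            [WHISPER_BIN, "-m", WHISPER_MODEL, "-f", wav_path, "-nt"],
            capture_output=True, text=True, check=True)
        return out.stdout.strip()

    def llm_sentences(prompt: str):
        # Stream from Ollama, yielding sentence-sized chunks so TTS can
        # start speaking before the full reply has finished generating
        buf = ""
        with requests.post("http://localhost:11434/api/generate",
                           json={"model": "llama3.1:8b", "prompt": prompt,
                                 "system": "Answer in one or two short sentences.",
                                 "stream": True},
                           stream=True) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines():
                if not line:
                    continue
                buf += json.loads(line).get("response", "")
                while any(p in buf for p in ".!?"):
                    idx = min(i for i in (buf.find(p) for p in ".!?") if i != -1)
                    yield buf[:idx + 1]
                    buf = buf[idx + 1:]
        if buf.strip():
            yield buf

    def speak(text: str) -> None:
        piper = subprocess.run(["piper", "--model", PIPER_VOICE, "--output-raw"],
                               input=text.encode("utf-8"), capture_output=True,
                               check=True)
        # 22050 Hz matches the lessac-medium voice; adjust for other voices
        subprocess.run(["aplay", "-r", "22050", "-f", "S16_LE", "-t", "raw", "-"],
                       input=piper.stdout, check=True)

    if __name__ == "__main__":
        heard = transcribe(record())
        print(f"heard: {heard}")
        for sentence in llm_sentences(heard):
            speak(sentence)

Yielding sentence-sized chunks to Piper instead of waiting for the full reply is what keeps perceived latency near the first-token number rather than the full-generation number.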

Where we got the numbers

whisper.cpp realtime multipliers: ggerganov/whisper.cpp benchmarks. Piper sub-100 ms first-audio: rhasspy/piper README. XTTS-v2 specs: coqui-ai/TTS documentation. Latency budgets: community pipeline reports on r/LocalLLaMA and r/LocalLLM, 2026.

Found this via a forum search? Bookmark the URL — we update these pages as new data lands. Have a question that should live here? Open a GitHub issue.