RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Voice AI with Local Models

09. STT→LLM→TTS Pipeline

Chapter 9 of 22 · 15 min
KEY INSIGHT

Pipeline latency equals the sum of stage latencies unless stages overlap through streaming or concurrency.
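As a rough illustration of that arithmetic, the sketch below compares a fully sequential turn with the delay until first audio when the LLM and TTS stream sentence by sentence. The function names and every number are made up for illustration; they are not measurements of any model.

```python
def sequential_latency(stage_seconds):
    """Total delay when each stage waits for the previous one to finish."""
    return sum(stage_seconds.values())


def streaming_first_audio(stt_seconds, llm_first_sentence_seconds, tts_first_sentence_seconds):
    """Delay until the first audio plays when LLM and TTS work sentence by sentence."""
    return stt_seconds + llm_first_sentence_seconds + tts_first_sentence_seconds


# Illustrative numbers only, not measurements.
stages = {"stt": 0.4, "llm": 1.5, "tts": 0.6}
total = sequential_latency(stages)
first_audio = streaming_first_audio(0.4, 0.3, 0.2)
print(f"sequential: {total:.1f}s, streaming first audio: {first_audio:.1f}s")
```

The total amount of compute is unchanged by streaming; what shrinks is the wait before the listener hears anything.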

Connecting speech recognition, language model inference, and speech synthesis requires careful state management and error handling across component boundaries.

Core pipeline structure:

import whisper
import numpy as np
import soundfile as sf
from kokoro_onnx import Kokoro

class VoicePipeline:
    def __init__(self):
        self.stt = whisper.load_model("base")
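        # kokoro-onnx takes the model file and a voices file; the voice name
        # ("af_sarah") is chosen per call in create(). Adjust paths to local files.
        self.tts = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")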
        self.llm = None  # Initialize based on chosen LLM
    
    def process_audio(self, audio_array):
        # STT stage
        result = self.stt.transcribe(audio_array, language="english")
        input_text = result["text"]
        
        if not input_text.strip():
            return None
        
        # LLM stage
        response_text = self.query_llm(input_text)
        
        # TTS stage: kokoro-onnx create() returns (samples, sample_rate)
        audio_response = self.tts.create(response_text, voice="af_sarah")
        
        return audio_response
    
    def query_llm(self, prompt):
        raise NotImplementedError("Configure LLM backend")
    
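    def process_audio_chunk(self, audio_path):
        # Load the file as float32, the sample format Whisper expects
        audio, sr = sf.read(audio_path, dtype=np.float32)
        if audio.ndim > 1:
            # soundfile returns (frames, channels); downmix to mono
            audio = audio.mean(axis=1)
        if sr != 16000:
            # Whisper expects 16 kHz input; librosa is imported only when needed
            import librosa
            audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
        return self.process_audio(audio)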

Error handling at each boundary prevents cascade failures. Malformed audio should fail gracefully at STT. LLM failures should return a generic response rather than crashing. TTS failures on long text should truncate rather than refuse output.
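The boundary rules above can be sketched as a wrapper around the three stages. This is a minimal sketch, not the chapter's implementation: `safe_process`, `FALLBACK_REPLY`, and `MAX_TTS_CHARS` are hypothetical names, and the stages are passed in as plain callables so the logic can be tried without loading any model.

```python
import logging

logger = logging.getLogger("voice_pipeline")

FALLBACK_REPLY = "Sorry, I could not process that."  # hypothetical wording
MAX_TTS_CHARS = 500  # hypothetical limit; tune for the TTS model in use


def safe_process(audio, stt, llm, tts):
    """Run STT -> LLM -> TTS, containing failures at each boundary.

    stt, llm and tts are plain callables supplied by the caller.
    Returns the TTS output, or None when nothing can be spoken.
    """
    # STT boundary: malformed audio ends the turn quietly.
    try:
        text = stt(audio).strip()
    except Exception:
        logger.exception("STT failed")
        return None
    if not text:
        return None

    # LLM boundary: fall back to a generic reply instead of crashing.
    try:
        reply = llm(text)
    except Exception:
        logger.exception("LLM failed")
        reply = FALLBACK_REPLY

    # TTS boundary: truncate long text at a word break instead of refusing output.
    if len(reply) > MAX_TTS_CHARS:
        reply = reply[:MAX_TTS_CHARS].rsplit(" ", 1)[0]

    try:
        return tts(reply)
    except Exception:
        logger.exception("TTS failed")
        return None
```

With the `VoicePipeline` class, the callables could be thin lambdas around `stt.transcribe(...)["text"]`, `query_llm`, and `tts.create`. Catching bare `Exception` is a deliberate simplification here; a real deployment would likely narrow it per backend.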

Timing coordination matters for perceived performance. Start STT as soon as voice activity detection (VAD) triggers. Begin LLM inference as soon as transcription completes. Generate TTS audio incrementally so playback can start before the full response is synthesized.
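One way to feed TTS incrementally is to cut the LLM's token stream into sentences and synthesize each one as it completes. The sketch below assumes the LLM backend can yield text fragments; `sentences_from_tokens` is a hypothetical helper, and the punctuation-based splitting rule is a simplification.

```python
import re

# Sentence end: ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(token_stream):
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in token_stream:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()


# Stand-in for a streaming LLM response.
tokens = ["Hello", " there.", " How", " are", " you?", " Fine"]
for sentence in sentences_from_tokens(tokens):
    print(sentence)  # each sentence could be handed to TTS here
```

A sentence is released only once the whitespace after its punctuation arrives, so output trails the stream by one token. Abbreviations such as "Dr." would split early under this rule; a production splitter would need more care.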

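Concurrency lets stages overlap, but only where the work is independent. Within one utterance each stage needs the previous stage's output, so the calls below still run in order. The shared executor pays off when a new utterance can enter STT while an earlier one is still in the LLM or TTS stage:

from concurrent.futures import ThreadPoolExecutor

# One worker per stage
executor = ThreadPoolExecutor(max_workers=3)

def pipeline_streaming(pipeline, audio_array):
    # pipeline: a VoicePipeline instance
    stt_future = executor.submit(pipeline.stt.transcribe, audio_array)

    # result() blocks until the stage has finished
    transcription = stt_future.result()

    llm_future = executor.submit(pipeline.query_llm, transcription["text"])

    response = llm_future.result()

    # create() requires a voice name
    tts_future = executor.submit(pipeline.tts.create, response, voice="af_sarah")

    audio_response = tts_future.result()

    return audio_response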
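The function above waits on each future before submitting the next, so a single call gains nothing from the thread pool. A sketch of how stages could overlap across several utterances is shown below, using one thread per stage connected by queues. `stage_worker` and `run_overlapped` are hypothetical names, and the stages are plain callables standing in for the real models.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a worker to shut down


def stage_worker(func, inbox, outbox):
    """Apply func to each item from inbox and pass the result to outbox."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))


def run_overlapped(utterances, stt, llm, tts):
    """Push utterances through three stage threads; results keep input order."""
    q_in, q_text, q_reply, q_out = (queue.Queue() for _ in range(4))
    workers = [
        threading.Thread(target=stage_worker, args=(stt, q_in, q_text)),
        threading.Thread(target=stage_worker, args=(llm, q_text, q_reply)),
        threading.Thread(target=stage_worker, args=(tts, q_reply, q_out)),
    ]
    for worker in workers:
        worker.start()
    for utterance in utterances:
        q_in.put(utterance)
    q_in.put(_STOP)

    results = []
    while True:
        item = q_out.get()
        if item is _STOP:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results
```

While the LLM thread works on the first utterance, the STT thread can already transcribe the second. Whether this yields real speedup depends on the backends releasing the GIL during inference and not competing for the same GPU. The sketch also omits error handling: an exception inside a stage would stop that worker and stall the queue.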

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
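A small helper can capture most of that record automatically. This is a sketch using only the standard library; `environment_report` is a hypothetical name, and the distribution names passed to it are assumptions that should match whatever is installed locally.

```python
import json
import platform
import sys
from importlib import metadata


def environment_report(packages):
    """Collect interpreter, platform and package versions for a run log."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


# Distribution names are assumptions; adjust to what is installed locally.
report = environment_report(["openai-whisper", "kokoro-onnx", "soundfile", "numpy"])
print(json.dumps(report, indent=2))
```

Saving this output next to the observed result makes it easier to tell later whether a difference came from the code or from the environment.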

EXERCISE

Implement the VoicePipeline class with a placeholder LLM. Process a test audio file and measure the time spent in each stage. Identify the bottleneck. (15 minutes)
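One way to take the per-stage measurements is a small timing context manager. The sketch below is a starting point rather than a solution: `timed` and `timings` are hypothetical names, and the `time.sleep` calls are stand-ins for the real stage calls.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(stage):
    """Record wall-clock seconds spent inside the with-block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start


# Stand-in work; replace each sleep with the real stage call.
with timed("stt"):
    time.sleep(0.01)
with timed("llm"):
    time.sleep(0.03)
with timed("tts"):
    time.sleep(0.02)

bottleneck = max(timings, key=timings.get)
for stage, seconds in timings.items():
    print(f"{stage}: {seconds * 1000:.1f} ms")
print("bottleneck:", bottleneck)
```

The first call to a model often includes one-time warm-up cost, so timing a second run as well gives a fairer picture of steady-state latency.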