Real-Time Architecture — Voice AI with Local Models (Chapter 10)

Real-time requirements fundamentally shape pipeline architecture. Latency targets for conversational AI typically fall under 2 seconds for perceived responsiveness. Achieving this requires overlapping stages and optimizing hot paths.

The critical path analysis:

User speaks → VAD detects → Audio recorded →
STT processing → LLM inference → TTS generation → Audio output

For a 2-second response, each stage must complete within its budget:

VAD detection: ~50ms continuous
Audio recording: ~500ms overlapping
STT processing: ~1000ms
LLM inference: ~500ms
TTS generation: ~500ms
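Total potential: ~2 seconds after the user stops speaking (VAD and recording overlap with speech; run strictly in sequence, the five stages would sum to ~2.55 seconds)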
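The arithmetic behind these budget figures can be checked with a short sketch. It assumes that VAD and recording run while the user is still speaking and therefore add nothing to the delay measured from the end of speech. The names `BUDGET_MS` and `OVERLAPPED_WITH_SPEECH` are illustrative and do not come from any library.

```python
# Stage budgets in milliseconds, copied from the list above.
BUDGET_MS = {
    "vad": 50,
    "recording": 500,
    "stt": 1000,
    "llm": 500,
    "tts": 500,
}

# Assumption: these stages run while the user is still talking,
# so they do not count toward the post-speech delay.
OVERLAPPED_WITH_SPEECH = {"vad", "recording"}


def sequential_total_ms(budget):
    """Latency if every stage waits for the previous one to finish."""
    return sum(budget.values())


def post_speech_total_ms(budget, overlapped):
    """Latency counted only for stages that run after speech ends."""
    return sum(ms for stage, ms in budget.items() if stage not in overlapped)


if __name__ == "__main__":
    print("sequential:", sequential_total_ms(BUDGET_MS), "ms")
    print("after speech:", post_speech_total_ms(BUDGET_MS, OVERLAPPED_WITH_SPEECH), "ms")
```

Replacing the numbers with measurements from your own hardware shows quickly which stage dominates the budget.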

Overlapping the stages as streams avoids paying each stage's latency in sequence. VAD and STT operate on rolling audio buffers. The LLM starts as soon as the first transcription tokens appear. TTS streams partial audio as generation proceeds.
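One way to overlap stages is to run each in its own thread and connect them with queues, so that every stage forwards results as soon as it has them. The following is a minimal sketch under that assumption. The functions `fake_stt`, `fake_llm`, and `fake_tts` are hypothetical stand-ins for real models, and `build_pipeline` is an illustrative helper, not part of any library.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def run_stage(fn, inbox, outbox):
    """Apply fn to each item as it arrives and forward results immediately."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        for out in fn(item):
            outbox.put(out)


def build_pipeline(stages):
    """Chain stages with queues; returns (input queue, output queue, threads)."""
    first = queue.Queue()
    inbox = first
    threads = []
    for fn in stages:
        outbox = queue.Queue()
        t = threading.Thread(target=run_stage, args=(fn, inbox, outbox), daemon=True)
        t.start()
        threads.append(t)
        inbox = outbox
    return first, inbox, threads


# Hypothetical stand-ins for the real STT, LLM, and TTS stages.
def fake_stt(audio_chunk):
    yield f"word{audio_chunk}"


def fake_llm(word):
    yield word.upper()


def fake_tts(token):
    yield f"<audio:{token}>"


def run_demo(n_chunks=3):
    source, sink, threads = build_pipeline([fake_stt, fake_llm, fake_tts])
    for i in range(n_chunks):
        source.put(i)  # downstream stages start before all input is queued
    source.put(_DONE)
    results = []
    while True:
        item = sink.get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

Because each stage is a single thread reading from a FIFO queue, output order matches input order. A real pipeline would also need cancellation, for example when the user interrupts the response.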

Streaming TTS implementation:

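import kokoro

SAMPLE_RATE = 24000  # samples per second; Kokoro produces 24 kHz audio


def streaming_tts(text, chunk_duration=0.5):
    """Yield synthesized audio in chunks of roughly chunk_duration seconds."""
    chunk_samples = int(chunk_duration * SAMPLE_RATE)

    # Initialize synthesis. The synthesis_init / request_next / finalize
    # interface is assumed here; adapt it to the API of your installed
    # Kokoro release.
    synthesis = kokoro.synthesis_init(text)

    while not synthesis.done:
        chunk = synthesis.request_next(chunk_samples)
        if chunk is None:
            break  # nothing more to hand out before finalization
        yield chunk

    # Flush remaining samples
    final_chunk = synthesis.finalize()
    if final_chunk is not None:
        yield final_chunk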

Buffer management prevents memory from accumulating during long conversations. Circular buffers discard audio older than the STT processing window. When the response history exceeds the model's context limit, drop the oldest turns first to prevent LLM context overflow.
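Both ideas can be sketched with the standard library. The example below assumes a fixed-length audio window and uses a whitespace word count as a rough stand-in for token counting. `RollingAudioBuffer` and `truncate_history` are illustrative names; a real system would count tokens with the model's own tokenizer.

```python
from collections import deque


class RollingAudioBuffer:
    """Keep only the most recent window of audio samples."""

    def __init__(self, sample_rate=16000, window_seconds=30):
        # deque with maxlen silently discards the oldest samples when full.
        self._samples = deque(maxlen=int(sample_rate * window_seconds))

    def extend(self, samples):
        self._samples.extend(samples)

    def snapshot(self):
        """Return the current window as a list, oldest sample first."""
        return list(self._samples)

    def __len__(self):
        return len(self._samples)


def truncate_history(turns, max_tokens, count_tokens=lambda t: len(t.split())):
    """Keep the most recent turns whose estimated token total fits max_tokens."""
    kept = []
    total = 0
    for turn in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(turn)
        if total + cost > max_tokens:
            break
        kept.append(turn)
        total += cost
    kept.reverse()  # restore chronological order
    return kept
```

The default sample rate and window length are placeholders; set them to match the input format and processing window of the STT model in use.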

Latency hiding techniques:

Prefetch likely responses based on conversation context
Begin TTS with partial LLM output tokens
Use smaller, faster models as fallback during load
Queue requests during peak load rather than dropping them
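Starting TTS on partial LLM output usually means cutting the token stream at sentence boundaries, so the synthesizer receives complete phrases. The sketch below assumes that tokens arrive as plain strings and that a sentence ends at `.`, `!`, or `?` followed by whitespace. The function name `sentences_from_tokens` is illustrative.

```python
import re

# A sentence is considered closed once its punctuation is followed by whitespace.
_SENTENCE_END = re.compile(r"[.!?]\s")


def sentences_from_tokens(tokens):
    """Yield complete sentences as soon as the streamed tokens close one."""
    pending = ""
    for token in tokens:
        pending += token
        while True:
            match = _SENTENCE_END.search(pending)
            if match is None:
                break
            end = match.end()
            yield pending[:end].strip()
            pending = pending[end:]
    # Whatever is left when the stream ends is spoken as the last fragment.
    if pending.strip():
        yield pending.strip()


if __name__ == "__main__":
    stream = ["Hello", " there", ".", " How", " are", " you", "?", " Fine"]
    for sentence in sentences_from_tokens(stream):
        print(sentence)  # each sentence could be handed to TTS here
```

This simple rule splits on abbreviations such as "e.g. " as well, so a production system may need a more careful boundary detector.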

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.