Speech Synthesis
Speech synthesis, also known as text-to-speech (TTS), converts written text into spoken audio. In local AI, operators run TTS engines such as Piper or Coqui TTS on their own hardware. These engines generate audio waveforms from text input, typically using neural network architectures such as Tacotron or VITS. Output quality and speed depend on model size and available compute: smaller models run faster on CPU, while larger models benefit from GPU acceleration. Operators choose between real-time inference (audio generated at least as fast as it plays back) and batch processing, which pre-renders audio ahead of time.
Deeper dive
Modern neural TTS systems typically consist of a text encoder, an acoustic model, and a vocoder. The text encoder converts characters or phonemes into linguistic features. The acoustic model (e.g., Tacotron 2, FastSpeech) predicts a mel-spectrogram from those features. The vocoder (e.g., WaveGlow, HiFi-GAN) converts the mel-spectrogram into a raw audio waveform. End-to-end models like VITS combine these steps into a single network. Operators can choose pre-trained models optimized for speed (e.g., Piper, which runs its voices through ONNX Runtime) or for quality (e.g., a larger VITS model in Coqui TTS). Latency varies with model and hardware: a small Piper model may synthesize 1 second of audio in about 0.1 seconds on CPU (10x faster than playback), while a large VITS model on GPU might need about 0.05 seconds per second of audio (20x faster than playback). VRAM usage is modest (under 2 GB for most models), which makes TTS accessible on lower-end hardware.
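The data flow through the three stages can be sketched with placeholder functions. This is a toy illustration that only tracks shapes, not a real synthesizer: the sample rate, hop length, mel-bin count, and fixed frames-per-symbol value are assumed, commonly seen settings, and all function names are hypothetical.

```python
# Toy stand-ins for the three stages; they track data shapes, not real audio.
SAMPLE_RATE = 22050     # assumed output sample rate in Hz
HOP_LENGTH = 256        # assumed audio samples produced per mel frame
N_MELS = 80             # assumed number of mel bins per frame
FRAMES_PER_SYMBOL = 5   # hypothetical fixed duration for every input symbol


def text_encoder(text: str) -> list[int]:
    """Map each character to an integer ID (real systems often use phonemes)."""
    return [ord(ch) for ch in text.lower()]


def acoustic_model(symbol_ids: list[int]) -> list[list[float]]:
    """Return a placeholder mel-spectrogram: one row of N_MELS values per frame."""
    n_frames = len(symbol_ids) * FRAMES_PER_SYMBOL
    return [[0.0] * N_MELS for _ in range(n_frames)]


def vocoder(mel: list[list[float]]) -> list[float]:
    """Return a placeholder waveform with HOP_LENGTH samples per mel frame."""
    return [0.0] * (len(mel) * HOP_LENGTH)


def synthesize(text: str) -> list[float]:
    """Chain the stages: text -> symbol IDs -> mel frames -> waveform samples."""
    return vocoder(acoustic_model(text_encoder(text)))


if __name__ == "__main__":
    samples = synthesize("Hello")
    print(f"{len(samples)} samples, {len(samples) / SAMPLE_RATE:.2f} s of audio")
```

An end-to-end model such as VITS would replace the three separate functions with a single network that maps symbol IDs directly to waveform samples.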
Practical example
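An operator with an RTX 3060 (12 GB VRAM) runs Piper to generate speech from a text file: piper --model en_US-lessac-medium.onnx --output_file output.wav < input.txt (Piper reads the text to speak from standard input; input.txt is a placeholder name). Piper runs on the CPU unless GPU execution is enabled, which requires the --cuda flag and a CUDA-enabled ONNX Runtime. On the GPU, the model loads in ~200 MB VRAM and produces audio at ~2x real-time. For higher quality, they switch to Coqui TTS with a VITS model: tts --text "Hello" --model_name tts_models/en/ljspeech/vits --out_path hello.wav, which uses ~1 GB VRAM and runs at ~0.8x real-time on the same GPU.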
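Throughput figures like those above are easiest to check by timing a synthesis run and comparing the elapsed time with the length of the resulting file. A minimal sketch, assuming the engine writes a PCM WAV file (which Python's standard wave module can read); the function names are illustrative and not part of Piper or Coqui TTS.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the playback length of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Synthesis time divided by audio length; below 1.0 is faster than playback."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds
```

To use it, time the synthesis command (for example with time.perf_counter() around subprocess.run), then pass the elapsed seconds and wav_duration_seconds("output.wav") to real_time_factor. A result of 0.5 means one second of audio took half a second to generate.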
Workflow example
In a local AI assistant workflow, the operator uses Ollama to generate a text response, then pipes it to a TTS engine. For example: ollama run llama3.2:3b "Tell me a joke" | piper --model en_US-lessac-medium.onnx --output_file joke.wav. Because synthesis starts only after the LLM has finished its response, the TTS step adds its full latency on top of generation time. To reduce the delay, operators can keep the TTS model loaded in memory between requests, or use streaming TTS (e.g., the streaming inference API in Coqui TTS, where the chosen model supports it) to start playback before the full audio is generated.
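One way to start playback earlier is to split the LLM's output into sentences as it streams in and hand each finished sentence to the TTS engine right away. A minimal sketch of that buffering step follows; the function name and the punctuation heuristic are assumptions for illustration, and the chunks could come from any streaming text source.

```python
import re
from typing import Iterable, Iterator

# Sentence end: '.', '!' or '?' followed by whitespace. This simple heuristic
# also splits after abbreviations such as "Dr. ", which may be acceptable for speech.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be growing, so keep it in the buffer.
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence can be sent to the TTS engine while the LLM is still generating the next one, so the listener waits for roughly one sentence instead of the whole response.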
Reviewed by Fredoline Eruo. See our editorial policy.