Text-to-Speech (TTS)

Text-to-Speech (TTS) converts written text into spoken audio using neural models. Operators encounter TTS when running local models like Piper, Coqui TTS, or Meta's MMS. Most TTS systems turn text into a waveform with a two-stage pipeline: a text-to-spectrogram model (e.g., Tacotron, FastSpeech) followed by a vocoder (e.g., HiFi-GAN, WaveGlow) that converts spectrograms into audio. Newer models like Bark or XTTS fold these stages into a single system. Latency and quality depend on model size and hardware: smaller models run in real time on a CPU, while larger ones benefit from GPU acceleration. VRAM usage is modest (1-4 GB for most models), making TTS accessible on consumer hardware.

Deeper dive

TTS systems have evolved from concatenative synthesis (stitching together pre-recorded speech units such as diphones) to statistical parametric synthesis (generating acoustic parameters that a vocoder renders as audio) and now to neural models. Neural TTS is the current standard and uses deep learning to generate natural-sounding speech. Two common architectures are: (1) autoregressive models like Tacotron 2, which predict mel-spectrograms frame by frame and then feed them to a vocoder; (2) non-autoregressive models like FastSpeech, which generate all frames in parallel and so offer lower latency. Models like Bark and XTTS instead generate discrete audio tokens, typically with a transformer decoder, and a decoder stage turns those tokens into a waveform. Operators choose models based on voice quality, language support, and inference speed. For real-time applications, Piper (optimized for CPU) and Coqui TTS (which benefits from a GPU) are popular. Fine-tuning TTS on a custom voice requires a dataset of clean speech recordings with matching transcripts and can be done with the training recipes in the open-source Coqui TTS toolkit or with custom scripts.
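To make the parallel-generation idea concrete, here is a minimal sketch of the length-regulator step used in FastSpeech-style models: each phoneme's hidden state is repeated for its predicted number of frames, so every output frame is known up front and can be decoded at once. The function name, the toy hidden states, and the durations are illustrative assumptions, not taken from any particular implementation.

```python
from typing import List, Sequence


def length_regulate(phoneme_states: Sequence[Sequence[float]],
                    durations: Sequence[int]) -> List[List[float]]:
    """Expand per-phoneme states to per-frame states by repeating
    each state durations[i] times (the length-regulator idea)."""
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme state")
    frames: List[List[float]] = []
    for state, n_frames in zip(phoneme_states, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the state once per output frame assigned to this phoneme.
        frames.extend(list(state) for _ in range(n_frames))
    return frames


# Toy input: three phonemes with 2-dimensional hidden states (made-up values).
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 1, 3]  # hypothetical predicted mel frames per phoneme

frames = length_regulate(states, durations)
print(len(frames))  # 6 frames = 2 + 1 + 3
```

An autoregressive model, by contrast, would have to produce these six frames one after another, each conditioned on the previous one, which is where the latency difference comes from.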

Practical example

On an RTX 3060 12GB, Coqui TTS's XTTS-v2 model (~1.5 GB VRAM) generates 10 seconds of speech in about 2 seconds, roughly five times faster than real time. For CPU-only inference, Piper's lightweight voices (e.g., en_US-lessac-medium) run about twice as fast as real time on an AMD Ryzen 5 5600X. VRAM usage rarely exceeds 4 GB, so TTS can run alongside other local AI tasks.
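Speed figures like these are often summarized as a real-time factor (RTF): synthesis time divided by the duration of the audio produced, where values below 1 mean faster than real time. The sketch below works through the numbers quoted above; the function names are illustrative, and the Piper figure assumes "2x real-time" means twice as fast as real time.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock time to synthesize a clip at a given RTF."""
    return audio_seconds * rtf


# XTTS-v2 example from the text: 10 s of speech in about 2 s.
xtts_rtf = real_time_factor(2.0, 10.0)
print(f"XTTS-v2 RTF: {xtts_rtf:.2f}")  # XTTS-v2 RTF: 0.20

# Piper example, assuming "2x real-time" means an RTF of 0.5.
piper_rtf = 0.5
print(f"Piper, 10 s clip: {synthesis_time(10.0, piper_rtf):.1f} s")  # Piper, 10 s clip: 5.0 s
```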

Workflow example

LM Studio and Ollama focus on text-generation models and do not natively support TTS, so speech synthesis is handled by a separate tool. With Piper, for example: echo 'Hello world' | piper --model en_US-lessac-medium.onnx --output_file output.wav. Models such as microsoft/speecht5_tts from the Hugging Face Hub can be run from Python instead. For batch processing, write a Python script using torch and transformers to load SpeechT5 and save audio files.
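A batch script along those lines might look like the sketch below. It assumes torch and transformers are installed and uses the microsoft/speecht5_tts and microsoft/speecht5_hifigan checkpoints. SpeechT5 also needs a speaker embedding (a 1x512 x-vector tensor); where that comes from is left to the reader, and the file name in the usage comment is hypothetical. The helper names write_wav and synthesize_batch are our own, not part of any library.

```python
import struct
import wave
from pathlib import Path
from typing import Iterable, List, Sequence

SAMPLE_RATE = 16_000  # SpeechT5 produces 16 kHz mono audio


def write_wav(path: Path, samples: Iterable[float],
              sample_rate: int = SAMPLE_RATE) -> None:
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    # Clip out-of-range values, then scale to the signed 16-bit range.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


def synthesize_batch(texts: Sequence[str], speaker_embedding,
                     out_dir: Path) -> List[Path]:
    """Synthesize each text with SpeechT5 and save clip_000.wav, clip_001.wav, ..."""
    # Heavy imports are kept local so write_wav works without them.
    import torch
    from transformers import (SpeechT5ForTextToSpeech, SpeechT5HifiGan,
                              SpeechT5Processor)

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    out_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    with torch.no_grad():
        for i, text in enumerate(texts):
            inputs = processor(text=text, return_tensors="pt")
            # With a vocoder supplied, generate_speech returns a 1-D waveform tensor.
            speech = model.generate_speech(
                inputs["input_ids"], speaker_embedding, vocoder=vocoder)
            path = out_dir / f"clip_{i:03d}.wav"
            write_wav(path, speech.tolist())
            paths.append(path)
    return paths


# Usage sketch (assumption: a 1x512 x-vector saved earlier under a hypothetical name):
#   import torch
#   embedding = torch.load("speaker_xvector.pt")
#   synthesize_batch(["Hello world", "Second sentence"], embedding, Path("out"))
```

Loading the model once and looping over the texts avoids paying the model start-up cost for every clip, which is usually the main reason to script batch jobs rather than invoke a tool per sentence.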

Reviewed by Fredoline Eruo. See our editorial policy.

When it doesn't work