RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Natural language processing

Text-to-Speech (TTS)

Text-to-Speech (TTS) converts written text into spoken audio using neural models. Operators encounter TTS when running local models like Piper, Coqui TTS, or Meta's MMS. TTS models generate waveforms from text tokens, typically using a two-stage pipeline: a text-to-spectrogram model, also called an acoustic model (e.g., Tacotron, FastSpeech), followed by a vocoder (e.g., HiFi-GAN, WaveGlow) that converts spectrograms into audio. Modern end-to-end models like Bark or XTTS combine these steps. Latency and quality depend on model size and hardware: smaller models run in real time on CPU, while larger ones benefit from GPU acceleration. VRAM usage is modest (1 to 4 GB for most models), making TTS accessible on consumer hardware.

Deeper dive

TTS systems have evolved from concatenative synthesis (stitching together pre-recorded speech units such as phonemes or diphones) to statistical parametric synthesis (generating speech from acoustic parameters through a vocoder) and now to neural models. Neural TTS is the current standard and uses deep learning to generate natural-sounding speech. Two common architectures are: (1) autoregressive models like Tacotron 2 that predict mel-spectrograms frame by frame, then feed them to a vocoder; (2) non-autoregressive models like FastSpeech that generate all frames in parallel, offering lower latency. End-to-end models like Bark and XTTS instead generate discrete audio tokens, often using a transformer decoder, and decode those tokens into a waveform. Operators choose models based on voice quality, language support, and inference speed. For real-time applications, models like Piper (optimized for CPU) or Coqui TTS (GPU-accelerated) are popular. Fine-tuning TTS on custom voices requires a dataset of clean speech recordings and can be done with the open-source Coqui TTS training recipes or custom scripts; Coqui Studio, the hosted option, is no longer available since Coqui shut down in early 2024.
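
To make the autoregressive versus non-autoregressive contrast concrete, here is a toy sketch in plain Python. It is not a real model: the function names, the stand-in arithmetic, and the durations are illustrative assumptions. The autoregressive loop needs one sequential step per output frame, while the non-autoregressive path first expands each input token by a predicted duration (the length-regulator idea used by FastSpeech), after which every frame can be computed independently.

```python
def autoregressive_frames(n_frames, step=lambda prev: 0.5 * prev + 1.0):
    """Toy autoregressive decoder: frame t needs frame t-1, so the loop is sequential."""
    frames, prev = [], 0.0
    for _ in range(n_frames):
        prev = step(prev)  # stand-in for one decoder step
        frames.append(prev)
    return frames


def length_regulate(tokens, durations):
    """Repeat each token by its duration so the total frame count is known up front."""
    expanded = []
    for token, duration in zip(tokens, durations):
        expanded.extend([token] * duration)
    return expanded


def parallel_frames(tokens, durations, frame_fn=len):
    """Toy non-autoregressive decoder: each frame depends only on its own expanded token."""
    return [frame_fn(token) for token in length_regulate(tokens, durations)]


if __name__ == "__main__":
    # Hypothetical phoneme tokens and per-token durations (in frames).
    print(length_regulate(["HH", "AH", "L"], [2, 3, 1]))
    print(len(autoregressive_frames(6)), "sequential steps")
    print(len(parallel_frames(["HH", "AH", "L"], [2, 3, 1])), "frames, order-independent")
```

In a real system the per-frame work in the parallel path runs as one batched pass on the GPU, which is where the latency advantage comes from.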

Practical example

On an RTX 3060 12GB, running Coqui TTS's XTTS-v2 model (~1.5 GB VRAM) generates 10 seconds of speech in about 2 seconds, roughly 5x faster than real time. For CPU-only inference, Piper's low-resource models (e.g., en_US-lessac-medium) run at about 2x real-time speed on an AMD Ryzen 5 5600X, meaning audio is produced about twice as fast as it plays back. VRAM usage rarely exceeds 4 GB, so TTS can run alongside other local AI tasks.
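
The figures above reduce to simple arithmetic, sketched here in Python. The helper names are our own, and the real-time factor (RTF) convention used is synthesis time divided by audio duration, so values below 1.0 mean faster than real time. The inputs are the numbers quoted in this section, not new measurements.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of audio produced (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def vram_headroom_gb(total_vram_gb, tts_vram_gb):
    """VRAM left for other workloads once the TTS model is loaded."""
    return total_vram_gb - tts_vram_gb


if __name__ == "__main__":
    # XTTS-v2 on an RTX 3060 12GB: 10 s of speech in about 2 s, ~1.5 GB VRAM.
    rtf = real_time_factor(2.0, 10.0)
    print(f"RTF: {rtf:.2f}, speed-up over real time: {1 / rtf:.1f}x")
    print(f"VRAM headroom: {vram_headroom_gb(12.0, 1.5):.1f} GB")
```

An RTF comfortably below 1.0 is what makes streaming playback practical; the remaining VRAM is what is left for an LLM or other model sharing the card.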

Workflow example

LM Studio targets text-generation models and, to our knowledge, has no built-in TTS tab, so a model like microsoft/speecht5_tts has to be run outside it. Ollama likewise does not natively support TTS. In both cases, use a separate tool like Piper: echo 'Hello world' | piper --model en_US-lessac-medium.onnx --output_file output.wav. For batch processing, write a Python script using torch and transformers to load SpeechT5 and save audio files.
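
A batch script along those lines might look like the sketch below. It assumes torch, transformers, numpy, and sentencepiece are installed, and that the Hugging Face checkpoints microsoft/speecht5_tts and microsoft/speecht5_hifigan can be downloaded. SpeechT5 also needs a speaker embedding; here it is read from a hypothetical .npy file holding a 512-dimensional x-vector, which you would have to supply yourself. The function names and output layout are our own choices.

```python
import struct
import wave
from pathlib import Path

SAMPLE_RATE = 16000  # SpeechT5 checkpoints produce 16 kHz audio


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1, 1] as 16-bit mono PCM using only the stdlib."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    frames = struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(frames)


def synthesize_batch(lines, speaker_embedding_path, out_dir="tts_out"):
    """Synthesize one WAV file per input line and return the written paths."""
    # Heavy imports are kept local so write_wav stays usable without them.
    import numpy as np
    import torch
    from transformers import (
        SpeechT5ForTextToSpeech,
        SpeechT5HifiGan,
        SpeechT5Processor,
    )

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    # Assumption: the .npy file contains a single 512-dimensional x-vector.
    speaker = torch.from_numpy(np.load(speaker_embedding_path)).float().reshape(1, 512)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, text in enumerate(lines):
        inputs = processor(text=text, return_tensors="pt")
        with torch.no_grad():
            # Acoustic model predicts a spectrogram; the vocoder turns it into audio.
            speech = model.generate_speech(
                inputs["input_ids"], speaker_embeddings=speaker, vocoder=vocoder
            )
        path = out / f"line_{i:03d}.wav"
        write_wav(path, speech.tolist())
        paths.append(path)
    return paths


if __name__ == "__main__":
    # "speaker.npy" is a placeholder path, not a file shipped with any library.
    synthesize_batch(["Hello world.", "Second sentence."], "speaker.npy")
```

Loading the model once and looping over the lines avoids paying the load cost per file, which usually dominates for short utterances.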

Reviewed by Fredoline Eruo. See our editorial policy.
