RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo

© 2026 runlocalai.co · Independently operated
Natural language processing

Speech Synthesis

Speech synthesis, also known as text-to-speech (TTS), converts written text into spoken audio. In local AI, operators run TTS models such as Piper or Coqui TTS on their own hardware. These models generate audio waveforms from text input, typically using neural network architectures such as Tacotron or VITS. Output quality and speed depend on model size and available compute: smaller models run faster on CPU, while larger models benefit from GPU acceleration. Operators choose between real-time inference (audio is generated faster than it plays back) and batch processing that pre-renders audio ahead of time.
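Whether a setup counts as real-time can be checked by comparing synthesis time with the duration of the audio produced. The sketch below is illustrative only: the function names are hypothetical, the silent WAV is a stand-in for real TTS output, and the 0.5-second synthesis time is an invented figure, not a benchmark.

```python
import io
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the playback duration of a WAV file given its raw bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced.

    Values below 1.0 mean audio is generated faster than it plays back.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def make_silent_wav(seconds: float, sample_rate: int = 22050) -> bytes:
    """Build a silent 16-bit mono WAV in memory (stand-in for TTS output)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


if __name__ == "__main__":
    audio = make_silent_wav(2.0)
    duration = wav_duration_seconds(audio)
    # 0.5 s is a made-up synthesis time used only to exercise the formula.
    rtf = real_time_factor(synthesis_seconds=0.5, audio_seconds=duration)
    print(f"duration={duration:.2f}s rtf={rtf:.2f} real_time={rtf < 1.0}")
```

In practice the synthesis time would come from timing the actual TTS call (for example with time.perf_counter), and the duration from the WAV file the engine wrote.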

Deeper dive

Modern neural TTS systems consist of a text encoder, an acoustic model, and a vocoder. The text encoder converts characters or phonemes into linguistic features. The acoustic model (e.g., Tacotron2, FastSpeech) predicts a mel-spectrogram from those features. The vocoder (e.g., WaveGlow, HiFi-GAN) converts the mel-spectrogram into a raw audio waveform. End-to-end models like VITS combine these steps into a single network. Operators can choose from pre-trained models optimized for speed (e.g., Piper with ONNX runtime) or quality (e.g., Coqui TTS with VITS). Latency varies: a small Piper model may synthesize 1 second of audio in about 0.1 seconds on CPU (a real-time factor of 0.1), while a large VITS model on GPU might need about 0.05 seconds per second of audio. VRAM usage is modest (under 2 GB for most models), which makes TTS accessible on lower-end hardware.
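The three-stage data flow can be sketched with placeholder stages. This is a structural toy, not a real model: every function below is a hypothetical stand-in, and the constants (80 mel bins, 256 samples per frame, 5 frames per token) are assumed values chosen only to make the shapes concrete.

```python
from typing import List

N_MELS = 80           # mel bins per spectrogram frame (assumed value)
HOP_LENGTH = 256      # audio samples produced per mel frame (assumed value)
FRAMES_PER_TOKEN = 5  # fixed duration per token; real models predict this


def text_encoder(text: str) -> List[int]:
    """Map characters to integer IDs (stand-in for linguistic features)."""
    return [ord(ch) % 256 for ch in text.lower()]


def acoustic_model(token_ids: List[int]) -> List[List[float]]:
    """Emit a placeholder mel-spectrogram: FRAMES_PER_TOKEN frames per token."""
    frames: List[List[float]] = []
    for token in token_ids:
        for _ in range(FRAMES_PER_TOKEN):
            frames.append([token / 255.0] * N_MELS)
    return frames


def vocoder(mel: List[List[float]]) -> List[float]:
    """Emit placeholder waveform samples: HOP_LENGTH samples per mel frame."""
    samples: List[float] = []
    for frame in mel:
        level = sum(frame) / len(frame)
        samples.extend([level] * HOP_LENGTH)
    return samples


def synthesize(text: str) -> List[float]:
    """Chain the stages: text -> token IDs -> mel frames -> waveform samples."""
    return vocoder(acoustic_model(text_encoder(text)))
```

The point is the shape of the hand-off: tokens expand into mel frames, and mel frames expand into many more audio samples. An end-to-end model such as VITS would replace all three functions with a single network.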

Practical example

An operator with an RTX 3060 (12 GB VRAM) runs Piper TTS via piper --model en_US-lessac-medium.onnx --output_file output.wav < input.txt to generate speech from a text file (Piper reads its text from standard input; input.txt is a placeholder name). The model loads in ~200 MB VRAM and produces audio at ~2x real-time on GPU. For higher quality, they switch to Coqui TTS with a VITS model: tts --text "Hello" --model_name tts_models/en/ljspeech/vits --out_path output.wav, which uses ~1 GB VRAM and runs at ~0.8x real-time on the same GPU.
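The Piper invocation above can also be driven from a script. The following is a minimal sketch, assuming a piper executable is on PATH and that it reads the text to speak from standard input; the helper names are hypothetical, and only the --model and --output_file flags from the example above are used.

```python
import subprocess
from typing import List


def build_piper_command(model_path: str, output_path: str) -> List[str]:
    """Assemble the Piper CLI call using the two flags shown in the example."""
    return ["piper", "--model", model_path, "--output_file", output_path]


def synthesize_with_piper(text: str, model_path: str, output_path: str) -> None:
    """Send text to Piper on stdin and let it write a WAV file.

    Assumes `piper` is installed and on PATH; raises CalledProcessError
    if the process exits with a non-zero status.
    """
    subprocess.run(
        build_piper_command(model_path, output_path),
        input=text.encode("utf-8"),
        check=True,
    )
```

A call such as synthesize_with_piper("Hello from local TTS.", "en_US-lessac-medium.onnx", "output.wav") would then mirror the shell command. Passing the arguments as a list avoids shell quoting problems when the text contains quotes or other special characters.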

Workflow example

In a local AI assistant workflow, the operator uses Ollama to generate a text response, then pipes it to a TTS engine. For example: ollama run llama3.2:3b "Tell me a joke" | piper --model en_US-lessac-medium.onnx --output_file joke.wav. In this pipeline the TTS step starts only after the LLM has finished, so synthesis time is added on top of generation time. To reduce the delay, operators may keep the TTS model loaded in memory between requests or use streaming TTS (e.g., the Coqui TTS streaming API) to start playback before the full audio is generated.
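One way to approximate early playback without a streaming-capable engine is to split the LLM output into sentences and synthesize each one as soon as it is complete. This sentence-chunking approach is an assumption layered on the workflow above, not something the tools mentioned do by themselves, and the function name is hypothetical.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at '.', '!' or '?' followed by whitespace (naive rule).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in streamed text."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence;
        # the last part may still be growing, so keep it buffered.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    streamed = ["Why did the GPU ", "cross the road? To get ", "to the other VRAM. ", "Ha!"]
    for sentence in sentences_from_stream(streamed):
        # Each sentence could be handed to the TTS engine here.
        print(sentence)
```

The first sentence can be spoken while the LLM is still producing the rest. The split rule is deliberately simple and will break on abbreviations such as "e.g."; a production setup would need a more careful sentence splitter.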

Reviewed by Fredoline Eruo. See our editorial policy.
