
Text-to-Speech (TTS)

Generating natural-sounding speech from text. F5-TTS, XTTS-v2, Kokoro, and Sesame CSM-1B lead open-weight TTS in 2026.

Capability notes

Open-weight TTS in 2026 spans three architectures with distinct quality-latency profiles.

[Kokoro TTS](/tools/llama-cpp) (~82M params, StyleTTS 2 architecture) achieves 4.2–4.4 MOS — 0.3–0.5 MOS below ElevenLabs at 4.7 — at 80–150× real-time on [consumer GPUs](/hardware/rtx-4060-ti-16gb). Voice consistency holds across ~500 words; beyond that, pitch drift accumulates. Multilingual: English, Japanese, Korean, Chinese, French, Spanish, German, Italian, Portuguese, Hindi — Japanese and Korean are strong; Hindi and French show accent artifacts.

F5-TTS (~335M params, flow-matching) matches ElevenLabs on zero-shot voice cloning — 3 seconds of reference audio produces 85–90% speaker similarity. Weakness: generation takes 2–4× the audio duration (0.25–0.5× real-time), which rules out real-time streaming. It excels at pre-recorded content: audiobooks, voiceover, and podcast generation, where latency tolerance is minutes-to-hours.

XTTS-v2 (Coqui AI, ~1.1B params) supports 17 languages with fine-tuning capability. Fine-tune on 6–10 minutes of target audio for 90–95% speaker similarity, stable across 5,000+ words — the best open-weight option for long-form production. Tradeoff: 1–2× real-time on GPU; on CPU it runs at roughly 0.02–0.03× real-time — a 10-minute clip takes 5–10 hours.

The operational landscape: open-weight TTS delivers 85–90% of paid-API quality at zero per-character cost; the speed penalty applies only to the cloning-capable models. Open-weight dominates when privacy matters (text must not leave your infrastructure), at scale (>10 hours of audio/month), or when you need voice persistence across tooling changes (you own the checkpoint).
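To make the break-even point concrete, here is a minimal cost model — a sketch assuming ElevenLabs-style pricing of $0.30 per 1K characters, roughly 66K characters per hour of finished audio, and ~$80/month for a shared consumer GPU (all three figures are assumptions taken from the production-deployment numbers below):

```python
# Break-even sketch: hosted TTS API vs. a self-hosted GPU.
# Assumed figures (see "For production deployment"): $0.30/1K chars,
# ~66K characters per hour of audio, ~$80/month for a shared consumer GPU.
API_PRICE_PER_1K_CHARS = 0.30
CHARS_PER_AUDIO_HOUR = 66_000
SELF_HOST_MONTHLY = 80.0

def monthly_api_cost(audio_hours: float) -> float:
    """Hosted-API bill for a month of generated audio."""
    return audio_hours * CHARS_PER_AUDIO_HOUR / 1_000 * API_PRICE_PER_1K_CHARS

for hours in (1, 5, 10, 50):
    api = monthly_api_cost(hours)
    print(f"{hours:>3} h/month: API ${api:,.0f} vs self-host ${SELF_HOST_MONTHLY:,.0f}")
# The API bill passes the flat self-host cost around 4-5 hours/month,
# which is why the text above puts the break-even near 10 hours with margin.
```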

If you just want to try this

Lowest-friction path to a working setup.

Install [LM Studio](/tools/lm-studio), launch it, and from the in-app model browser search "kokoro-tts" and download the GGUF quantized version (~400 MB). [LM Studio](/tools/lm-studio) bundles a TTS interface — type text, select a voice preset from the dropdown (default: "af_heart" — American female), click Generate. It produces a WAV file for playback or export.

For command-line use with [llama.cpp](/tools/llama-cpp)'s built-in TTS:

```bash
wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-kokoro-en-v0_4-q8_0.gguf
./llama-tts -m ggml-kokoro-en-v0_4-q8_0.gguf -f input.txt -v af_heart -o output.wav
```

CPU generation runs at 30–60× real-time — a 1-minute passage in 1–2 seconds. A GPU (4 GB+) runs at 80–150× real-time — a 1-hour script in 25–45 seconds.

Available voices (~30 presets): `af_heart` (American female, warm), `am_adam` (American male), `bf_isabella` (British female), `bm_george` (British male, deep), `jf_alpha` (Japanese female). Test 3–4 voices with your text — voice perception is subjective.

Hardware: 2 GB VRAM minimum. Any GPU manufactured after 2019 qualifies. CPU-only is usable for batch work. This is the lowest hardware barrier of any local AI task on this site.

For voice cloning with 3 seconds of reference audio: pinokio.ai → search "F5-TTS" → install. The web UI accepts reference audio plus text and outputs a cloned voice. F5-TTS needs 6–8 GB VRAM.

For production deployment

Operator-grade recommendation.

Production TTS splits on streaming (<500ms to first audio) vs batch (pre-recorded, latency-tolerant).

**Streaming TTS.** Use Kokoro TTS with sentence-level chunking: receive text → split on sentence boundaries → generate audio per sentence → concatenate with cross-fade → stream. On [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb), each sentence generates in 100–300ms — total streaming latency stays under 500ms. (A minimal chunking sketch follows below.)

**Batch TTS.** Use XTTS-v2 fine-tuned on the target speaker for production-quality voice at 1–2× real-time. A 10-minute script takes 5–10 minutes on GPU. XTTS-v2 handles 5,000+ words with no voice drift and supports 17 languages. Fine-tuning: 6–10 minutes of clean target audio → Coqui AI fine-tune script (10–20 minutes on GPU) → export a ~4.5 GB checkpoint → use it for all subsequent generations.

**Throughput.** Kokoro on [RTX 4090](/hardware/rtx-4090): 150–250× real-time — a 1-hour audiobook in 15–25 seconds. On CPU (M4 Max): 40–60× real-time. XTTS-v2 on [RTX 4090](/hardware/rtx-4090): 1.5–2.5× real-time. On CPU: 0.03–0.05× real-time — 1 hour of audio takes 20–30 hours, making XTTS-v2 GPU-essential for production.

**API vs self-host.** ElevenLabs: $0.30/1K characters. At 1M chars/month (~15 hours of audio): $300/month. Self-hosted Kokoro on a shared [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) (~$80/month) saves $220/month. At 10M chars/month: saves $2,920/month. Self-hosted dominates at any non-trivial scale. The API wins only when you need ElevenLabs' 1,000+ voice library, streaming under 200ms, or zero infrastructure.

**Voice bank management.** Store fine-tuned XTTS-v2 checkpoints in versioned object storage (S3/MinIO). Tag each checkpoint with source speaker ID, fine-tuning date, and quality metrics (MOS, speaker similarity). Implement a testing pipeline: on checkpoint registration, generate the standard "rainbow passage," run MOS evaluation, and gate deployment on MOS >= 4.0 and similarity >= 90%.
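A minimal sketch of the sentence-chunking loop, assuming the `kokoro-onnx` package and model files from the setup walkthrough below. The cross-fade helper and the regex sentence splitter are illustrative choices, not part of any Kokoro API:

```python
import re
import numpy as np
from kokoro_onnx import Kokoro

# Assumes the two model files from the setup walkthrough are present.
kokoro = Kokoro("kokoro-v0_19.onnx", "voices.json")

def crossfade(a, b, sr, ms=20):
    """Join two clips with a short linear cross-fade to hide chunk seams."""
    a = np.asarray(a, dtype=np.float32).copy()
    b = np.asarray(b, dtype=np.float32)
    n = min(int(sr * ms / 1000), len(a), len(b))
    if n == 0:
        return np.concatenate([a, b])
    fade = np.linspace(0.0, 1.0, n)
    a[-n:] = a[-n:] * (1.0 - fade) + b[:n] * fade
    return np.concatenate([a, b[n:]])

def synth_streaming(text, voice="af_sarah"):
    """Generate sentence by sentence so the first audio is ready quickly."""
    audio, sr = None, 24_000
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        samples, sr = kokoro.create(sentence, voice=voice)
        # In a real server you would push `samples` to the client here
        # instead of accumulating the whole passage in memory.
        audio = samples if audio is None else crossfade(audio, samples, sr)
    return audio, sr
```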

What breaks

Failure modes operators see in the wild.

**Voice drift on long passages.** Symptom: after 300–500 words, the voice subtly changes — pitch shifts, timbre flattens. Kokoro is susceptible; fine-tuned XTTS-v2 is largely immune up to 5,000 words. Cause: the model accumulates pitch/timbre errors across long sequences without re-anchoring to the speaker reference. Mitigation: split at natural paragraph boundaries and re-inject the speaker reference embedding at each new paragraph. For XTTS-v2, keep passages under 3,000 words per batch.

**Pronunciation errors on rare words.** Symptom: proper nouns, technical terms, and loanwords mispronounced — "Dijkstra" becomes "dijik-struh," "SQL" becomes "squeal." Cause: grapheme-to-phoneme conversion relies on training-data frequencies — unseen words fall back to naive letter-to-sound rules. Mitigation: maintain a pronunciation dictionary mapping problematic words to ARPAbet/IPA phonemes, and preprocess text to replace known-problematic tokens with phoneme-tagged or respelled versions (a minimal preprocessor sketch follows below). XTTS-v2 fine-tuning teaches domain vocabulary if those words appear in the fine-tuning audio.

**Prosody breakdown on punctuation.** Symptom: monotone delivery on questions, rising intonation on statements, pauses in the wrong places. Cause: imperfect punctuation-to-prosody mapping — a period uniformly signals "end of utterance" without recognizing rhetorical questions or quoted speech. Mitigation: use SSML tags to control prosody — `<prosody rate="slow" pitch="high">` for questions, `<break time="300ms"/>` for explicit pauses. XTTS-v2 supports SSML through Coqui; the Kokoro llama.cpp path has no SSML support (see Runtime guidance), so prosody control there is limited to punctuation and text preprocessing. Preprocess text to add SSML based on sentence-type detection.

**Latency spikes under concurrent load.** Symptom: TTS time spikes from 300ms to 3–8 seconds under concurrent requests. Cause: the GPU is fully occupied by the first request; subsequent requests queue, adding 100–300ms per context switch. Mitigation: one model instance per GPU (no multiplexing), a Redis-backed FIFO queue, and a concurrency cap of 1 per GPU for streaming (latency-sensitive) or N = VRAM_GB / 8 for batch (throughput-sensitive). For streaming, deploy multiple small GPUs, each serving one stream.

**Language mixing within a passage.** Symptom: English text containing French names or Latin phrases comes out with unnatural pronunciation. Cause: the TTS model selects a single language phoneme inventory and cannot switch mid-passage. Mitigation: preprocess multilingual text to isolate foreign segments, generate each with the appropriate language-specific voice, and concatenate in post-production.
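A minimal sketch of the pronunciation-dictionary preprocessor, using plain-text respellings as the replacement target. The dictionary entries are illustrative; for engines with SSML support you would emit `<phoneme>` tags instead:

```python
import re

# Illustrative entries: map known-problematic tokens to phonetic respellings
# that the grapheme-to-phoneme stage handles reliably.
PRONUNCIATIONS = {
    "Dijkstra": "DIKE-struh",
    "SQL": "sequel",
    "nginx": "engine-ex",
    "kubectl": "kube control",
}

# One regex with word boundaries so "SQL" does not match inside "MySQL".
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in PRONUNCIATIONS) + r")\b"
)

def apply_pronunciations(text: str) -> str:
    """Replace known-problematic tokens before sending text to the TTS engine."""
    return _PATTERN.sub(lambda m: PRONUNCIATIONS[m.group(1)], text)

print(apply_pronunciations("Dijkstra's algorithm is faster than a SQL scan."))
# -> "DIKE-struh's algorithm is faster than a sequel scan."
```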

Hardware guidance

TTS has the lowest hardware barrier of any local AI workload. Kokoro (82M params, GGUF Q8 = ~400 MB) runs on 2 GB VRAM or CPU alone.

**CPU-only ($0).** Kokoro on modern desktop CPUs: 40–60× real-time — a 1-minute passage in 1–1.5 seconds. Sufficient for overnight batch processing. [Apple M4 Pro](/hardware/apple-m4-pro) Neural Engine: 60–80× real-time via CoreML — a unique Apple Silicon advantage.

**Entry GPU ($300–600).** Any 4 GB+ GPU runs Kokoro at 80–150× real-time. [RTX 3060 12GB](/hardware/rtx-3060-12gb), [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb), and [Intel Arc B580](/hardware/intel-arc-b580) all qualify. XTTS-v2 needs 6–8 GB for FP16 — [RTX 3060 12GB](/hardware/rtx-3060-12gb) runs it at 1.0–1.5× real-time. F5-TTS needs 6–8 GB — all 6 GB+ GPUs qualify.

**SMB tier ($1,500–2,500).** [RTX 4090](/hardware/rtx-4090) at 24 GB is overkill — Kokoro uses ~2 GB, leaving 22 GB idle. Its value is bandwidth (1.0 TB/s): 150–250× real-time on Kokoro. [RTX 5080](/hardware/rtx-5080) at 16 GB is the practical sweet spot — 150–200× real-time at $1,000, the best $/throughput for TTS.

**Enterprise ($8,000+).** Enterprise GPUs are unnecessary. A single [RTX 5080](/hardware/rtx-5080) serving Kokoro generates 150–200 hours of audio per hour of wall-clock time — more than most organizations' total demand. TTS scales horizontally: 4× [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) (~$2,000) serves 4 concurrent users at the same total throughput as 1× 5080. More small GPUs beat one large GPU.

**Streaming vs batch strategy.** Streaming: dedicated small GPUs, one stream per GPU, kept under 30% utilization to guarantee sub-500ms latency. Batch: shared large GPUs, queued jobs, utilization maximized at 80–90%. CPU batch is viable overnight — at 40–60× real-time, a 16-core CPU generates hundreds of hours of Kokoro audio per day. (A small capacity-planning helper follows below.)
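A small capacity-planning helper to keep the arithmetic honest — a sketch, with the real-time factors above treated as assumptions you should re-measure on your own hardware:

```python
def audio_hours(rtf: float, wall_hours: float, utilization: float = 1.0) -> float:
    """Finished audio produced by one device.

    rtf: real-time factor (150 means 150 hours of audio per compute-hour)
    wall_hours: wall-clock hours the device runs
    utilization: fraction of time actually generating (queues, I/O, idle)
    """
    return rtf * wall_hours * utilization

# Figures from this page, treated as assumptions to verify locally:
print(audio_hours(rtf=175, wall_hours=1))                   # RTX 5080, 1 hour -> ~175 h
print(audio_hours(rtf=50, wall_hours=24, utilization=0.8))  # 16-core CPU day -> ~960 h
```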

Runtime guidance

**Kokoro via llama.cpp vs XTTS-v2 via Coqui vs F5-TTS.** Kokoro TTS is distributed as GGUF and served via [llama.cpp](/tools/llama-cpp)'s `llama-tts` binary. One binary, one model file, one command = audio. GGUF provides Q4–Q8 quantization: Q8 (400 MB) for maximum quality, Q4 (200 MB) for minimum VRAM. The quality difference between Q8 and FP16 is negligible (<0.05 MOS). GPU offload via `-ngl 999`. Limitations: no SSML, no voice cloning (only baked-in presets), no streaming API — offline batch only.

XTTS-v2 is served via the Coqui TTS Python library (`pip install TTS`) and has the richest production feature set: fine-tuning API, SSML, 17 languages, streaming via WebSocket, a REST API server. Tradeoff: a Python + PyTorch + CUDA environment, ~4.5 GB of model files per fine-tuned voice, and a slower maintenance cadence than llama.cpp. Use it when you need voice cloning, multiple languages, or fine-tuning. (A minimal Coqui usage sketch follows below.)

F5-TTS uses a Gradio web UI for interactive work plus a Python API. Its differentiator is zero-shot voice cloning quality — 3 seconds of reference audio produces a clone that passes casual listening tests on short (<30 second) passages. Tradeoff: a research-quality project with minimal production infrastructure — no REST API, no streaming, no concurrency. Best for voice prototyping, not a production backend.

**Decision tree.** Non-technical users: [LM Studio](/tools/lm-studio) → search "kokoro-tts" → type → generate. CLI batch: [llama.cpp](/tools/llama-cpp) `llama-tts` + Kokoro GGUF. Production with voice customization: XTTS-v2 via the Coqui TTS server — fine-tuning, REST API, SSML, 17 languages. Zero-shot cloning demos: F5-TTS via Gradio. Quality ranking: XTTS-v2 fine-tuned > F5-TTS zero-shot > Kokoro. Speed ranking: Kokoro (150–250× real-time) > XTTS-v2 (1–2× real-time) > F5-TTS (0.25–0.5× real-time).

**Open-weight vs API.** Kokoro at 4.2–4.4 MOS vs ElevenLabs Turbo at 4.7 is a 0.3–0.5 gap — imperceptible outside A/B testing for most uses (podcast, voiceover, audiobook). The gap matters for commercial voice acting, where 0.3 MOS drives listener preference. XTTS-v2 fine-tuned closes to within 0.2 MOS of ElevenLabs on the fine-tuned speaker — at that point the decision is about speaker-management convenience (ElevenLabs UI vs self-managed checkpoints), not audio quality.
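A minimal sketch of the Coqui path, using the XTTS-v2 model ID from the Coqui model zoo. The reference clip and output file names are placeholders, and `"cuda"` assumes a CUDA-enabled PyTorch install:

```python
from TTS.api import TTS

# Load XTTS-v2 from the Coqui model zoo (downloads the model on first run).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Zero-shot cloning: condition on a short reference clip of the target speaker.
tts.tts_to_file(
    text="This voice is cloned from a few seconds of reference audio.",
    speaker_wav="reference_speaker.wav",  # placeholder path
    language="en",
    file_path="cloned_output.wav",
)
```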

Setup walkthrough

  1. `pip install kokoro-onnx soundfile` (Kokoro TTS with the ONNX runtime — CPU-friendly).
  2. Download the Kokoro v0.19 model files: `wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files/kokoro-v0_19.onnx` and `wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files/voices.json`.
  3. Run the short Python script shown below the list.
  4. First audio arrives in 1–3 seconds on CPU. Quality is near-frontier for open-weight TTS.
  5. Alternative: `pip install TTS` (Coqui TTS), then `tts --text "Hello world" --model_name tts_models/en/ljspeech/tacotron2-DDC --out_path output.wav`.
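The script from step 3, expanded into runnable form (the same calls as the original one-liner, reformatted with imports up top):

```python
import soundfile as sf
from kokoro_onnx import Kokoro

# Point at the two files downloaded in step 2.
kokoro = Kokoro("kokoro-v0_19.onnx", "voices.json")

# create() returns a float waveform and its sample rate.
samples, sample_rate = kokoro.create(
    "Hello world, this is local TTS.", voice="af_sarah"
)
sf.write("output.wav", samples, sample_rate)
```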

For voice cloning: use F5-TTS (`pip install f5-tts`) — provide 5–10 seconds of reference audio.

The cheap setup

Kokoro TTS runs entirely on CPU — 2–5× real-time on a modern laptop, considerably faster on desktop cores. No GPU required. Any $300–400 laptop with a Ryzen 5 or Intel i5 and 8 GB RAM will generate high-quality speech. For faster batch TTS, a used GTX 1060 6 GB ($60) provides a 5–10× speedup via GPU-accelerated ONNX. If you need voice cloning specifically, add an RTX 2060 6 GB ($100 used) — F5-TTS needs ~6 GB VRAM for reference-based cloning, and the 2060 sits right at that floor.

The serious setup

A used [RTX 3060 12GB](/hardware/rtx-3060-12gb) (~$200–250) is more than sufficient for production TTS. F5-TTS with voice cloning generates a 10-second clip in ~20–40 seconds (2–4× the audio duration). Kokoro ONNX on GPU achieves 50–100× real-time. For podcast-generation pipelines, pair it with a fast CPU (Ryzen 7 7700X) for orchestration and audio post-processing. Total build: ~$800–1,000. TTS is one of the most GPU-efficient AI workloads — a 6 GB card handles most models.

Common beginner mistake

The mistake: installing XTTS or Coqui TTS, running the largest available model variant, then wondering why it takes 30 seconds to generate 2 seconds of audio on a laptop CPU. Why it fails: full-size autoregressive TTS models (XTTS-v2, Tortoise) are designed for quality on GPU — on CPU they run at 0.01–0.05× real-time. The fix: use Kokoro ONNX for CPU — it runs orders of magnitude faster there and quality is 90% as good. Or use a non-autoregressive architecture like F5-TTS, which generates audio in parallel rather than token by token (though it still wants a 6 GB+ GPU). Only use XTTS/Tortoise if you need their specific voice-cloning quality and have a GPU.


Reality check

Audio models are surprisingly forgiving on hardware. OpenAI Whisper, Coqui TTS, and whisper.cpp all run well on 8–12 GB GPUs. The bottleneck is rarely the GPU; it's audio preprocessing and disk I/O for batch transcription.

Common mistakes

  • Overspending on GPU for audio-only workflows (8–12 GB is enough for Whisper)
  • Running audio + LLM concurrently without budgeting VRAM for both
  • Using FP32 weights when FP16/INT8 give a 2–3× speedup with no quality loss (see the sketch below)
  • Forgetting that audio preprocessing eats CPU cycles — a fast SSD helps more than expected
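A minimal sketch of the INT8 step using ONNX Runtime's dynamic quantization. The file names are placeholders, and whether INT8 helps depends on the model and CPU, so treat the 2–3× figure as something to benchmark:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Rewrite the FP32 model with INT8 weights; activations stay dynamic.
quantize_dynamic(
    model_input="kokoro-v0_19.onnx",       # placeholder: your FP32 model
    model_output="kokoro-v0_19-int8.onnx",
    weight_type=QuantType.QInt8,
)
```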


Before you buy

Verify that your specific hardware can handle text-to-speech before committing money.

Hardware buying guidance for Text-to-Speech (TTS)

Voice cloning and long-form TTS reward VRAM headroom — 6–8 GB unlocks the cloning-capable models — but beyond that, extra GPU buys speed, not quality.
