Capability notes
Zero-shot voice cloning creates synthetic speech in a target speaker's voice from a short reference clip — no fine-tuning, no per-speaker training. The two dominant open-weight systems in 2026 are [**F5-TTS**](/tools/f5-tts) (flow-matching, newer architecture) and [**XTTS-v2**](/tools/xtts-v2) (Coqui AI, autoregressive + vocoder, battle-tested). Both achieve speaker similarity (SIM) scores of 0.80-0.92 on clean reference audio, where SIM is the cosine similarity a speaker-verification model assigns between the reference and the generated speech (higher means the clone is harder to tell apart from the target). Professional systems (ElevenLabs, OpenAI TTS) reach 0.90-0.96.
**Minimum reference audio**: 6-10 seconds of clean, single-speaker audio at 16 kHz+. XTTS-v2 tolerates shorter clips (6s minimum per paper); F5-TTS benefits from 10-15s. Below 6 seconds, both produce generic "average voice" lacking timbral identity. Reference audio quality matters more than length — 6 seconds of studio recording outperforms 30 seconds of smartphone audio with room reverb. Clean single-speaker recordings without music, overlapping speech, or compression artifacts are essential.
**Language coverage**: XTTS-v2 supports 17 languages with cross-language cloning — a 6-second English reference generates Chinese speech with moderate accent artifacts. F5-TTS focuses on English and Chinese with better prosody (natural rhythm) but narrower language support. OpenVoice offers finer-grained control over tone, accent, and emotion separately from timbre.
**What zero-shot cloning cannot do**: Clone a singing voice (that requires specialized models like RVC), clone pathological speech patterns, maintain speaker identity beyond 60-90 seconds of continuous speech (drift begins, prosody flattens), or produce emotional range equivalent to the source speaker. Critical distinction: [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M) is a TTS model with fixed pre-defined voices — it does NOT clone arbitrary voices. Readers frequently confuse Kokoro with cloning-capable TTS.
If you just want to try this
Lowest-friction path to a working setup.
Download [TTS Generation WebUI](/tools/tts-generation-webui) (Gradio-based frontend bundling XTTS-v2, F5-TTS, and other TTS engines). This is the simplest path: the WebUI handles model download, quantization, and reference audio upload. No CLI or Python required.
Record a 6-10 second WAV clip of the target voice — one sentence of neutral speech in a quiet room, exported as 16 kHz mono WAV. The cleaner the clip, the better the clone. Avoid WhatsApp voice messages (badly compressed), background music, overlapping speech, or heavy reverb.
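If the recording isn't already 16 kHz mono WAV, a few lines of Python handle the conversion. A minimal sketch using pydub (assumes `pydub` and ffmpeg are installed; filenames are placeholders):

```python
from pydub import AudioSegment
from pydub.effects import normalize

# Load whatever the recorder produced (m4a, mp3, stereo WAV, ...).
clip = AudioSegment.from_file("raw_recording.m4a")

# Convert to 16 kHz, mono, 16-bit and normalize the level.
clip = clip.set_frame_rate(16000).set_channels(1).set_sample_width(2)
clip = normalize(clip)

# Export as uncompressed WAV for use as the reference clip.
clip.export("reference_16k_mono.wav", format="wav")
print(f"Duration: {clip.duration_seconds:.1f}s (aim for 6-10 seconds)")
```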
Upload the reference clip to the WebUI, type text, hit generate. First output takes 10-30 seconds on a GPU; subsequent generations are faster due to caching. Early results may sound robotic — adjust temperature (0.3-0.7) and speed until the output sounds natural. Lower temperature = more stable but flatter; higher = more expressive but more prone to artifacts.
CPU-only generation is possible with XTTS-v2 but slow — 30-60 seconds per 10-second utterance. A GPU ([RTX 3060 12GB](/hardware/rtx-3060-12gb)) reduces this to 2-5 seconds. F5-TTS is GPU-only in practice — CPU generation exceeds 5 minutes per utterance.
If you get poor results, the reference audio is usually the problem. Re-record with better mic placement (6-12 inches from mouth), eliminate room echo (record in a carpeted room with soft furnishings), and ensure the speaker maintains consistent pitch and pace throughout the clip.
For production deployment
Operator-grade recommendation.
Production voice cloning requires consistency, ethical guardrails, and throughput — requirements that don't surface in one-off demo use.
**Consistency at scale**: Zero-shot models drift on long passages. Implement reference re-injection: every 20-30 seconds of generated speech, re-inject the original reference audio embedding to reset speaker identity. This reduces SIM score degradation from 0.05-0.10 (over 60s) to 0.01-0.02. Implementation: chunk text into 20-30s audio segments, prepend reference embedding to each chunk.
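A minimal sketch of the re-injection pattern, assuming the Coqui TTS Python API for XTTS-v2 (`pip install TTS`); passing the reference WAV with every chunk makes the library recompute the speaker conditioning per segment, which is the reset described above:

```python
import numpy as np
from TTS.api import TTS

# Load XTTS-v2 once; move to "cuda" if a GPU is available.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

def clone_long_text(chunks, reference_wav, language="en"):
    """Synthesize short text chunks (each ~20-30s of speech) and concatenate,
    re-supplying the reference clip for every chunk to reset speaker identity."""
    pieces = []
    for chunk in chunks:
        wav = tts.tts(text=chunk, speaker_wav=reference_wav, language=language)
        pieces.append(np.asarray(wav, dtype=np.float32))
    return np.concatenate(pieces)

# `chunks` would come from a sentence-level splitter sized to ~20-30s of audio.
audio = clone_long_text(
    ["First chunk of the script...", "Second chunk of the script..."],
    reference_wav="reference_16k_mono.wav",
)
```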
**Ethical and legal guardrails**: Voice cloning without consent is illegal in multiple jurisdictions. Production systems need (a) explicit consent recording with timestamped audit trail, (b) speaker verification before cloning (voice biometric confirming reference matches registered speaker), (c) watermarking on output audio (AudioSeal by Meta — inaudible, detectable, encoding speaker ID + generation timestamp). Budget $0.001-0.005/second of output for watermark overhead.
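A sketch of the watermarking step with the open-source `audioseal` package (`pip install audioseal`); the model names and calls follow its published README, so verify them against the release you install:

```python
import torchaudio
from audioseal import AudioSeal

# Watermark generator and matching detector (16-bit message variants).
generator = AudioSeal.load_generator("audioseal_wm_16bits")
detector = AudioSeal.load_detector("audioseal_detector_16bits")

# Generated speech as a (batch, channels, samples) tensor; mono 16 kHz assumed.
wav, sr = torchaudio.load("cloned_output.wav")
wav = wav.unsqueeze(0)

# Add an inaudible watermark to the output before delivery.
watermark = generator.get_watermark(wav, sr)
watermarked = wav + watermark
torchaudio.save("cloned_output_wm.wav", watermarked.squeeze(0), sr)

# Later: detection returns a presence probability plus the decoded message bits.
probability, message = detector.detect_watermark(watermarked, sr)
print("watermark probability:", probability)
```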
**Voice verification pipeline**: Run speaker embedding comparison (ECAPA-TDNN or WavLM) between reference and every output. Flag outputs below SIM 0.70 for human review. This catches degraded references producing unidentifiable clones.
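A sketch of this check using SpeechBrain's pretrained ECAPA-TDNN verifier (`pip install speechbrain`); the import path has moved between SpeechBrain releases, so treat it as illustrative:

```python
from speechbrain.inference.speaker import SpeakerRecognition

# Pretrained ECAPA-TDNN speaker-verification model from the SpeechBrain hub.
verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

SIM_THRESHOLD = 0.70  # flag anything below this for human review

# Cosine similarity between the reference clip and a generated output.
score, prediction = verifier.verify_files(
    "reference_16k_mono.wav", "cloned_output.wav"
)
if score.item() < SIM_THRESHOLD:
    print(f"FLAG for review: SIM {score.item():.2f} below {SIM_THRESHOLD}")
```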
**Throughput optimization**: For batch generation (voicing 100 hours of educational content), XTTS-v2 with [vLLM](/tools/vllm)-style continuous batching reduces generation time 40-60% vs sequential. [RTX 4090](/hardware/rtx-4090) handles 3-5 concurrent streams; [NVIDIA L40S](/hardware/nvidia-l40s) handles 8-12. For real-time streaming (voice assistants, live dubbing), F5-TTS's flow-matching supports sub-500ms first-chunk latency on [RTX 4070](/hardware/rtx-4070)-class GPUs.
**Voice library management**: Store speaker embedding vectors in [pgvector](/tools/pgvector) — ~512 floats per speaker, under 2 KB. Query: "find speaker matching this voice sample" → nearest neighbor search. Much more compact than raw reference audio storage.
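A sketch of the embedding store, assuming PostgreSQL with the pgvector extension, the `psycopg` driver, and the `pgvector` Python adapter; table and column names are illustrative:

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=voices", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)

# One row per registered speaker: a 512-dim embedding is under 2 KB.
conn.execute("""
    CREATE TABLE IF NOT EXISTS speakers (
        id bigserial PRIMARY KEY,
        name text NOT NULL,
        embedding vector(512) NOT NULL
    )
""")

def register_speaker(name: str, embedding: np.ndarray) -> None:
    conn.execute(
        "INSERT INTO speakers (name, embedding) VALUES (%s, %s)",
        (name, embedding),
    )

def match_voice(embedding: np.ndarray, k: int = 3):
    # Nearest-neighbor search by cosine distance: "find the speaker matching this sample".
    return conn.execute(
        "SELECT name, embedding <=> %s AS distance FROM speakers "
        "ORDER BY distance LIMIT %s",
        (embedding, k),
    ).fetchall()
```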
What breaks
Failure modes operators see in the wild.
- **Speaker bleed (reference environment baked in).** The cloned voice carries acoustic artifacts from the reference recording — room tone, microphone color, background hum. Generated speech sounds like the target speaker in the wrong room. Mitigation: source separation (Demucs) on reference audio before embedding extraction to isolate dry voice. Residual artifacts remain at 5-10%.
- **Low-quality source degradation.** Compressed references (WhatsApp, Zoom at 8 kHz, MP3 96 kbps) produce a "tinny" clone — high-frequency timbral detail lost in compression cannot be reconstructed. Mitigation: enforce minimum reference quality — 16 kHz, 16-bit, uncompressed, SNR >30 dB. Reject sub-threshold references rather than producing degraded output (a minimal quality-gate sketch follows this list).
- **Language mismatch artifacts.** Cloning an English speaker's voice to generate Chinese produces accent artifacts — the model carries English pronunciation habits into the Chinese output and gets tones wrong. The timbre is correct, but the prosody belongs to the wrong language. Mitigation: match reference language to target language. For cross-language cloning, use XTTS-v2's cross-language mode and accept a 10-15% quality reduction.
- **Emotional flatness.** Zero-shot cloning captures speaker identity but not expressiveness. Output defaults to neutral reading style. Mitigation: provide per-emotion reference clips (happy, sad, urgent) and select the emotion-matched clip at generation time. This is per-emotion reference management, not automatic style transfer.
- **Long-passage drift.** Speaker identity degrades after 60-90 seconds — SIM drops 0.08-0.15. Mitigation: segment text into <30-second chunks, re-extract and re-inject speaker embedding per chunk. Adds 5-10% latency but preserves identity.
- **Artifact-on-silence.** Punctuation implying silence (periods, paragraph breaks) generates synthetic vocal fry or breath sounds absent from the reference. Mitigation: post-process with silence detector, replace artifacts with true silence or room tone from reference.
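For the low-quality-source failure mode above, the cheapest defense is a hard gate on incoming reference clips. A minimal sketch with `soundfile` and NumPy; the SNR figure (quietest frames treated as the noise floor) is a rough heuristic, not a calibrated measurement:

```python
import numpy as np
import soundfile as sf

def check_reference(path: str, min_sr: int = 16000, min_snr_db: float = 30.0):
    """Reject reference clips likely to produce a degraded clone."""
    audio, sr = sf.read(path)
    if audio.ndim > 1:
        return False, "multi-channel: export as mono"
    if sr < min_sr:
        return False, f"sample rate {sr} Hz below {min_sr} Hz"
    if not 6.0 <= len(audio) / sr <= 30.0:
        return False, "clip should be roughly 6-30 seconds"

    # Crude SNR estimate: quietest 10% of 50 ms frames approximate the noise floor.
    frame = int(0.05 * sr)
    frames = audio[: len(audio) - len(audio) % frame].reshape(-1, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1) + 1e-12)
    snr_db = 20 * np.log10(np.percentile(rms, 90) / np.percentile(rms, 10))
    if snr_db < min_snr_db:
        return False, f"estimated SNR {snr_db:.0f} dB below {min_snr_db} dB"
    return True, f"ok (estimated SNR {snr_db:.0f} dB)"
```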
Hardware guidance
**Hobbyist (any GPU 6+ GB)**: [RTX 3060 12GB](/hardware/rtx-3060-12gb) runs XTTS-v2 at 2-5s per 10s utterance. F5-TTS requires 8+ GB. CPU-only is viable for XTTS-v2 batch work — a modern 8-core CPU needs roughly 30-60 seconds of compute per 10-second utterance. Acceptable for occasional use. [Apple M2 Ultra](/hardware/apple-m2-ultra) or [M3 Ultra](/hardware/apple-m3-ultra) Macs run XTTS-v2 via CPU fallback with acceptable latency for interactive use.
**SMB ($1,500-$3,000)**: [RTX 4070 Ti](/hardware/rtx-4070-ti) or [RTX 5070 Ti](/hardware/rtx-5070-ti) (12-16 GB). Runs both XTTS-v2 and F5-TTS with <2s generation per utterance. Supports 2-3 concurrent streams for small production (weekly podcast, YouTube voiceovers).
**Enterprise ($5,000-$15,000)**: [RTX 4090 24GB](/hardware/rtx-4090) or [RTX 5090 32GB](/hardware/rtx-5090). Runs XTTS-v2 + F5-TTS simultaneously on one GPU. 4-8 concurrent streams. [NVIDIA L40S](/hardware/nvidia-l40s) 48 GB handles 8-16 concurrent streams — the right choice for multi-tenant voice cloning as a service.
**Frontier**: Not applicable. Voice cloning is lightweight — XTTS-v2 is ~1.7B params, F5-TTS ~335M. Diminishing returns past a single L40S or RTX 4090. Scaling is horizontal (more GPUs for concurrent streams) not vertical. For sub-200ms first-chunk latency on real-time cloning, add a dedicated low-latency GPU ([RTX 4070 Super](/hardware/rtx-4070-super) or better).
**Edge/mobile**: [Snapdragon 8 Gen 3](/hardware/snapdragon-8-gen-3) and [Snapdragon 8 Elite](/hardware/snapdragon-8-elite) run quantized XTTS-v2 on-device via [llama.cpp](/tools/llama-cpp) at 5-15s per utterance. F5-TTS is not practical for on-device.
Runtime guidance
**If you want the simplest GUI** → [TTS Generation WebUI](/tools/tts-generation-webui) (Gradio-based). Bundles XTTS-v2, F5-TTS, Bark. Handles model download, reference upload, parameter tuning without code. Supports batch generation and speaker library management.
**If you need programmatic API access** → Coqui AI TTS library (`pip install TTS`) for [XTTS-v2](/tools/xtts-v2), or `pip install f5-tts` for [F5-TTS](/tools/f5-tts). Both expose the same basic flow: load model, load reference, synthesize. Both support streaming output via an iterator pattern for real-time playback.
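For XTTS-v2, that flow looks roughly like this (Coqui TTS Python API; the model name and arguments match recent releases but may drift):

```python
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Load the reference, synthesize, write the cloned audio to disk.
tts.tts_to_file(
    text="A short test sentence in the cloned voice.",
    speaker_wav="reference_16k_mono.wav",
    language="en",
    file_path="output.wav",
)
```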
**If you need real-time streaming** → F5-TTS with streaming inference. Lower first-chunk latency (200-400ms vs XTTS-v2's 500ms-1s) because flow matching generates the audio non-autoregressively, while XTTS-v2 builds its output token by token before running a vocoder. For live dubbing or voice assistants.
**If you need batch production** → Wrap either model in FastAPI + job queue (Redis/RabbitMQ). Submit text + reference → get job ID → poll for completed audio. This handles GPU contention and provides failure recovery. Use continuous batching for multi-stream synthesis.
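A minimal sketch of the submit/poll pattern with FastAPI; it substitutes an in-process dict and `BackgroundTasks` for Redis/RabbitMQ, and `run_tts` is a hypothetical stand-in for whichever engine is deployed:

```python
import uuid
from fastapi import BackgroundTasks, FastAPI, UploadFile

app = FastAPI()
jobs: dict[str, dict] = {}  # in-memory stand-in for Redis/RabbitMQ job state

def run_tts(job_id: str, text: str, reference_path: str) -> None:
    # Hypothetical worker: call XTTS-v2 / F5-TTS here and write the audio file.
    jobs[job_id] = {"status": "done", "audio_path": f"/data/{job_id}.wav"}

@app.post("/jobs")
async def submit(text: str, reference: UploadFile, background: BackgroundTasks):
    job_id = uuid.uuid4().hex
    reference_path = f"/data/{job_id}_ref.wav"
    with open(reference_path, "wb") as f:
        f.write(await reference.read())
    jobs[job_id] = {"status": "queued"}
    background.add_task(run_tts, job_id, text, reference_path)
    return {"job_id": job_id}

@app.get("/jobs/{job_id}")
def status(job_id: str):
    # Poll until status is "done", then fetch the audio from audio_path.
    return jobs.get(job_id, {"status": "unknown"})
```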
**If you need voice conversion (not TTS)** → OpenVoice (MyShell) via its Gradio demo or Python API. Converts existing recorded speech to another speaker's voice while preserving linguistic content and prosody. Useful for dubbing or anonymization.
**Important boundary**: [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M) produces high-quality TTS with pre-defined voices but does NOT clone arbitrary voices. If someone suggests Kokoro for cloning, redirect to XTTS-v2 or F5-TTS.