
Voice Cloning

Replicating a specific voice from a few seconds of reference audio. F5-TTS and XTTS-v2 are the leading open-weight zero-shot voice cloning systems.

Capability notes

Zero-shot voice cloning creates synthetic speech in a target speaker's voice from a short reference clip — no fine-tuning, no per-speaker training. Two dominant open-weight systems in 2026: [**F5-TTS**](/tools/f5-tts) (flow-matching, newer architecture) and [**XTTS-v2**](/tools/xtts-v2) (Coqui AI, autoregressive + vocoder, battle-tested). Both achieve speaker similarity (SIM) scores of 0.80-0.92 on clean reference audio — a human listener identifies the target speaker ~80-92% of the time in A/B tests. Professional systems (ElevenLabs, OpenAI TTS) reach 0.90-0.96.

**Minimum reference audio**: 6-10 seconds of clean, single-speaker audio at 16 kHz+. XTTS-v2 tolerates shorter clips (6s minimum per paper); F5-TTS benefits from 10-15s. Below 6 seconds, both produce a generic "average voice" lacking timbral identity. Reference audio quality matters more than length — 6 seconds of studio recording outperforms 30 seconds of smartphone audio with room reverb. Clean single-speaker recordings without music, overlapping speech, or compression artifacts are essential.

**Language coverage**: XTTS-v2 supports 17 languages with cross-language cloning — a 6-second English reference generates Chinese speech with moderate accent artifacts. F5-TTS focuses on English and Chinese with better prosody (natural rhythm) but narrower language support. OpenVoice offers finer-grained control over tone, accent, and emotion separately from timbre.

**What zero-shot cloning cannot do**: clone a singing voice (requires specialized models like RVC), clone pathological speech patterns, maintain speaker identity beyond 60-90 seconds of continuous speech (drift begins, prosody flattens), or produce emotional range equivalent to the source speaker.

Critical distinction: [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M) is a TTS model with fixed pre-defined voices — it does NOT clone arbitrary voices. Readers frequently confuse Kokoro with cloning-capable TTS.
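The SIM scores quoted above are, in most evaluation setups, cosine similarities between speaker embeddings of the reference and the generated audio. A minimal sketch of that comparison, using toy 4-dimensional vectors (real extractors such as ECAPA-TDNN or WavLM emit 192-512 dimensions; the vectors here are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    # SIM score: cosine of the angle between two speaker-embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings, not real extractor output.
reference = [0.9, 0.1, 0.4, 0.2]
clone     = [0.8, 0.2, 0.5, 0.1]
stranger  = [-0.3, 0.9, -0.2, 0.6]

print(round(cosine_similarity(reference, clone), 3))     # high: same speaker
print(round(cosine_similarity(reference, stranger), 3))  # low: different speaker
```

The same arithmetic underlies the 0.70 review threshold in the production section: embed both clips with the same extractor and compare the cosine.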

If you just want to try this

Lowest-friction path to a working setup.

Download [TTS Generation WebUI](/tools/tts-generation-webui) (a Gradio-based frontend bundling XTTS-v2, F5-TTS, and other TTS engines). This is the simplest path: the WebUI handles model download, quantization, and reference audio upload. No CLI or Python required.

Record a 6-10 second WAV clip of the target voice — one sentence of neutral speech in a quiet room, exported as 16 kHz mono WAV. The cleaner the clip, the better the clone. Avoid WhatsApp voice messages (badly compressed), background music, overlapping speech, or heavy reverb.

Upload the reference clip to the WebUI, type text, hit generate. First output takes 10-30 seconds on a GPU; subsequent generations are faster due to caching. Expect robotic initial results — adjust temperature (0.3-0.7) and speed until naturalness converges. Lower temperature = more stable but flatter; higher = more expressive but riskier artifacts.

CPU-only generation is possible with XTTS-v2 but slow — 30-60 seconds per 10-second utterance. A GPU ([RTX 3060 12GB](/hardware/rtx-3060-12gb)) reduces this to 2-5 seconds. F5-TTS is GPU-only in practice — CPU generation exceeds 5 minutes per utterance.

If you get poor results, the reference audio is usually the problem. Re-record with better mic placement (6-12 inches from the mouth), eliminate room echo (record in a carpeted room with soft furnishings), and ensure the speaker maintains consistent pitch and pace throughout the clip.
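The 16 kHz mono / 6-second spec above is easy to validate before uploading. A minimal sketch using only Python's standard-library `wave` module (the function name `reference_ok` and the demo file name are invented for this example; the demo writes a silent clip just to exercise the check):

```python
import wave

def reference_ok(path, min_rate=16000, min_seconds=6.0):
    # Checks the reference clip meets the minimum spec: mono, 16 kHz+, 6 s+.
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()
        rate = w.getframerate()
        seconds = w.getnframes() / rate
    return channels == 1 and rate >= min_rate and seconds >= min_seconds

# Demo: write a 7-second silent 16 kHz mono WAV and validate it.
with wave.open("ref_demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)          # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000 * 7)

print(reference_ok("ref_demo.wav"))  # True
```

A real recording of course also needs to be clean, which this spec check cannot verify; it only catches the format mistakes (stereo export, 8 kHz voice-message rate, too-short clip).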

For production deployment

Operator-grade recommendation.

Production voice cloning requires consistency, ethical guardrails, and throughput — problems absent in one-off demo use.

**Consistency at scale**: Zero-shot models drift on long passages. Implement reference re-injection: every 20-30 seconds of generated speech, re-inject the original reference audio embedding to reset speaker identity. This reduces SIM score degradation from 0.05-0.10 (over 60s) to 0.01-0.02. Implementation: chunk text into 20-30s audio segments, prepend the reference embedding to each chunk.

**Ethical and legal guardrails**: Voice cloning without consent is illegal in multiple jurisdictions. Production systems need (a) explicit consent recording with a timestamped audit trail, (b) speaker verification before cloning (a voice biometric confirming the reference matches the registered speaker), (c) watermarking on output audio (AudioSeal by Meta — inaudible, detectable, encoding speaker ID + generation timestamp). Budget $0.001-0.005/second of output for watermark overhead.

**Voice verification pipeline**: Run speaker embedding comparison (ECAPA-TDNN or WavLM) between the reference and every output. Flag outputs below SIM 0.70 for human review. This catches degraded references producing unidentifiable clones.

**Throughput optimization**: For batch generation (voicing 100 hours of educational content), XTTS-v2 with [vLLM](/tools/vllm)-style continuous batching reduces generation time 40-60% vs sequential. [RTX 4090](/hardware/rtx-4090) handles 3-5 concurrent streams; [NVIDIA L40S](/hardware/nvidia-l40s) handles 8-12. For real-time streaming (voice assistants, live dubbing), F5-TTS's flow-matching supports sub-500ms first-chunk latency on [RTX 4070](/hardware/rtx-4070)-class GPUs.

**Voice library management**: Store speaker embedding vectors in [pgvector](/tools/pgvector) — ~512 floats per speaker, under 2 KB. Query: "find speaker matching this voice sample" → nearest neighbor search. Much more compact than raw reference audio storage.
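The chunking step for reference re-injection can be sketched in a few lines. This is an illustrative splitter, not library code: it estimates spoken duration from word count (the 150 words-per-minute rate is an assumption; measure your own narrator) and cuts at sentence boundaries so each chunk stays under the re-injection window:

```python
import re

def chunk_text(text, max_seconds=25.0, words_per_minute=150):
    # Split text at sentence boundaries into chunks whose estimated spoken
    # duration stays under max_seconds, so a fresh reference embedding can
    # be re-injected per chunk and speaker identity does not drift.
    max_words = max_seconds * words_per_minute / 60
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

long_text = "One two three. " * 40             # ~120 words of dummy narration
chunks = chunk_text(long_text, max_seconds=10)  # ~25 estimated words per chunk
print(len(chunks))
```

Each chunk then gets synthesized with the original reference clip prepended or its embedding re-supplied, per the model's API.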

What breaks

Failure modes operators see in the wild.

- **Speaker bleed (reference environment baked in).** The cloned voice carries acoustic artifacts from the reference recording — room tone, microphone color, background hum. Generated speech sounds like the target speaker in the wrong room. Mitigation: source separation (Demucs) on reference audio before embedding extraction to isolate the dry voice. Residual artifacts remain at 5-10%.
- **Low-quality source degradation.** Compressed references (WhatsApp, Zoom at 8 kHz, MP3 96 kbps) produce a "tinny" clone — high-frequency timbral detail lost in compression cannot be reconstructed. Mitigation: enforce minimum reference quality — 16 kHz, 16-bit, uncompressed, SNR >30 dB. Reject sub-threshold references rather than producing degraded output.
- **Language mismatch artifacts.** Cloning an English speaker's voice to generate Chinese produces accent errors — the model maps English phonemes to Chinese tones, causing tonal mistakes. Voice timbre is correct but prosody is wrong-language. Mitigation: match reference language to target language. For cross-language cloning, use XTTS-v2's cross-language mode, accepting a 10-15% quality reduction.
- **Emotional flatness.** Zero-shot cloning captures speaker identity but not expressiveness. Output defaults to a neutral reading style. Mitigation: provide per-emotion reference clips (happy, sad, urgent) and select the emotion-matched clip at generation time. This is per-emotion reference management, not automatic style transfer.
- **Long-passage drift.** Speaker identity degrades after 60-90 seconds — SIM drops 0.08-0.15. Mitigation: segment text into <30-second chunks, re-extract and re-inject the speaker embedding per chunk. Adds 5-10% latency but preserves identity.
- **Artifact-on-silence.** Punctuation implying silence (periods, paragraph breaks) generates synthetic vocal fry or breath sounds absent from the reference. Mitigation: post-process with a silence detector, replace artifacts with true silence or room tone from the reference.
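The SNR >30 dB gate mentioned above can be approximated without any audio library. A rough sketch, assuming speech and noise separate into loud and quiet frames (real pipelines use a proper voice-activity detector; the decile heuristic and synthetic clip here are invented for illustration):

```python
import math

def estimate_snr_db(samples, frame_size=400):
    # Crude SNR: ratio of loud-frame energy (speech) to quiet-frame
    # energy (noise floor), taken from the top and bottom energy deciles.
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size, frame_size)]
    energies = sorted(sum(s * s for s in f) / frame_size for f in frames)
    decile = max(1, len(energies) // 10)
    noise = sum(energies[:decile]) / decile
    speech = sum(energies[-decile:]) / decile
    return 10 * math.log10(speech / max(noise, 1e-12))

# Synthetic clip: low-level noise with a loud "speech" burst in the middle.
noise = [0.001 * ((i % 7) - 3) for i in range(8000)]
speech = [0.5 * math.sin(i / 10) for i in range(8000)]
clip = noise + speech + noise
print(estimate_snr_db(clip) > 30)  # clean clip clears the gate
```

Run the estimate at intake and reject references below the threshold, per the mitigation above, rather than shipping a degraded clone.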

Hardware guidance

**Hobbyist (any GPU 6+ GB)**: [RTX 3060 12GB](/hardware/rtx-3060-12gb) runs XTTS-v2 at 2-5s per 10s utterance. F5-TTS requires 8+ GB. CPU-only is viable for XTTS-v2 batch — a modern 8-core CPU produces ~30-60s of speech per minute of compute. Acceptable for occasional use. [Apple M2 Ultra](/hardware/apple-m2-ultra) or [M3 Ultra](/hardware/apple-m3-ultra) Macs run XTTS-v2 via CPU fallback with acceptable latency for interactive use.

**SMB ($1,500-$3,000)**: [RTX 4070 Ti](/hardware/rtx-4070-ti) or [RTX 5070 Ti](/hardware/rtx-5070-ti) (12-16 GB). Runs both XTTS-v2 and F5-TTS with <2s generation per utterance. Supports 2-3 concurrent streams for small production (weekly podcast, YouTube voiceovers).

**Enterprise ($5,000-$15,000)**: [RTX 4090 24GB](/hardware/rtx-4090) or [RTX 5090 32GB](/hardware/rtx-5090). Runs XTTS-v2 + F5-TTS simultaneously on one GPU. 4-8 concurrent streams. [NVIDIA L40S](/hardware/nvidia-l40s) 48 GB handles 8-16 concurrent streams — the right choice for multi-tenant voice cloning as a service.

**Frontier**: Not applicable. Voice cloning is lightweight — XTTS-v2 is ~1.7B params, F5-TTS ~335M. Diminishing returns past a single L40S or RTX 4090. Scaling is horizontal (more GPUs for concurrent streams), not vertical. For sub-200ms first-chunk latency on real-time cloning, add a dedicated low-latency GPU ([RTX 4070 Super](/hardware/rtx-4070-super) or better).

**Edge/mobile**: [Snapdragon 8 Gen 3](/hardware/snapdragon-8-gen-3) and [Snapdragon 8 Elite](/hardware/snapdragon-8-elite) run quantized XTTS-v2 on-device via [llama.cpp](/tools/llama-cpp) at 5-15s per utterance. F5-TTS is not practical on-device.
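The concurrent-stream counts above follow from a simple VRAM budget: weights load once, each stream adds its own activation and buffer overhead. A back-of-envelope sketch where every figure (3 GB weights, 3 GB per stream, 1 GB headroom) is an illustrative assumption, not a measurement:

```python
def max_streams(gpu_vram_gb, model_gb=3.0, per_stream_gb=3.0, headroom_gb=1.0):
    # Model weights load once; each concurrent stream adds activation and
    # buffer overhead. All figures are illustrative assumptions.
    usable = gpu_vram_gb - model_gb - headroom_gb
    return max(0, int(usable // per_stream_gb))

for name, vram in [("RTX 3060", 12), ("RTX 4090", 24), ("L40S", 48)]:
    print(name, max_streams(vram))
```

Measure your real per-stream footprint (peak VRAM with one stream minus idle VRAM with weights loaded) and substitute it before sizing hardware.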

Runtime guidance

**If you want the simplest GUI** → [TTS Generation WebUI](/tools/tts-generation-webui) (Gradio-based). Bundles XTTS-v2, F5-TTS, Bark. Handles model download, reference upload, and parameter tuning without code. Supports batch generation and speaker library management.

**If you need programmatic API access** → Coqui AI TTS library (`pip install TTS`) for [XTTS-v2](/tools/xtts-v2), or `pip install f5-tts` for [F5-TTS](/tools/f5-tts). Both expose: load model, load reference, synthesize. Both support streaming via an iterator pattern for real-time playback.

**If you need real-time streaming** → F5-TTS with streaming inference. Lower first-chunk latency (200-400ms vs XTTS-v2's 500ms-1s) because flow matching generates audio directly, while XTTS-v2 generates mel spectrograms sequenced through a vocoder. For live dubbing or voice assistants.

**If you need batch production** → Wrap either model in FastAPI + a job queue (Redis/RabbitMQ). Submit text + reference → get job ID → poll for completed audio. This handles GPU contention and provides failure recovery. Use continuous batching for multi-stream synthesis.

**If you need voice conversion (not TTS)** → OpenVoice (MyShell) via its Gradio demo or Python API. Converts existing recorded speech to another speaker's voice while preserving linguistic content and prosody. Useful for dubbing or anonymization.

**Important boundary**: [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M) produces high-quality TTS with pre-defined voices but does NOT clone arbitrary voices. If someone suggests Kokoro for cloning, redirect to XTTS-v2 or F5-TTS.
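The submit → job ID → poll pattern for batch production can be sketched in pure Python before committing to FastAPI and Redis. Everything here is a stand-in: `synthesize` is a stub where the real XTTS-v2 or F5-TTS call would go, and the in-memory dict replaces a persistent job store:

```python
import queue
import threading
import uuid

jobs = {}                       # job_id -> {"status": ..., "result": ...}
work = queue.Queue()

def synthesize(text, reference):
    # Stub standing in for a real XTTS-v2 / F5-TTS call.
    return f"audio({len(text)} chars, ref={reference})"

def worker():
    while True:
        job_id = work.get()
        if job_id is None:      # shutdown sentinel
            break
        job = jobs[job_id]
        job["result"] = synthesize(job["text"], job["reference"])
        job["status"] = "done"
        work.task_done()

def submit(text, reference):
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued", "text": text, "reference": reference}
    work.put(job_id)
    return job_id

threading.Thread(target=worker, daemon=True).start()
job = submit("Hello from the cloned voice.", "reference.wav")
work.join()                     # in production, clients poll instead
print(jobs[job]["status"])      # done
```

A single worker thread serializes GPU access, which is the point: the queue absorbs bursts while the GPU stays fully utilized on one job at a time (or one continuous batch).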

Setup walkthrough

  1. `pip install f5-tts` (F5-TTS: zero-shot voice cloning from 5-10 seconds of reference audio).
  2. Download the model (automatic on first run, ~1.2 GB).
  3. Record a clean 10-second audio clip of the target voice: `arecord -f cd -d 10 reference.wav` (Linux) or use Voice Recorder (Windows/Mac).
  4. Minimal Python script (check the f5-tts README for the exact class and argument names in your installed version):

```python
from f5_tts import F5TTS

tts = F5TTS("F5-TTS", "cuda")
tts.infer("Hello, I am a cloned voice speaking this sentence.",
          ref_audio="reference.wav", output="cloned.wav")
```

  5. First cloned audio in 5-15 seconds on an 8+ GB GPU. Quality scales with reference audio quality: use a quiet room, no reverb.
  6. Alternative: `pip install TTS` for XTTS-v2 (`tts --model_name tts_models/multilingual/multi-dataset/xtts_v2`), which gives better speaker consistency but slower generation.

The cheap setup

F5-TTS needs roughly 4-6 GB of VRAM for inference, so a used GTX 1660 Super 6 GB ($100) is tight but workable, running zero-shot cloning at 2-3 seconds per 10 seconds of generated audio (8 GB is more comfortable). Kokoro ONNX runs on CPU at 2-5× real-time but doesn't clone — it uses preset voices. For voice cloning on a $300 budget: used GTX 1660 Super 6 GB ($100) + refurbished Dell Optiplex ($150) + 16 GB RAM ($30). Total: ~$280. If you only need preset voices (no cloning), any $300 laptop with Kokoro ONNX works fine.

The serious setup

A used [RTX 3060 12 GB](/hardware/rtx-3060-12gb) ($200-250) is arguably overkill — voice cloning is VRAM-light (4-8 GB needed). F5-TTS generates 10 seconds of cloned speech in ~1 second. XTTS-v2 in streaming mode generates cloned-voice audio in real time. For dubbing pipelines (STT → translate → clone → TTS), pair it with a Ryzen 7 7700X + 32 GB DDR5 + 1 TB NVMe. Total: ~$800-1,000. Voice cloning is one of the most GPU-efficient tasks; even an RTX 2060 6 GB ($120 used) handles production workloads.
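The dubbing pipeline mentioned above is just function composition: each stage consumes the previous stage's output. A structural sketch with stub stages (every function body here is a placeholder; in a real build, `transcribe` would call Whisper, `translate` a translation model, and `clone_tts` XTTS-v2 or F5-TTS behind the same signatures):

```python
def transcribe(audio):
    # Stub for an STT stage (e.g. a Whisper call).
    return "hello world"

def translate(text, target_lang):
    # Stub for a translation stage; tiny lookup table for illustration.
    lookup = {("hello world", "de"): "hallo welt"}
    return lookup.get((text, target_lang), text)

def clone_tts(text, reference):
    # Stub for the voice-cloned TTS stage (XTTS-v2 / F5-TTS).
    return f"audio[{text}|voice={reference}]"

def dub(audio, reference, target_lang):
    # STT -> translate -> clone, each stage swappable behind its signature.
    return clone_tts(translate(transcribe(audio), target_lang), reference)

print(dub("source.wav", "speaker_ref.wav", "de"))
```

Keeping the stage signatures stable is what lets you swap models (say, a faster STT) without touching the rest of the pipeline.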

Common beginner mistake

**The mistake:** Recording reference audio from a laptop microphone in a noisy room with reverb, then wondering why the cloned voice sounds robotic.

**Why it fails:** Voice cloning models extract speaker characteristics from the reference audio. Background noise, room echo, and low-quality microphones get embedded in the voice profile — the model can't distinguish "speaker voice" from "room acoustics."

**The fix:** Record in a quiet room with soft furnishings (curtains, carpet). Use a $50-100 USB condenser mic (Blue Yeti Nano, Samson Q2U). Stand 6-12 inches from the mic. Keep reference audio 5-15 seconds, spoken at a natural pace. Clean reference → clean clone. Bad reference → uncanny valley clone.


Reality check

Audio models are surprisingly forgiving on hardware. Whisper (including whisper.cpp) and Coqui TTS all run well on 8-12 GB GPUs. The bottleneck is rarely the GPU; it's audio preprocessing and disk I/O for batch transcription.

Common mistakes

  • Overspending on GPU for audio-only workflows (8-12 GB is enough for Whisper)
  • Running audio + LLM concurrently without budgeting VRAM
  • Using fp32 weights when fp16 / int8 give a 2-3x speedup with negligible quality loss
  • Forgetting audio preprocessing eats CPU cycles — a fast SSD helps more than expected
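The "audio + LLM concurrently without budgeting VRAM" mistake is avoidable with one addition before loading. A sketch of that budget check (all footprint figures are illustrative assumptions, not measurements; profile your own models):

```python
def fits(gpu_vram_gb, workloads, headroom_gb=1.0):
    # True if all resident models plus headroom fit in VRAM at once.
    return sum(workloads.values()) + headroom_gb <= gpu_vram_gb

workloads = {
    "xtts_v2_fp16": 4.0,       # illustrative footprints, not measured
    "whisper_medium": 5.0,
    "llm_7b_q4": 5.5,
}
print(fits(12, workloads))                       # all three on a 12 GB card?
print(fits(12, {"xtts_v2_fp16": 4.0, "llm_7b_q4": 5.5}))
```

When the check fails, drop a model to CPU, quantize further, or stagger the workloads rather than letting the CUDA allocator discover the problem mid-generation.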


Hardware buying guidance for Voice Cloning

Voice cloning, TTS, and audio generation are VRAM-light compared to LLM workloads; 8-16 GB covers nearly everything, and the common mistake is oversizing the GPU rather than undersizing it.
