RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Voice AI with Local Models

07. TTS: XTTS-v2

Chapter 7 of 22 · 15 min
KEY INSIGHT

XTTS-v2 voice cloning enables personalized experiences but requires clean reference audio and patience for generation time.

XTTS-v2 offers voice cloning from brief audio samples plus multi-language support. The model processes text and reference audio, producing speech matching the reference voice's characteristics.

Installation from source:

git clone https://github.com/coqui-ai/TTS
cd TTS
pip install -e .

Voice cloning inference:

from TTS.api import TTS

# Load XTTS-v2 (the model ID for this chapter's model, not Tortoise)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone voice from reference audio
tts.tts_to_file(
    text="This audio matches the speaker characteristics from the reference.",
    speaker_wav="reference_voice.wav",
    file_path="output.wav",
    language="en"
)

The reference audio should contain clear speech lasting 3-30 seconds. Longer samples provide more voice characteristics but increase processing time.
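Before spending generation time on a clip, it can help to confirm that the recording falls inside the 3-30 second window. The following is a minimal sketch using only the standard library; it assumes the reference is an uncompressed PCM WAV file, and the helper names are illustrative rather than part of any TTS package.

```python
import wave

MIN_SECONDS = 3.0   # lower bound from the guidance above
MAX_SECONDS = 30.0  # upper bound from the guidance above


def reference_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference(path: str) -> tuple[bool, str]:
    """Report whether a reference clip falls inside the 3-30 second window."""
    duration = reference_duration_seconds(path)
    if duration < MIN_SECONDS:
        return False, f"{duration:.1f}s is too short; record at least {MIN_SECONDS:.0f}s"
    if duration > MAX_SECONDS:
        return False, f"{duration:.1f}s is too long; trim to {MAX_SECONDS:.0f}s or less"
    return True, f"{duration:.1f}s is within range"
```

Calling check_reference("reference_voice.wav") returns a pass/fail flag plus a short message. Duration is only a proxy: a clip of the right length can still be noisy, so listening to it remains the real check.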

Voice embedding persistence across sessions requires saving speaker embeddings:

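import torch

# Compute the voice conditioning once from the reference clip.
# Assumes the loaded model is XTTS and exposes get_conditioning_latents().
model = tts.synthesizer.tts_model
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["reference_voice.wav"])

# Save both tensors so a later session can reuse the voice without the clip
torch.save({"gpt_cond_latent": gpt_cond_latent, "speaker_embedding": speaker_embedding}, "voice_embedding.pt")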
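One way to organize persistence across sessions is a small on-disk cache keyed by a hash of the reference audio, so that re-recording the clip invalidates the saved voice automatically. This is a sketch under stated assumptions: load_or_compute_voice and its compute_fn argument are hypothetical names, and the cache stores whatever object the caller's function returns.

```python
import hashlib
import pickle
from pathlib import Path
from typing import Any, Callable


def file_digest(path: str) -> str:
    """Hash the reference audio so edits to the clip invalidate the cache."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def load_or_compute_voice(
    reference_wav: str,
    compute_fn: Callable[[str], Any],
    cache_dir: str = "voice_cache",
) -> Any:
    """Return cached voice data for a clip, computing and saving it on a miss.

    compute_fn is a hypothetical callable supplied by the caller: it takes the
    reference path and returns whatever the model needs (an embedding array,
    a tuple of tensors, and so on).
    """
    cache_path = Path(cache_dir) / f"{file_digest(reference_wav)}.pkl"
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)

    # Cache miss: compute once, then persist for later sessions
    voice = compute_fn(reference_wav)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as fh:
        pickle.dump(voice, fh)
    return voice
```

Because pickle can execute code on load, this pattern is only appropriate for cache files written by the same local workspace.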

XTTS-v2 supports 17 languages, with quality varying by language. Cross-language cloning works from a single reference clip: pass a different language code and the model keeps the cloned voice, although accent carry-over and pronunciation accuracy can vary from one language to the next.
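A quick way to judge how a cloned voice holds up across languages is to render the same reference clip against several language codes and listen to the results side by side. The sketch below assumes a tts object exposing tts_to_file with the keyword arguments used in the inference example; the function name and output layout are illustrative.

```python
from pathlib import Path


def synthesize_in_languages(tts, text_by_language, speaker_wav, out_dir="lang_test"):
    """Render one clip per language from the same reference voice.

    tts is assumed to expose tts_to_file(text=, speaker_wav=, language=,
    file_path=) as in the inference example in this chapter.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for language, text in text_by_language.items():
        target = out_path / f"clone_{language}.wav"
        tts.tts_to_file(
            text=text,
            speaker_wav=speaker_wav,
            language=language,
            file_path=str(target),
        )
        written.append(str(target))
    return written
```

For example, synthesize_in_languages(tts, {"en": "Good morning.", "es": "Buenos días."}, "reference_voice.wav") would write clone_en.wav and clone_es.wav for comparison.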

Generation speed varies significantly by model and hardware. Tortoise, also available in the same library, produces high-quality speech but runs slowly. Faster alternatives trade quality for throughput. Benchmark on target hardware before committing to a deployment architecture.
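A common way to compare speed is the real-time factor (RTF): synthesis time divided by the duration of the audio produced. The following is a minimal benchmarking sketch; the synthesize callable is a hypothetical wrapper the reader supplies, and nothing here is specific to any one model.

```python
import statistics
import time
from typing import Callable


def real_time_factor(synth_seconds: float, audio_seconds: float) -> float:
    """RTF below 1.0 means synthesis is faster than playback."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synth_seconds / audio_seconds


def benchmark(synthesize: Callable[[str], float], text: str, runs: int = 3) -> dict:
    """Time repeated synthesis calls and summarize their real-time factors.

    synthesize is a hypothetical wrapper supplied by the caller: it generates
    speech for text and returns the duration of the produced audio in seconds.
    """
    factors = []
    for _ in range(runs):
        start = time.perf_counter()
        audio_seconds = synthesize(text)
        elapsed = time.perf_counter() - start
        factors.append(real_time_factor(elapsed, audio_seconds))
    return {
        "runs": runs,
        "median_rtf": statistics.median(factors),
        "worst_rtf": max(factors),
    }
```

In practice the wrapper would call the model, write the output file, and return its length in seconds. The first run often includes model warm-up, so the median is usually a fairer number than the worst case.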

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, reference clip length, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
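The details the checkpoint asks for can be captured in one small record. This sketch uses only the standard library; it assumes the Coqui package is installed under the distribution name TTS, and the function names are illustrative.

```python
import json
import platform
import sys
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a package, or a marker if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def verification_record(data_path: str, observed_output: str, constraints: str = "") -> dict:
    """Collect the details the checkpoint asks for into one dictionary."""
    return {
        "package": "TTS",  # assumed distribution name of the Coqui package
        "package_version": package_version("TTS"),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "data_path": data_path,
        "observed_output": observed_output,
        "constraints": constraints,
    }


if __name__ == "__main__":
    record = verification_record("reference_voice.wav", "output.wav written", "CPU only")
    print(json.dumps(record, indent=2))
```

Saving the printed JSON next to the exercise output keeps the run reproducible when the hardware or package version changes later.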

EXERCISE

Record a 10-second voice sample and generate cloned speech. Compare the output against the original reference in terms of clarity, prosody, and voice similarity. (15 minutes)
