06. TTS Options: Kokoro
Kokoro provides efficient neural text-to-speech through ONNX Runtime, so inference runs on a CPU with no GPU required. The model produces natural-sounding English voices and generates audio quickly.
Install Kokoro and dependencies:
pip install kokoro-onnx soundfile numpy
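Download the model and the voices file. The kokoro-onnx project publishes both as release assets, and all voice variants ship in a single voices file rather than as one file per voice:
# Download the ONNX model and the combined voices file
wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.onnx
wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/voices-v1.0.bin
Basic inference pipeline:
import soundfile as sf
from kokoro_onnx import Kokoro
# First argument: model file; second argument: voices file
kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
text = "Hello, this is a voice synthesis test."
# create() returns the samples together with their sample rate
samples, sample_rate = kokoro.create(text, voice="af_sarah", speed=1.0, lang="en-us")
sf.write("output.wav", samples, sample_rate)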
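The create() method takes the text plus voice, speed, and lang arguments. SSML is not listed among those parameters, so confirm that the installed version interprets the markup shown later in this section before relying on it for pronunciation control. The English voice library covers American and British accents, and voice names encode accent and gender (for example, af_ for American female and bm_ for British male).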
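Kokoro excels at the short-to-medium utterances typical of response-driven conversations. The amount of input text it can process in one pass is limited, so split longer texts at sentence boundaries, synthesize each piece, and concatenate the audio. Splitting at sentence boundaries keeps prosody natural because each piece is a complete unit of intonation.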
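The sentence-level splitting described above can be sketched as follows. This is a minimal illustration, not part of the Kokoro API: the function name split_into_chunks and the 300-character budget are assumptions chosen for the example, and the regular expression only recognizes sentences ending in ".", "!", or "?".

```python
import re


def split_into_chunks(text: str, max_chars: int = 300) -> list[str]:
    """Split text at sentence boundaries and pack sentences into chunks.

    max_chars is an illustrative budget (an assumption), not a documented
    Kokoro limit. A single sentence longer than max_chars is kept whole.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Usage sketch (requires numpy and the model files, so it is left commented):
# import numpy as np
# parts = [kokoro.create(c, voice="af_sarah")[0] for c in split_into_chunks(long_text)]
# audio = np.concatenate(parts)
```

Packing several short sentences into one chunk reduces the number of synthesis calls while never cutting a sentence in half.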
Prosody customization through SSML:
<speak>
<prosody rate="0.9" pitch="-2st">
This sentence is slightly slower and lower.
</prosody>
</speak>
Small rate and pitch adjustments let speech fit timing requirements without flattening its emotional tone. Experiment with parameter ranges to find natural-sounding settings for each content type.
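If the installed engine does not interpret SSML itself, the rate attribute can still be honored by parsing the markup and passing each segment's rate as a speed value. The sketch below assumes numeric rate values such as "0.9" and ignores pitch, since no matching synthesis parameter is assumed; prosody_segments is a hypothetical helper, not a library function.

```python
import xml.etree.ElementTree as ET


def _parse_rate(value, default):
    """Return a numeric rate, falling back to default for labels like 'slow'."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


def prosody_segments(ssml: str, default_rate: float = 1.0) -> list[tuple[str, float]]:
    """Extract (text, rate) pairs from a <speak> document.

    Only top-level <prosody> elements are treated specially; the pitch
    attribute is ignored in this sketch.
    """
    root = ET.fromstring(ssml)
    segments: list[tuple[str, float]] = []

    # Text placed directly inside <speak> before the first child element.
    if root.text and root.text.strip():
        segments.append((root.text.strip(), default_rate))

    for child in root:
        text = "".join(child.itertext()).strip()
        if child.tag == "prosody":
            rate = _parse_rate(child.get("rate"), default_rate)
        else:
            rate = default_rate
        if text:
            segments.append((text, rate))
        # Text that follows the child element, still inside <speak>.
        if child.tail and child.tail.strip():
            segments.append((child.tail.strip(), default_rate))
    return segments


# Usage sketch (requires the model files, so it is left commented):
# for text, rate in prosody_segments(ssml):
#     samples, sample_rate = kokoro.create(text, voice="af_sarah", speed=rate)
```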
A common failure mode is the mispronunciation of words that are missing from the phoneme dictionary. When an unusual word renders incorrectly, provide a phonetic spelling or use an SSML phoneme tag:
<phoneme alphabet="ipa" ph="ˈɛksələn">Excelon</phoneme>
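The phonetic-spelling alternative can be applied as a preprocessing step before synthesis. The following is a minimal sketch: the RESPELLINGS table, its single entry, and the apply_respellings helper are hypothetical examples, and the respelling itself should be tuned by ear.

```python
import re

# Hypothetical table: word as written -> spelling the voice reads correctly.
RESPELLINGS = {"Excelon": "Ex-uh-lon"}


def apply_respellings(text: str, table: dict[str, str]) -> str:
    """Replace whole-word, case-sensitive matches with their respellings."""
    if not table:
        return text
    # Longest keys first so a longer entry wins over a shorter prefix.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: table[match.group(1)], text)


# Usage sketch: preprocess the text, then pass the result to the synthesizer.
# text = apply_respellings("Excelon reported earnings.", RESPELLINGS)
```

Keeping the table in one place makes it easy to extend as new problem words turn up during listening tests.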
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Generate spoken output for a paragraph of text using two different voice presets, then compare the prosody and clarity of the resulting audio files. (15 minutes)
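Listening remains the main test for this exercise, but a few objective numbers make the comparison easier to record. The sketch below reads a 16-bit PCM WAV file with the standard library and reports duration, RMS level, and peak level. It assumes the files were written as 16-bit PCM; the wav_stats helper and the file names in the usage comment are hypothetical.

```python
import array
import math
import sys
import wave


def wav_stats(path: str) -> dict:
    """Return duration (seconds), RMS, and peak (both scaled to 0..1)."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        rate = wf.getframerate()
        n_frames = wf.getnframes()
        frames = wf.readframes(n_frames)

    samples = array.array("h", frames)
    if sys.byteorder == "big":
        # WAV data is little-endian; swap on big-endian machines.
        samples.byteswap()

    if len(samples) == 0:
        return {"duration_s": 0.0, "rms": 0.0, "peak": 0.0}

    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / 32768
    peak = max(abs(s) for s in samples) / 32768
    return {"duration_s": n_frames / rate, "rms": rms, "peak": peak}


# Usage sketch with hypothetical file names from the exercise:
# for name in ("paragraph_af_sarah.wav", "paragraph_bf_emma.wav"):
#     print(name, wav_stats(name))
```

A noticeably longer duration for the same text indicates a slower speaking rate, which is worth noting next to the subjective impressions of prosody and clarity.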