TTS Options: Kokoro — Voice AI with Local Models (Chapter 6)

Kokoro provides efficient neural text-to-speech through ONNX Runtime, so inference runs on a CPU with no GPU required. The model produces natural-sounding English voices and generates audio quickly.

Install Kokoro and dependencies:

pip install kokoro-onnx soundfile numpy

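Download a voice pack. The repository provides multiple voice variants. The inference pipeline in this section also loads the model file kokoro-v1.0.onnx, so place that file in the same working directory: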

# Download a specific voice
wget https://github.com/remsky/KokoroOnnx/raw/refs/heads/main/voices/af_sarah.onnx

Basic inference pipeline:

import soundfile as sf
from kokoro_onnx import Kokoro

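# Load the model file and the voice data downloaded earlier
kokoro = Kokoro("kokoro-v1.0.onnx", "af_sarah.onnx")

text = "Hello, this is a voice synthesis test."
# create() returns the audio samples together with their sample rate
samples, sample_rate = kokoro.create(text, voice="af_sarah")

# Write the samples at the rate the model reports (24 kHz for Kokoro)
sf.write("output.wav", samples, sample_rate)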

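The create() method takes the text to speak and a voice name. SSML markup for pronunciation control is described later in this section, but tag handling can differ between kokoro-onnx releases, so check that tags are interpreted rather than spoken aloud before relying on them. The English voice library includes American and British accents; in a voice name such as af_sarah, the first letter marks the accent (a for American, b for British) and the second marks the gender.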

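Kokoro works best on the short-to-medium utterances typical of conversational responses. Each call accepts only a limited amount of input text, so split longer passages at sentence boundaries, synthesize each piece, and concatenate the audio. Splitting at sentence boundaries keeps prosody natural because each piece is a complete intonation unit.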
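
The split-and-concatenate approach can be sketched as follows. This is a minimal illustration and not part of kokoro-onnx: split_into_chunks and synthesize_long are hypothetical helper names, the 200-character budget is an arbitrary assumption rather than Kokoro's actual limit, and synth stands for any callable that returns (samples, sample_rate).

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    The 200-character default is an assumed budget, not a documented limit.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def synthesize_long(text, synth, max_chars=200):
    """Call synth(chunk) -> (samples, sample_rate) per chunk and join the audio."""
    joined, rate = [], None
    for chunk in split_into_chunks(text, max_chars):
        samples, rate = synth(chunk)
        joined.extend(samples)
    return joined, rate


if __name__ == "__main__":
    # Stand-in synthesizer: one silent sample per character at 24 kHz.
    def fake_synth(chunk):
        return [0.0] * len(chunk), 24000

    audio, rate = synthesize_long("First sentence. Second one! A third?", fake_synth, 20)
    print(len(audio), rate)
```

With the Kokoro instance from the basic pipeline, synth could be a small wrapper around kokoro.create with a fixed voice. For real audio arrays, numpy.concatenate is a more efficient way to join the pieces than extending a Python list.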

Prosody customization through SSML:

<speak>
  <prosody rate="0.9" pitch="-2st">
    This sentence is slightly slower and lower.
  </prosody>
</speak>

Small rate and pitch adjustments let an utterance fit timing requirements while preserving its emotional tone. Experiment across a range of values to find settings that sound natural for each content type.
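
To experiment with parameter ranges systematically, the prosody wrapper can be generated rather than typed by hand. The sketch below only builds SSML strings in the same shape as the example above; prosody_ssml is a hypothetical helper name, pitch is assumed to be a whole number of semitones, and whether a given engine honors the markup still has to be checked by listening.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody_ssml(text, rate=1.0, pitch_st=0):
    """Wrap text in <speak><prosody> with a rate multiplier and a semitone pitch shift.

    pitch_st must be an integer; it is rendered with an explicit sign, e.g. "-2st".
    """
    pitch = f"{pitch_st:+d}st"
    return (
        "<speak>\n"
        f"  <prosody rate={quoteattr(str(rate))} pitch={quoteattr(pitch)}>\n"
        f"    {escape(text)}\n"  # escape &, <, > so the markup stays well-formed
        "  </prosody>\n"
        "</speak>"
    )


if __name__ == "__main__":
    # Sweep a few candidate settings and print each variant for listening tests.
    for rate in (0.8, 0.9, 1.0):
        for pitch_st in (-2, 0):
            print(prosody_ssml("This sentence is a timing test.", rate, pitch_st))
```

Escaping the text matters in practice: an unescaped ampersand or angle bracket in model-generated responses would otherwise produce malformed markup.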

A common failure mode is a word missing from the phoneme dictionary. When an unusual word is rendered incorrectly, substitute a phonetic spelling or use an SSML phoneme tag:

<phoneme alphabet="ipa" ph="ˈɛksələn">Excelon</phoneme>
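
The phonetic-spelling alternative can be applied as a preprocessing step before the text reaches the synthesizer. The following is a minimal sketch under stated assumptions: apply_respellings is a hypothetical helper, and the respelling "EK-suh-lun" is an illustrative guess that should be tuned by ear for the voice in use.

```python
import re


def apply_respellings(text, respellings):
    """Replace whole-word matches (case-insensitive) with phonetic spellings."""
    if not respellings:
        return text
    # Try longer keys first so overlapping entries match the longest word.
    words = sorted(respellings, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
        flags=re.IGNORECASE,
    )
    lookup = {word.lower(): spelling for word, spelling in respellings.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


if __name__ == "__main__":
    # Illustrative entry only; adjust the spelling until the voice sounds right.
    respellings = {"Excelon": "EK-suh-lun"}
    print(apply_respellings("Excelon reported earnings today.", respellings))
```

Keeping the dictionary in one place makes pronunciation fixes reusable across every response, which the inline tag approach does not.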

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
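
One way to capture the package versions and runtime for this checkpoint is a short script built on the standard library. This is a sketch, not a required format: environment_record is a hypothetical helper name, and the package list simply mirrors the pip install line from this section.

```python
import json
import platform
from importlib import metadata


def environment_record(packages):
    """Collect the Python version, platform, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


if __name__ == "__main__":
    record = environment_record(["kokoro-onnx", "soundfile", "numpy"])
    print(json.dumps(record, indent=2))
```

Saving the printed JSON next to output.wav keeps the observed result tied to the environment that produced it.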