RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo

© 2026 runlocalai.coIndependently operated
Voice AI with Local Models

06. TTS Options: Kokoro

Chapter 6 of 22 · 20 min
KEY INSIGHT

Kokoro's strength lies in CPU-capable inference with acceptable quality for conversational responses under 100 words.

Kokoro provides efficient neural text-to-speech through ONNX Runtime, so inference runs on a CPU without requiring a GPU. The model produces natural-sounding English voices and generates audio quickly.

Install Kokoro and dependencies:

pip install kokoro-onnx soundfile numpy

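Download the model and the voice pack. The kokoro-onnx project publishes both as release assets, and a single voices file bundles the voice variants. Check the project's release page if the file names have changed:

# Download the model weights and the bundled voices
wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.onnx
wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/voices-v1.0.bin

Basic inference pipeline:

import soundfile as sf
from kokoro_onnx import Kokoro

# Load the ONNX model together with the voices file
kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")

text = "Hello, this is a voice synthesis test."
# create() returns the audio samples along with their sample rate
samples, sample_rate = kokoro.create(text, voice="af_sarah", speed=1.0, lang="en-us")

sf.write("output.wav", samples, sample_rate)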
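
To check generation speed on your own CPU, one common measure is the real-time factor: synthesis time divided by the duration of the audio produced, where values below 1.0 mean faster than real time. The sketch below is a minimal helper under stated assumptions: real_time_factor is a hypothetical name, not part of kokoro-onnx, and it accepts any callable that returns samples and a sample rate, which is the shape create() is assumed to return.

```python
import time


def real_time_factor(synth, text):
    """Time one synthesis call and relate it to the audio duration.

    synth is any callable mapping a string to (samples, sample_rate).
    Returns (rtf, audio_seconds); rtf below 1.0 is faster than real time.
    """
    start = time.perf_counter()
    samples, sample_rate = synth(text)
    elapsed = time.perf_counter() - start

    if len(samples) == 0:
        raise ValueError("synthesis returned no audio")

    audio_seconds = len(samples) / sample_rate
    return elapsed / audio_seconds, audio_seconds
```

With a loaded model, the callable could be something like `lambda t: kokoro.create(t, voice="af_sarah")`. Run it a few times and discard the first measurement, since the first call often includes one-time warm-up cost.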

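The create() method takes the text, a voice name, and optional speed and lang arguments. SSML handling is not part of the kokoro-onnx API as documented, so confirm that your installed version honors markup before relying on it for pronunciation control. The English voice library covers American and British accents (voice names prefixed with a and b, for example af_sarah or bf_emma).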

Kokoro excels at the short-to-medium utterances typical of response-driven conversations. Input length is limited per synthesis pass, so long passages should not be sent as one string. Breaking longer texts into sentences, synthesizing each one, and concatenating the audio keeps prosody natural.
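
The sentence-splitting approach can be sketched as follows. This is an illustrative helper, not part of kokoro-onnx: split_into_chunks, synthesize_long, and the 200-character default are assumptions chosen for the example, and the regex splitter is deliberately naive (it will break on abbreviations such as "Dr.").

```python
import re


def split_into_chunks(text, max_chars=200):
    """Split at sentence boundaries and pack sentences into chunks
    of at most max_chars. A single sentence longer than max_chars
    is kept whole rather than cut mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def synthesize_long(text, synth, max_chars=200):
    """Synthesize each chunk separately.

    synth is any callable mapping a string to (samples, sample_rate).
    Returns (list_of_sample_arrays, sample_rate); the caller joins them.
    """
    parts, rate = [], None
    for chunk in split_into_chunks(text, max_chars):
        samples, rate = synth(chunk)
        parts.append(samples)
    return parts, rate
```

With kokoro-onnx loaded, the parts could be joined with `numpy.concatenate(parts)` before writing the file. Splitting only at sentence boundaries is the design choice that matters here: each chunk then starts and ends where a speaker would naturally pause.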

Prosody customization through SSML:

<speak>
  <prosody rate="0.9" pitch="-2st">
    This sentence is slightly slower and lower.
  </prosody>
</speak>

Small adjustments to rate and pitch let a response fit timing requirements without flattening its emotional tone. Experiment with parameter ranges to find natural-sounding settings for each content type.
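
If the installed synthesizer does not interpret SSML itself, one option is to parse the markup in your own code and map the prosody rate onto a speed argument. The sketch below assumes numeric rate values such as "0.9" (not keywords like "slow" or percentages) and ignores pitch, since no pitch parameter is assumed to exist on the synthesis call. prosody_segments is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def prosody_segments(ssml):
    """Return a list of (text, rate) pairs from a <speak> document.

    Text outside any <prosody> element gets rate 1.0. Only numeric
    rate attributes are supported; pitch is ignored.
    """
    segments = []

    def walk(node, rate):
        if node.tag == "prosody":
            rate = float(node.get("rate", rate))
        if node.text and node.text.strip():
            segments.append((" ".join(node.text.split()), rate))
        for child in node:
            walk(child, rate)
            # Tail text sits after the child but inside this node,
            # so it takes this node's rate, not the child's.
            if child.tail and child.tail.strip():
                segments.append((" ".join(child.tail.split()), rate))

    walk(ET.fromstring(ssml), 1.0)
    return segments
```

Each pair can then be synthesized separately, for example by passing the rate as the speed argument of the synthesis call, and the resulting audio concatenated in order.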

A common failure mode is mispronunciation of words that are missing from the phoneme dictionary. When unusual words render incorrectly, provide a phonetic respelling or, if your pipeline parses SSML, use a phoneme tag:

<phoneme alphabet="ipa" ph="ˈɛksələn">Excelon</phoneme>
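
The phonetic-spelling route can be sketched as a small substitution table applied before synthesis. The names and the respelling below are hypothetical examples; tune each respelling by ear against the voice you use.

```python
import re

# Hypothetical table: word as written -> spelling the voice reads correctly
RESPELLINGS = {"Excelon": "EK-suh-lun"}


def apply_respellings(text, table):
    """Replace whole-word matches (case-insensitive) with their respelling."""
    if not table:
        return text
    # Longest keys first so a longer entry wins over a shorter prefix
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )
    lookup = {k.lower(): v for k, v in table.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)
```

Keeping the table in one place makes it easy to extend as new mispronunciations turn up, and the original text stays untouched for display or logging.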

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
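
One way to capture the details the checkpoint asks for is a small record written next to the generated audio. This is a sketch using only the Python standard library; environment_record and its field names are assumptions made for illustration.

```python
import json
import platform
import sys
from importlib import metadata


def environment_record(packages, model_path, note=""):
    """Collect interpreter, platform, and package versions for a run.

    Packages that are not installed are recorded as None instead of
    raising, so the record can still be written.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
        "model_path": str(model_path),
        "note": note,
    }


record = environment_record(
    ["kokoro-onnx", "onnxruntime", "soundfile", "numpy"],
    "kokoro-v1.0.onnx",
    note="CPU backend",
)
print(json.dumps(record, indent=2))
```

Add the observed output (for example the audio duration or the path of the generated file) to the note field before saving the record.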

EXERCISE

Generate spoken output for a paragraph of text, synthesize it with two different voice presets, and compare the prosody and clarity of the resulting audio files. (15 minutes)
