08. TTS: Piper

Chapter 8 of 22 · 20 min

Piper provides on-device TTS optimized for Raspberry Pi and other embedded deployments. Its models are small enough to run on the CPU alone, with no GPU required and low power draw.

Installation:

pip install piper-tts numpy

Download a voice model:

# Download English US voice (the .onnx model plus its .onnx.json config; piper needs both)
wget https://github.com/rhasspy/piper/raw/master/…/en_US-Jenny_Compact.onnx
wget https://github.com/rhasspy/piper/raw/master/…/en_US-Jenny_Compact.onnx.json

Generation pipeline:

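import subprocess
import wave

text = "Piper text to speech runs efficiently on resource-constrained devices."

# --output-raw makes piper write headerless 16-bit mono PCM to stdout.
# A wave writer cannot be passed as stdout directly, so capture the bytes first.
result = subprocess.run(
    ["piper", "--model", "en_US-Jenny_Compact.onnx",
     "--output-raw"],
    input=text.encode("utf-8"),
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
    check=True,  # raise CalledProcessError if piper exits with an error
)

# Wrap the raw PCM in a WAV container.
with wave.open("output.wav", "wb") as wav:
    wav.setnchannels(1)       # mono
    wav.setsampwidth(2)       # 16-bit samples
    wav.setframerate(22050)   # must match the sample rate of the chosen voice
    wav.writeframes(result.stdout)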
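The frame rate of 22050 Hz in the pipeline above is only correct if the voice was trained at that rate. A minimal sketch for reading the rate from the voice's JSON config instead of hard-coding it follows. It assumes the config stores the value under audio.sample_rate, which should be verified against your own .onnx.json file; the helper name read_sample_rate is hypothetical.

```python
import json
from pathlib import Path


def read_sample_rate(config_path, default=22050):
    """Return the sample rate stored in a voice config, or `default` if absent.

    Assumption: the rate lives under config["audio"]["sample_rate"].
    Check this key layout against the .onnx.json file you downloaded.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return int(config.get("audio", {}).get("sample_rate", default))
```

With this helper, the frame rate line could become wav.setframerate(read_sample_rate("en_US-Jenny_Compact.onnx.json")), so that switching voices does not silently produce audio at the wrong speed.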

Batch generation for long texts:

import subprocess

sentences = [
    "First sentence of the longer response.",
    "Second sentence builds on the first.",
    "Third sentence concludes the thought."
]

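# One piper call per sentence; each call writes its own numbered WAV file.
for i, sentence in enumerate(sentences):
    subprocess.run(
        ["piper", "--model", "en_US-Jenny_Compact.onnx",
         "--output-file", f"sentence_{i}.wav"],
        input=sentence.encode("utf-8"),
        capture_output=True,
        check=True,  # raise CalledProcessError if piper exits with an error
    )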

A known limitation is the handling of extremely long inputs. To work around it, break the text into sentences, synthesize each sentence separately, and concatenate the audio segments for continuous playback. Insert a brief silence between segments for natural pacing.
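The split-and-concatenate approach can be sketched with the standard library alone. This is an illustrative sketch, not part of Piper: the helper names split_sentences and concat_wavs are hypothetical, the splitter is deliberately naive (it will break on abbreviations such as "Dr."), and the 0.25 second gap is an arbitrary starting value to tune by ear.

```python
import re
import wave


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def concat_wavs(paths, out_path, gap_seconds=0.25):
    """Join WAV files that share one format, with silence between segments."""
    params = None
    chunks = []
    for path in paths:
        with wave.open(path, "rb") as src:
            fmt = (src.getnchannels(), src.getsampwidth(), src.getframerate())
            if params is None:
                params = fmt
            elif fmt != params:
                raise ValueError(f"{path} has a different audio format: {fmt}")
            chunks.append(src.readframes(src.getnframes()))
    if params is None:
        raise ValueError("no input files given")

    channels, width, rate = params
    # Zero bytes are silence for signed 16-bit PCM, the format assumed here.
    silence = b"\x00" * (int(rate * gap_seconds) * channels * width)

    with wave.open(out_path, "wb") as dst:
        dst.setnchannels(channels)
        dst.setsampwidth(width)
        dst.setframerate(rate)
        # join() places the silence only between segments, not at the ends.
        dst.writeframes(silence.join(chunks))
```

For the batch example in this chapter, a call such as concat_wavs([f"sentence_{i}.wav" for i in range(3)], "response.wav") would produce a single file for playback.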

Piper offers dozens of voices at varying quality levels. Compact models trade some quality for a smaller file size, which suits embedded deployment. Standard and high-quality voices require more storage but produce more natural output.

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Generate speech using a Piper voice on a text input. Measure execution time and note audio quality characteristics. Profile CPU usage during generation. (10 minutes)
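One possible starting point for the timing part of the exercise is sketched below. The helper names timed_run and wav_seconds are hypothetical, and the measurement includes process startup and model loading, so short inputs will look slower than steady-state synthesis. CPU profiling is left to an external tool of your choice.

```python
import subprocess
import time
import wave


def timed_run(cmd, text):
    """Run `cmd` with `text` on stdin and return wall-clock seconds elapsed."""
    start = time.perf_counter()
    subprocess.run(cmd, input=text.encode("utf-8"),
                   capture_output=True, check=True)
    return time.perf_counter() - start


def wav_seconds(path):
    """Return the playback duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Timing the command from the batch example, for instance timed_run(["piper", "--model", "en_US-Jenny_Compact.onnx", "--output-file", "timed.wav"], text), and dividing the result by wav_seconds("timed.wav") gives a real-time factor; a value below 1 means the audio was generated faster than it plays back.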