Local audio models

Audio is the AI category that runs best entirely locally: privacy matters more for voice than for almost any other modality, and the models are small enough to fit on a laptop. Whisper-base has 74M parameters and transcribes in near-realtime on CPU. Kokoro-82M is in the same size class and synthesises speech at >10× realtime on an RTX 4090.
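To make the "realtime" figures concrete: a speedup of 10× realtime means ten seconds of audio are processed per second of compute. The sketch below shows that arithmetic in Python. The function names and the `run` callback are hypothetical stand-ins for whatever model wrapper you use, and the numbers in the example are illustrative, not measurements.

```python
import time
from typing import Callable


def realtime_speedup(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio handled per second of compute (1.0 == exactly realtime)."""
    if audio_seconds <= 0 or processing_seconds <= 0:
        raise ValueError("durations must be positive")
    return audio_seconds / processing_seconds


def measure_speedup(run: Callable[[], None], audio_seconds: float) -> float:
    """Time a single call to `run` (a hypothetical wrapper around a model call)
    and convert the elapsed wall-clock time into a realtime speedup."""
    start = time.perf_counter()
    run()
    elapsed = time.perf_counter() - start
    return realtime_speedup(audio_seconds, elapsed)


# Illustrative only: a 30 s clip handled in 2.5 s is 12x realtime.
print(realtime_speedup(30.0, 2.5))  # 12.0
```

Wall-clock timing on a single run is noisy; averaging over several clips, and excluding model load time, gives a steadier number.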

Coverage today:
ASR: Whisper-tiny/base/small for CPU-class deployments; Distil-Whisper-large-v3 for about 6× the throughput of the original large-v3 at a marginal WER cost; Parakeet-TDT-0.6B (NVIDIA), the current open WER leader.
TTS: Kokoro-82M for ultra-fast English/multilingual generation; XTTS-v2 for voice cloning; F5-TTS for flow-matching speed; Orpheus for controllable speech from a Llama-style model; Piper for edge-deployable per-language voices.
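The ASR comparisons above are stated in terms of WER (word error rate): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. A minimal Python sketch of that calculation follows. It assumes whitespace tokenisation and lowercasing only; published benchmarks usually apply a fuller text normaliser, so numbers from this function will not match leaderboard figures exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (rolling-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sat on a mat"), 3))  # 0.167
```

WER can exceed 1.0 when the hypothesis contains many inserted words, which is why it is reported as a rate rather than an accuracy.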

Each row calls out license posture explicitly. The XTTS-v2 license history is messier than most, so we flag exactly what the current Coqui license permits. Whisper is MIT; Kokoro is Apache 2.0; Parakeet-TDT-0.6B is released under CC BY 4.0, per its model card.

Building a local voice pipeline?