Speech-to-text and text-to-speech models that run on your own hardware — Whisper, Distil-Whisper, Parakeet, Kokoro-82M, XTTS-v2, F5-TTS, Orpheus, Piper. The catalog used to have two Whisper rows hidden in /models; this hub puts the whole audio stack in one place.
Audio is the AI category that runs best entirely local — privacy matters more for voice than almost any other modality, and the models are small enough to fit on a laptop. Whisper-base is 74M parameters and transcribes near-realtime on CPU. Kokoro-82M is the same size and synthesises speech at >10× realtime on a 4090.
Coverage today: ASR — Whisper-tiny/base/small for CPU-class deployments, Distil-Whisper-large-v3 for 6× the throughput of the original large-v3 with marginal WER cost, Parakeet-TDT-0.6B for the current open WER leader (NVIDIA). TTS — Kokoro-82M for ultra-fast English/multilingual generation, XTTS-v2 for voice cloning, F5-TTS for flow-matching speed, Orpheus for LLaMA-style controllable speech, Piper for edge-deployable per-language voices.
Each row calls out license posture explicitly. The XTTS-v2 license history is messier than most — we flag exactly what the current Coqui license permits. Whisper is MIT; Kokoro is Apache 2.0; Parakeet uses NVIDIA's permissive non-commercial-clarified license.
82M-parameter StyleTTS2-derived TTS that went viral in early 2025 for matching billion-parameter TTS quality at ~1% the size. Apache-2.0 weights, dozens of preset voice packs across English (and growing language list), a
Coqui's flagship multilingual voice-cloning TTS — clones a speaker from a 6-second reference clip and synthesizes in 17 languages with cross-lingual transfer. Released under the Coqui Public Model License (CPML), which r
74M-parameter Whisper variant — roughly 2x the params of tiny for ~25-30% relative WER reduction. The standard pick for CPU realtime transcription with acceptable quality.
244M-parameter Whisper. The smallest Whisper checkpoint considered 'production grade' for non-English audio. Sweet spot for laptops with iGPU/Metal or modest discrete GPUs.
Smallest member of the Whisper encoder-decoder ASR family (39M params). Trained on 680k hours of weakly supervised multilingual audio. Targets sub-realtime transcription on CPU and tiny edge devices; ships in transformer
756M-param distilled Whisper-large-v3 with the decoder shrunk from 32 to 2 layers. ~6.3x faster than the teacher at near-parity WER on long-form English (1% absolute gap on out-of-distribution sets per the model card).
VITS-based neural TTS optimized for Raspberry Pi-class hardware. Ships as ONNX checkpoints with ~100 voices across 30+ languages. Powers Home Assistant's local voice stack and is the de facto open TTS for embedded device
600M-parameter FastConformer-TDT transducer ASR from NVIDIA NeMo. Topped the Hugging Face Open ASR Leaderboard in 2025 for English, with WER ~6.05% averaged across the leaderboard suite. Outputs word/segment timestamps n
Flow-matching non-autoregressive TTS built on a Diffusion Transformer (DiT) backbone with ConvNeXt text refinement. Trained on the 100K-hour Emilia dataset; supports zero-shot voice cloning with strong naturalness and lo
LLaMA-architecture 3B model fine-tuned as a TTS that emits SNAC audio tokens. Designed for highly expressive, emotion-controllable speech with laughter, sighs, and other paralinguistic markers via inline tags. Apache-2.0
OpenAI's flagship open speech-to-text model. 99 languages, MIT license. The de-facto open ASR baseline.
Distilled Whisper Large v3. ~8x faster decode at near-equivalent accuracy on most languages.
Pair an ASR model with a TTS model from this hub for a fully-offline assistant. The runtime guidance per row covers FasterWhisper, WhisperX, mlx-whisper, ONNX, and CPU vs GPU latency for each. See also best GPU for Whisper.