RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts: there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend.

© 2026 runlocalai.co · Independently operated
BLK · AUDIO
ASR · transcription · TTS · voice

Local audio models

Speech-to-text and text-to-speech models that run on your own hardware — Whisper, Distil-Whisper, Parakeet, Kokoro-82M, XTTS-v2, F5-TTS, Orpheus, Piper. The catalog used to have two Whisper rows hidden in /models; this hub puts the whole audio stack in one place.

Models curated: 12
Vendors: 8
Commercial OK: 10/12
Benchmarked: 0/12

Audio is the AI category that runs best entirely local: privacy matters more for voice than for almost any other modality, and the models are small enough to fit on a laptop. Whisper-base has 74M parameters and transcribes in near-realtime on CPU. Kokoro-82M is roughly the same size (82M parameters) and synthesises speech at >10× realtime on a 4090.
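The "realtime" figures above are ratios of audio duration to processing time. A minimal sketch of that arithmetic follows; the helper name `speed_vs_realtime` and the timings are hypothetical illustrations, not measurements from this catalog (no rows are benchmarked yet).

```python
def speed_vs_realtime(audio_seconds: float, processing_seconds: float) -> float:
    """Return seconds of audio handled per second of compute.

    1.0 means exactly realtime; 10.0 means 10x faster than realtime.
    """
    if audio_seconds <= 0 or processing_seconds <= 0:
        raise ValueError("durations must be positive")
    return audio_seconds / processing_seconds


# Illustrative numbers only (assumed, not benchmark results).
asr_speed = speed_vs_realtime(audio_seconds=60.0, processing_seconds=50.0)
tts_speed = speed_vs_realtime(audio_seconds=30.0, processing_seconds=2.5)

print(f"ASR: {asr_speed:.1f}x realtime")
print(f"TTS: {tts_speed:.1f}x realtime")
```

Anything at or above 1.0 keeps up with live audio; the same ratio works for transcription (audio consumed) and synthesis (audio produced).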

Coverage today:
  • ASR: Whisper-tiny/base/small for CPU-class deployments; Distil-Whisper-large-v3 for roughly 6× the throughput of the original large-v3 at a marginal WER cost; Parakeet-TDT-0.6B (NVIDIA), the current open WER leader for English.
  • TTS: Kokoro-82M for ultra-fast English/multilingual generation; XTTS-v2 for voice cloning; F5-TTS for flow-matching speed; Orpheus for LLaMA-style controllable speech; Piper for edge-deployable per-language voices.
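WER (word error rate) is the metric behind the accuracy comparisons on this page. As a rough sketch, it can be computed as the word-level edit distance divided by the number of reference words. The function names and the 10% / 11% figures below are assumptions for illustration, not numbers from any model card.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Two-row dynamic programming over the edit-distance table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def wer_gap(baseline_wer: float, candidate_wer: float) -> tuple[float, float]:
    """Return (absolute gap, relative gap) of candidate versus baseline."""
    absolute = candidate_wer - baseline_wer
    return absolute, absolute / baseline_wer


wer = word_error_rate(
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumped over lazy dog",
)
print(f"WER: {wer:.3f}")  # one substitution plus one deletion over 9 words

# Hypothetical: a baseline at 10% WER and a faster model at 11% WER.
absolute, relative = wer_gap(0.10, 0.11)
print(f"absolute gap: {absolute:.3f}, relative gap: {relative:.0%}")
```

The distinction matters when reading the rows: a 1-point absolute gap on a 10% baseline is a 10% relative change, so "absolute" and "relative" WER claims are not interchangeable.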

Each row calls out license posture explicitly. The XTTS-v2 license history is messier than most, so we flag exactly what the current Coqui license permits. Whisper is MIT; Kokoro is Apache 2.0; Parakeet TDT 0.6B v2 is CC-BY-4.0.
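One way to picture the license posture is as a small lookup over the rows. The sketch below copies the license strings shown on this hub's rows; the `AudioModel` type, the `"cpml"` shorthand for the Coqui Public Model License, and the `COMMERCIAL_OK` set are assumptions made for illustration, mirroring which rows carry the "OK" marker. It is not the catalog's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioModel:
    name: str
    license_id: str


# License strings as displayed on the rows of this hub.
CATALOG = [
    AudioModel("Kokoro 82M", "apache-2.0"),
    AudioModel("XTTS v2", "cpml"),  # shorthand (assumed) for the Coqui license
    AudioModel("Whisper Base", "apache-2.0"),
    AudioModel("Whisper Small", "apache-2.0"),
    AudioModel("Whisper Tiny", "apache-2.0"),
    AudioModel("Distil-Whisper Large v3", "mit"),
    AudioModel("Piper", "mit"),
    AudioModel("Parakeet TDT 0.6B v2", "cc-by-4.0"),
    AudioModel("F5-TTS", "cc-by-nc-4.0"),
    AudioModel("Orpheus 3B 0.1 FT", "apache-2.0"),
    AudioModel("Whisper Large v3", "mit"),
    AudioModel("Whisper Large v3 Turbo", "mit"),
]

# Assumption: these license families are the ones flagged "OK" on the rows.
COMMERCIAL_OK = {"apache-2.0", "mit", "cc-by-4.0"}


def is_commercial_ok(model: AudioModel) -> bool:
    return model.license_id.lower() in COMMERCIAL_OK


ok = [m.name for m in CATALOG if is_commercial_ok(m)]
restricted = [m.name for m in CATALOG if not is_commercial_ok(m)]

print(f"Commercial OK: {len(ok)}/{len(CATALOG)}")
print("Needs a closer read:", ", ".join(restricted))
```

A set-membership check like this is only a first filter; rows with restricted or custom licenses still need the license text itself.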

FAM · OTHER

Other / from-scratch

12 models
Kokoro 82M
82M params · Hexgrad
▸ Realtime English TTS on CPU for chatbots, audiobooks, and accessibility — best open license/size/quality combo in 2025-2026

82M-parameter StyleTTS2-derived TTS that went viral in early 2025 for matching billion-parameter TTS quality at ~1% the size. Apache-2.0 weights, dozens of preset voice packs across English (and growing language list), a

License
apache-2.0 · OK
Context
—
XTTS v2
460M params · Coqui
▸ Multilingual voice cloning from a short reference clip for personal or research use

Coqui's flagship multilingual voice-cloning TTS — clones a speaker from a 6-second reference clip and synthesizes in 17 languages with cross-lingual transfer. Released under the Coqui Public Model License (CPML), which r

License
Coqui Public Model License (CPML)
Context
—
Whisper Base
74M params · OpenAI
▸ CPU-side meeting and voice-note transcription with multilingual coverage

74M-parameter Whisper variant — roughly 2x the params of tiny for ~25-30% relative WER reduction. The standard pick for CPU realtime transcription with acceptable quality.

License
apache-2.0 · OK
Context
0K
Whisper Small
244M params · OpenAI
▸ Multilingual transcription on consumer laptops where Apple Silicon or a small GPU is available

244M-parameter Whisper. The smallest Whisper checkpoint considered 'production grade' for non-English audio. Sweet spot for laptops with iGPU/Metal or modest discrete GPUs.

License
apache-2.0 · OK
Context
0K
Whisper Tiny
39M params · OpenAI
▸ Real-time English transcription on CPU-only edge devices and mobile

Smallest member of the Whisper encoder-decoder ASR family (39M params). Trained on 680k hours of weakly supervised multilingual audio. Targets sub-realtime transcription on CPU and tiny edge devices; ships in transformer

License
apache-2.0 · OK
Context
0K
Distil-Whisper Large v3
756M params · Hugging Face / Distil-Whisper
▸ High-throughput English transcription pipelines (podcasts, call center, batch ASR) on a single consumer GPU

756M-param distilled Whisper-large-v3 with the decoder shrunk from 32 to 2 layers. ~6.3x faster than the teacher at near-parity WER on long-form English (about a 1% absolute WER gap on out-of-distribution sets, per the model card).

License
mit · OK
Context
0K
Piper
25M params · Rhasspy / Mike Hansen
▸ Offline, on-device TTS for smart home, accessibility, and embedded Linux with strict CPU/RAM budgets

VITS-based neural TTS optimized for Raspberry Pi-class hardware. Ships as ONNX checkpoints with ~100 voices across 30+ languages. Powers Home Assistant's local voice stack and is the de facto open TTS for embedded device

License
mit · OK
Context
—
Parakeet TDT 0.6B v2
600M params · NVIDIA
▸ Best-in-class English transcription throughput on NVIDIA GPUs with long-form support

600M-parameter FastConformer-TDT transducer ASR from NVIDIA NeMo. Topped the Hugging Face Open ASR Leaderboard in 2025 for English, with WER ~6.05% averaged across the leaderboard suite. Outputs word/segment timestamps n

License
cc-by-4.0 · OK
Context
—
F5-TTS
336M params · SWivid (Shanghai Jiao Tong)
▸ Research-grade zero-shot voice cloning with state-of-the-art naturalness, Mandarin or English

Flow-matching non-autoregressive TTS built on a Diffusion Transformer (DiT) backbone with ConvNeXt text refinement. Trained on the 100K-hour Emilia dataset; supports zero-shot voice cloning with strong naturalness and lo

License
cc-by-nc-4.0
Context
—
Orpheus 3B 0.1 FT
3B params · Canopy Labs
▸ Expressive, emotion-rich English TTS for agents, NPCs, and audiobooks on a consumer GPU

LLaMA-architecture 3B model fine-tuned as a TTS that emits SNAC audio tokens. Designed for highly expressive, emotion-controllable speech with laughter, sighs, and other paralinguistic markers via inline tags. Apache-2.0

License
apache-2.0 · OK
Context
—
Whisper Large v3
1.55B params · OpenAI
▸ Open speech-to-text baseline

OpenAI's flagship open speech-to-text model. 99 languages, MIT license. The de-facto open ASR baseline.

License
MIT · OK
Context
—
Whisper Large v3 Turbo
810M params · OpenAI
▸ Real-time / batch transcription

Pruned and fine-tuned Whisper Large v3, with the decoder cut from 32 to 4 layers. ~8x faster decode at near-equivalent accuracy on most languages.

License
MIT · OK
Context
—
COVERAGE

Building a local voice pipeline?

Pair an ASR model with a TTS model from this hub for a fully offline assistant. The runtime guidance on each row covers FasterWhisper, WhisperX, mlx-whisper, and ONNX, plus CPU vs GPU latency. See also: best GPU for Whisper.
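The pairing can be sketched as three swappable stages: transcribe, respond, synthesize. Everything below is a hypothetical skeleton (the `VoicePipeline` class and the stand-in stage functions are not part of any library named on this page), using fake stages so it runs without a model installed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    transcribe: Callable[[bytes], str]  # ASR stage: audio in, text out
    respond: Callable[[str], str]       # app logic or a local LLM
    synthesize: Callable[[str], bytes]  # TTS stage: text in, audio out

    def run_turn(self, audio_in: bytes) -> bytes:
        text = self.transcribe(audio_in).strip()
        if not text:
            return b""  # nothing recognised, so stay silent
        return self.synthesize(self.respond(text))


# Stand-in stages (assumptions): they treat bytes as UTF-8 text
# instead of real audio, purely to make the sketch executable.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def echo_reply(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_asr, echo_reply, fake_tts)
reply_audio = pipeline.run_turn(b"turn on the lights")
print(reply_audio.decode("utf-8"))
```

In a real setup, `transcribe` would wrap whichever ASR runtime you pick and `synthesize` would wrap the TTS model; keeping the stages as plain callables lets you swap either one without touching the rest.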