Capability notes
OpenAI's [Whisper](/tools/whisper-cpp) family dominates open-weight STT in 2026 across six sizes: tiny (39M params), base (74M), small (244M), medium (769M), large-v2 (1.55B), and large-v3 (1.55B). Large-v3 achieves WER of 8.1% on English LibriSpeech clean — competitive with commercial APIs (Deepgram Nova-2 = 7.2%, AssemblyAI = 7.8%, Google Cloud = 8.5%). For non-English, WER varies: Spanish 5.8%, French 7.3%, German 8.1%, Japanese 9.2%, Mandarin 12.4%, Hindi 15.1%, Arabic 16.8%.
Model size reflects a quality-speed tradeoff. Tiny/base run at 200–500× real-time on CPU for simple use cases (voice commands, keyword detection) at WER 15–25%. Small/medium at 50–150× real-time on CPU cover 90% of production use cases — podcast transcription, meeting notes, voicemail-to-text — at WER 10–15%. Large-v3 at 5–15× real-time on CPU delivers best accuracy at WER 7–9% for English. The practical insight: medium.en is the sweet spot — WER within 1–2% of large-v3 for English at 3–4× the speed and half the memory.
Real-time factor (RTF) — processing time ÷ audio duration — determines viability for live transcription. RTF < 1.0 means faster than real-time. Whisper large-v3 on CPU achieves RTF ~0.07–0.2 (5–15× faster than real-time). On [GPU](/hardware/rtx-4060-ti-16gb), RTF drops to 0.01–0.02 (50–100× real-time). Live transcription with sub-second latency is achievable on any GPU-accelerated deployment.
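The arithmetic is worth pinning down. A minimal sketch — the function names are illustrative, not from any library:

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time divided by audio duration."""
    return processing_seconds / audio_seconds

def speedup(rtf_value: float) -> float:
    """How many times faster than real time (1 / RTF)."""
    return 1.0 / rtf_value

# A 60-minute recording processed in 36 seconds:
r = rtf(36.0, 3600.0)
print(r, speedup(r))  # 0.01 → 100× real-time; well under the 1.0 live threshold
```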
The diarization gap: Whisper transcribes what was said, not who. Multi-speaker audio requires a separate diarization model (pyannote.audio) chained with Whisper: diarization identifies speaker turn boundaries → Whisper transcribes each segment → post-processing assigns speaker labels. This adds 20–30% processing time and introduces diarization errors (15–18% DER on AMI corpus) that compound transcription errors.
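The chaining step — assigning speaker labels to transcribed segments — reduces to an interval-overlap match. A sketch with hypothetical tuple formats (pyannote and Whisper each emit their own richer structures):

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """Label each transcription segment with the diarization turn it overlaps most.

    segments: [(start, end, text)]; turns: [(start, end, speaker)].
    """
    labeled = []
    for s_start, s_end, text in segments:
        best = max(turns, key=lambda t: overlap(s_start, s_end, t[0], t[1]))
        speaker = best[2] if overlap(s_start, s_end, best[0], best[1]) > 0 else "UNKNOWN"
        labeled.append((speaker, text))
    return labeled

turns = [(0.0, 4.2, "SPEAKER_00"), (4.2, 9.0, "SPEAKER_01")]
segments = [(0.5, 3.8, "hello there"), (4.5, 8.1, "hi, thanks for joining")]
print(assign_speakers(segments, turns))
```

Segments straddling a turn boundary get the majority speaker — one source of the compounding attribution errors noted above.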
If you just want to try this
Lowest-friction path to a working setup.
Install whisper.cpp (C++ port of OpenAI Whisper, no Python dependencies) and download a ggml-format model (whisper.cpp uses ggml checkpoints, not the GGUF format used by llama.cpp):
```bash
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make
bash ./models/download-ggml-model.sh small.en
./main -m models/ggml-small.en.bin -f ~/recording.wav --output-txt
```
The small.en model (466 MB) transcribes at ~50–100× real-time on any modern CPU — a 30-minute recording in ~20–35 seconds. For better accuracy, step up to medium.en (1.5 GB, ~20–30× real-time on CPU).
If you have a GPU (4 GB+ VRAM), rebuild with GPU support (`GGML_CUDA=1 make` on NVIDIA; Metal is on by default for macOS builds) and run the same command — whisper.cpp offloads the whole model automatically, with no per-layer `-ngl` flag as in llama.cpp. This accelerates medium.en to ~60–100× real-time on an [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) — 30 minutes in ~20–30 seconds.
For a GUI with zero terminal: install [LM Studio](/tools/lm-studio), search "whisper" in the model browser, download whisper-large-v3 GGUF, use the built-in audio transcription panel. LM Studio accepts WAV/MP3/OGG input and displays transcription in real time.
For production deployment
Operator-grade recommendation.
Production ASR starts with a choice between streaming (live, sub-second latency) and batch (recorded, throughput-optimized) pipelines.
**Streaming ASR.** Use faster-whisper (CTranslate2 backend) with chunked audio. Pipeline: audio chunks via WebSocket → Silero VAD segments speech vs silence → 5–10 second chunks → faster-whisper transcribes → pyannote.audio diarization assigns speaker labels → merged transcription streamed. On an [RTX 4090](/hardware/rtx-4090), sub-500ms latency for 10-second chunks. One GPU handles 3–5 concurrent streams.
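The chunking step can be sketched as a greedy packer over Silero VAD's speech regions — `pack_chunks` is a hypothetical helper, not part of any of these libraries:

```python
def pack_chunks(speech_regions, min_len=5.0, max_len=10.0):
    """Greedily merge VAD speech regions [(start, end)] into transcription
    chunks of roughly min_len–max_len seconds, cutting at silence boundaries.
    A single region longer than max_len passes through unsplit."""
    chunks, cur_start, cur_end = [], None, None
    for start, end in speech_regions:
        if cur_start is None:
            cur_start, cur_end = start, end
        elif end - cur_start <= max_len:
            cur_end = end                       # extend the current chunk
        else:
            chunks.append((cur_start, cur_end))  # would overflow: flush first
            cur_start, cur_end = start, end
        if cur_end - cur_start >= min_len:       # long enough: ship it
            chunks.append((cur_start, cur_end))
            cur_start = cur_end = None
    if cur_start is not None:
        chunks.append((cur_start, cur_end))      # trailing partial chunk
    return chunks

print(pack_chunks([(0.0, 2.0), (2.5, 4.0), (4.5, 7.0), (8.0, 9.0)]))
# [(0.0, 7.0), (8.0, 9.0)]
```

Cutting at VAD boundaries rather than fixed offsets avoids splitting words mid-utterance, which is where chunked Whisper loses the most accuracy.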
**Batch ASR.** whisper.cpp with full GPU offload. Single [RTX 4090](/hardware/rtx-4090): ~40 hours of audio/hour at large-v3 quality — an effective RTF of ~0.025 after decode and I/O overhead. For high-volume (call centers, podcast archives), deploy multiple workers behind a job queue. A 4× [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) server (~$2,000): ~150–200 hours/hour at large-v3 — a 1,000-hour archive in ~5–7 hours.
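Fleet sizing reduces to one line of arithmetic; the utilization discount here is an assumption covering queue and I/O overhead:

```python
def archive_hours(archive_audio_hours, workers, speedup_per_worker, utilization=0.85):
    """Wall-clock hours to transcribe an archive on a worker fleet.

    speedup_per_worker: multiples of real time per GPU worker.
    utilization: fraction of time workers actually spend transcribing.
    """
    throughput = workers * speedup_per_worker * utilization  # audio-hours per hour
    return archive_audio_hours / throughput

# 1,000-hour archive, 4 workers at ~60x real-time each:
print(round(archive_hours(1000.0, 4, 60.0), 1))
```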
**Multi-speaker handling.** Production diarization: Silero VAD → pyannote.audio 3.1 assigns speaker labels → faster-whisper transcribes each labeled segment → merge into unified transcript. DER: 14–18% on AMI meetings, 5–10% on clean single-mic audio. For 2–3 speakers: ~85–90% speaker attribution accuracy. For 5+ speakers: 60–75%.
**API vs self-host.** APIs charge $0.003–0.010/audio minute. At 10,000 minutes/month, API is $30–100. At 1,000,000+ minutes/month, self-hosting on 4× GPU servers (~$400–600/month) saves 85–95%. Self-hosting also eliminates audio data egress for compliance. Add language model rescoring post-Whisper: KenLM or GPT-based LM scores hypotheses for linguistic plausibility, reducing WER by 5–15% on domain-specific audio.
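The breakeven point falls directly out of the two cost curves above — the rates in the example are mid-range assumptions:

```python
def breakeven_minutes(api_rate_per_min: float, selfhost_monthly: float) -> float:
    """Monthly audio volume above which self-hosting beats the API on cost
    (ignoring engineering time, which shifts the threshold upward)."""
    return selfhost_monthly / api_rate_per_min

# Mid-range figures from above: $0.006/min API, ~$500/month for a 4x GPU server.
print(int(breakeven_minutes(0.006, 500.0)))  # ~83,000 minutes/month (~1,400 audio-hours)
```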
What breaks
Failure modes operators see in the wild.
**Code-switching degradation.** Symptom: WER jumps from 8% to 15–25% when speakers mix languages mid-sentence. Cause: Whisper's decoder tokenizes into language-specific tokens — mid-utterance shifts cause prediction oscillation. Mitigation: pre-process with language ID, split at language-switch boundaries, transcribe each with the corresponding model. Fine-tune on bilingual data — reduces WER by 15–30%.
**Accent bias.** Symptom: WER 3–5× higher for non-General American English. Large-v3 WER by accent: General American 7.5%, Indian English 16.4%, Scottish 18.2%. Cause: 65% training audio is American/British despite <20% of global English speakers. Mitigation: fine-tune on accent-specific datasets (Common Voice), deploy accent-specific variants by user locale.
**Background noise failure.** Symptom: WER increases 2–4× at SNR below 15 dB; >50% WER below 5 dB. Cause: Whisper trained on clean speech — fails below 10 dB SNR. Mitigation: pre-process with noise suppression (RNNoise, DeepFilterNet — ~1–3 MB). Reduces WER at 10 dB SNR from 30% to 15%. For consistently noisy environments, multi-mic beamforming improves SNR by 8–12 dB.
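SNR-gated preprocessing is simple to wire up; a sketch taking RMS amplitudes as inputs, with the 15 dB gate matching the figures above (`needs_denoise` is a hypothetical helper):

```python
import math

def snr_db(signal_rms: float, noise_rms: float) -> float:
    """Signal-to-noise ratio in decibels from RMS amplitudes."""
    return 20.0 * math.log10(signal_rms / noise_rms)

def needs_denoise(signal_rms: float, noise_rms: float, threshold_db: float = 15.0) -> bool:
    """Route audio through noise suppression when SNR falls below the threshold."""
    return snr_db(signal_rms, noise_rms) < threshold_db

print(needs_denoise(2.0, 1.0))    # ~6 dB: denoise before Whisper
print(needs_denoise(100.0, 1.0))  # 40 dB: pass through clean
```

Noise RMS is typically estimated from VAD-labeled silence regions, so the same VAD pass that gates hallucinations can feed this check.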
**Long-audio drift.** Symptom: hour 1 at 8% WER, hour 3 at 12%, hour 5 at 18%. Cause: cross-attention context extends ~30 seconds — domain terms from earlier audio are unavailable. Mitigation: consistent 10-second VAD segments with 1-second overlap, extract domain vocabulary from first 10 minutes as prompt prefix, reset context every 60 minutes.
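The overlapped-window mitigation can be sketched as a simple generator over the audio duration — 10-second windows, each starting 1 second before the previous one ends:

```python
def windows(total_seconds, win=10.0, overlap=1.0):
    """Fixed-length windows with overlap: each starts (win - overlap)
    after the previous, so boundary words appear in two windows."""
    step = win - overlap
    out, start = [], 0.0
    while start < total_seconds:
        out.append((start, min(start + win, total_seconds)))
        if start + win >= total_seconds:
            break
        start += step
    return out

print(windows(25.0))  # [(0.0, 10.0), (9.0, 19.0), (18.0, 25.0)]
```

Deduplicating the 1-second overlap regions (e.g. by word timestamps) is left to post-processing.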
**Hallucinated text in silence.** Symptom: coherent sentences transcribed from near-silent audio. Cause: the decoder's language prior dominates when the acoustic signal is weak. Mitigation: use Silero VAD before Whisper, require speech probability >= 0.5, and drop segments whose average log-probability falls below -1.0. Hallucinated segments typically score below -1.0; genuine speech lands around -0.3 to 0.
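Both gates combine into a one-line filter; the segment tuple format here is illustrative (real runtimes expose these fields under names like `avg_logprob`):

```python
def filter_hallucinations(segments, min_speech_prob=0.5, min_avg_logprob=-1.0):
    """Drop segments that VAD marks as non-speech or that the decoder
    scored as implausible. segments: [(text, speech_prob, avg_logprob)]."""
    return [
        text for text, speech_prob, avg_logprob in segments
        if speech_prob >= min_speech_prob and avg_logprob >= min_avg_logprob
    ]

segments = [
    ("thanks for watching", 0.12, -1.4),   # classic near-silence hallucination
    ("the quarterly numbers", 0.93, -0.21),
]
print(filter_hallucinations(segments))  # ['the quarterly numbers']
```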
Hardware guidance
Whisper transcription is CPU-viable for all model sizes. The question is throughput, not capability.
**CPU-only tier ($0).** Large-v3 on modern desktop CPU (Apple M4, Ryzen 9 7950X, Core Ultra 9) at 5–10× real-time — 1-hour in 6–12 minutes. Medium.en: 20–30× real-time. Small.en: 50–100× real-time. CPU is practical under 10 hours/day. [Apple M4 Pro](/hardware/apple-m4-pro) and [MacBook Pro 16" M4 Max](/hardware/macbook-pro-16-m4-max) benefit from Neural Engine acceleration — 15–20× real-time, ~2× faster than x86.
**Entry GPU tier ($300–600).** Any GPU with 4 GB+ VRAM accelerates 5–10× over CPU. [RTX 3060 12GB](/hardware/rtx-3060-12gb) runs large-v3 at 50–80× real-time — 1-hour in 45–75 seconds. [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) at 60–100× real-time. [Intel Arc B580](/hardware/intel-arc-b580) via SYCL: 30–50× real-time.
**SMB tier ($1,500–2,500).** [RTX 4090](/hardware/rtx-4090) at 24 GB: 100–150× real-time — 1-hour in ~25 seconds. [RTX 5070](/hardware/rtx-5070) at 12 GB: $/throughput pick at 80–120× real-time. [Apple M3 Ultra](/hardware/apple-m3-ultra): 5–8 concurrent large-v3 streams.
**Enterprise tier ($8,000+).** Enterprise GPUs are over-provisioned for ASR. For 100+ concurrent streams, scale out with many small GPUs: 8× [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) at ~$4,000 delivers far more concurrent streams per dollar than a single [H100 PCIe](/hardware/nvidia-h100-pcie) at ~$25,000. Horizontal beats vertical.
**Memory scaling.** Model VRAM: tiny = 150 MB, medium = 3.1 GB, large-v3 = 6.2 GB (FP16). Any 8 GB+ GPU runs the largest Whisper model. Apple Neural Engine via CoreML reduces large-v3 to ~4 GB — any M-series Mac runs on-device near-real-time.
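A model picker from the VRAM figures above — the 0.5 GB headroom for activation buffers is an assumption, and only the sizes quoted in the text are listed:

```python
MODEL_VRAM_GB = {"tiny": 0.15, "medium": 3.1, "large-v3": 6.2}  # FP16 figures from above

def largest_model_that_fits(vram_gb, headroom_gb=0.5):
    """Pick the largest Whisper model that fits, leaving headroom
    for activations and audio buffers. Returns None if nothing fits."""
    fitting = [m for m, need in MODEL_VRAM_GB.items() if need + headroom_gb <= vram_gb]
    return max(fitting, key=lambda m: MODEL_VRAM_GB[m]) if fitting else None

print(largest_model_that_fits(8.0))  # large-v3 — any 8 GB+ GPU runs the largest model
print(largest_model_that_fits(4.0))  # medium
```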
Runtime guidance
**whisper.cpp vs faster-whisper vs WhisperX — speed and feature tradeoffs.**
whisper.cpp is a C/C++ port with no Python, no PyTorch, no CUDA toolkit. It runs on CPU with SIMD (AVX2/NEON) and GPU offload via CUDA/Metal/Vulkan/SYCL/ROCm through one binary. Performance: large-v3 on [RTX 4090](/hardware/rtx-4090) = ~100× real-time. On CPU = ~8× real-time. Tradeoffs: no streaming API, no diarization, no VAD — raw transcription. Use for batch processing with minimal setup.
faster-whisper wraps CTranslate2 for INT8-quantized inference, outperforming whisper.cpp by 15–25% on GPU and 30–50% on CPU. Large-v3 on [RTX 4090](/hardware/rtx-4090) = ~150× real-time. Supports streaming via chunked input with word-level timestamps for real-time captioning. Tradeoffs: Python dependency, 5–10 minute install, 50–100ms/request overhead vs whisper.cpp's sub-10ms startup.
WhisperX extends faster-whisper with forced phoneme alignment (wav2vec2) for word-level timestamps and optional speaker diarization (pyannote.audio). Outputs each word with start/end time and speaker label — the complete production data structure. On [RTX 4090](/hardware/rtx-4090): ~80× real-time for full pipeline. Tradeoffs: phoneme alignment adds 300–500 MB VRAM and 20–30% time; diarization adds ~100 MB and 15–25%.
**Decision tree.** Batch transcription: whisper.cpp → medium.en ggml model → `./main -m models/ggml-medium.en.bin -f audio.wav`. Live/streaming: faster-whisper → chunked WebSocket input → word timestamps → real-time captions. Full meeting transcripts: WhisperX → transcription + diarization → per-speaker segments. Batch throughput: whisper.cpp with full GPU offload behind a Redis job queue.
All three output standard formats (SRT/VTT/JSON). whisper.cpp via `--output-srt`; faster-whisper via Python JSON serialization; WhisperX via JSONL with word timestamps and speaker labels. Choose based on downstream format requirements.
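For downstream consumers, segment-to-SRT conversion is mechanical; a sketch where the segment tuples stand in for whatever structure your runtime emits:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm (comma, per the SRT format)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """segments: [(start_sec, end_sec, text)] -> SRT document string."""
    blocks = [
        f"{i}\n{srt_timestamp(a)} --> {srt_timestamp(b)}\n{text}\n"
        for i, (a, b, text) in enumerate(segments, 1)
    ]
    return "\n".join(blocks)

print(to_srt([(0.0, 2.5, "Hello."), (2.5, 5.04, "Welcome back.")]))
```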