
Speech-to-Text (STT)

Transcribing spoken audio into text. The Whisper family is the open-weight default; faster-whisper and WhisperX deliver production-grade speed.

Capability notes

OpenAI's [Whisper](/tools/whisper-cpp) family dominates open-weight STT in 2026 across six sizes: tiny (39M params), base (74M), small (244M), medium (769M), large-v2 (1.55B), and large-v3 (1.55B). Large-v3 achieves WER of 8.1% on English LibriSpeech clean — competitive with commercial APIs (Deepgram Nova-2 = 7.2%, AssemblyAI = 7.8%, Google Cloud = 8.5%). For non-English, WER varies: Spanish 5.8%, French 7.3%, German 8.1%, Japanese 9.2%, Mandarin 12.4%, Hindi 15.1%, Arabic 16.8%.

Model size reflects a quality-speed tradeoff. Tiny/base run at 200–500× real-time on CPU for simple use cases (voice commands, keyword detection) at WER 15–25%. Small/medium at 50–150× real-time on CPU cover 90% of production use cases — podcast transcription, meeting notes, voicemail-to-text — at WER 10–15%. Large-v3 at 5–15× real-time on CPU delivers best accuracy at WER 7–9% for English. The practical insight: medium.en is the sweet spot — WER within 1–2% of large-v3 for English at 3–4× the speed and half the memory.

Real-time factor (RTF) — processing time ÷ audio duration — determines viability for live transcription. RTF < 1.0 means faster than real-time. Whisper large-v3 on CPU achieves RTF ~0.05–0.1 (10–20× faster than real-time). On [GPU](/hardware/rtx-4060-ti-16gb), RTF drops to 0.01–0.02 (50–100× real-time). Live transcription with sub-second latency is achievable on any GPU-accelerated deployment.

The diarization gap: Whisper transcribes what was said, not who. Multi-speaker audio requires a separate diarization model (pyannote.audio) chained with Whisper: diarization identifies speaker turn boundaries → Whisper transcribes each segment → post-processing assigns speaker labels. This adds 20–30% processing time and introduces diarization errors (15–18% DER on AMI corpus) that compound transcription errors.
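To make the RTF arithmetic above concrete, here is a minimal measurement sketch using faster-whisper (covered in detail below). The file name and model choice are placeholders; the standard-library wave module is only used to read the clip's duration, so the input is assumed to be a WAV file.

```python
import time
import wave

from faster_whisper import WhisperModel

AUDIO = "meeting.wav"  # placeholder: any local WAV clip

# Audio duration straight from the WAV header
with wave.open(AUDIO) as wav:
    audio_seconds = wav.getnframes() / wav.getframerate()

model = WhisperModel("medium.en", device="cpu", compute_type="int8")

start = time.perf_counter()
segments, _ = model.transcribe(AUDIO)
# Segments are generated lazily; consuming the iterator forces the full decode
text = " ".join(seg.text for seg in segments)
elapsed = time.perf_counter() - start

rtf = elapsed / audio_seconds  # RTF < 1.0 means faster than real time
print(f"RTF {rtf:.3f} ({audio_seconds / elapsed:.0f}x real-time)")
```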

If you just want to try this

Lowest-friction path to a working setup.

Install whisper.cpp (a C++ port of OpenAI Whisper with no Python dependencies) and download a ggml model:

```bash
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make
bash ./models/download-ggml-model.sh small.en
./main -m models/ggml-small.en.bin -f ~/recording.wav --output-txt
```

The small.en model (466 MB) transcribes at ~200× real-time on any modern CPU — a 30-minute recording in ~9 seconds. For better accuracy, step up to medium.en (1.5 GB, ~100× real-time on CPU). If you have a GPU (4 GB+ VRAM), offload all layers: `./main -m models/ggml-medium.en.bin -f ~/recording.wav -ngl 99`. This accelerates to ~500–800× real-time on an [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) — 30 minutes in 2–4 seconds.

For a GUI with no terminal required: install [LM Studio](/tools/lm-studio), search "whisper" in the model browser, download a whisper-large-v3 GGUF, and use the built-in audio transcription panel. LM Studio accepts WAV/MP3/OGG input and displays the transcription in real time.

For production deployment

Operator-grade recommendation.

Production ASR requires choosing streaming (live, sub-second latency) vs batch (recorded, throughput-optimized).

**Streaming ASR.** Use faster-whisper (CTranslate2 backend) with chunked audio. Pipeline: audio chunks via WebSocket → Silero VAD segments speech vs silence → 5–10 second chunks → faster-whisper transcribes → pyannote.audio diarization assigns speaker labels → merged transcription streamed. On an [RTX 4090](/hardware/rtx-4090), sub-500ms latency for 10-second chunks. One GPU handles 3–5 concurrent streams.

**Batch ASR.** whisper.cpp with full GPU offload. Single [RTX 4090](/hardware/rtx-4090): ~40 hours of audio/hour at large-v3 quality — RTF ~0.025. For high-volume (call centers, podcast archives), deploy multiple workers behind a job queue. A 4× [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) server (~$2,000): ~60 hours/hour at large-v3 — 1,000-hour archive in ~17 hours.

**Multi-speaker handling.** Production diarization: Silero VAD → pyannote.audio 3.1 assigns speaker labels → faster-whisper transcribes each labeled segment → merge into unified transcript. DER: 14–18% on AMI meetings, 5–10% on clean single-mic audio. For 2–3 speakers: ~85–90% speaker attribution accuracy. For 5+ speakers: 60–75%.

**API vs self-host.** APIs charge $0.003–0.010/audio minute. At 10,000 minutes/month, API is $30–100. At 1,000,000+ minutes/month, self-hosting on 4× GPU servers (~$400–600/month) saves 85–95%. Self-hosting also eliminates audio data egress for compliance. Add language model rescoring post-Whisper: KenLM or GPT-based LM scores hypotheses for linguistic plausibility, reducing WER by 5–15% on domain-specific audio.
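The batch multi-speaker flow described above, sketched end to end: pyannote.audio produces speaker turns, faster-whisper transcribes with its built-in Silero VAD filter, and a simple overlap heuristic assigns each segment to a speaker. The model names, the Hugging Face token placeholder, and the merge heuristic are illustrative assumptions, not a hardened pipeline.

```python
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline

AUDIO = "meeting.wav"  # placeholder path

# 1. Diarization: who spoke when (pyannote needs a Hugging Face access token)
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="hf_..."  # placeholder token
)
turns = [
    (turn.start, turn.end, speaker)
    for turn, _, speaker in diarizer(AUDIO).itertracks(yield_label=True)
]

# 2. Transcription: the Silero VAD filter drops silence before decoding
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, _ = model.transcribe(AUDIO, vad_filter=True)

def speaker_for(start: float, end: float) -> str:
    # Assign the speaker whose diarization turn overlaps this segment the most
    best, best_overlap = "unknown", 0.0
    for t_start, t_end, speaker in turns:
        overlap = min(end, t_end) - max(start, t_start)
        if overlap > best_overlap:
            best, best_overlap = speaker, overlap
    return best

# 3. Merge into a single speaker-labelled transcript
for seg in segments:
    print(f"[{seg.start:7.1f}s] {speaker_for(seg.start, seg.end)}: {seg.text.strip()}")
```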

What breaks

Failure modes operators see in the wild.

**Code-switching degradation.** Symptom: WER jumps from 8% to 15–25% when speakers mix languages mid-sentence. Cause: Whisper's decoder conditions on a single language token per segment — mid-utterance switches cause prediction oscillation. Mitigation: pre-process with language ID, split at language-switch boundaries, and transcribe each span with the corresponding model. Fine-tuning on bilingual data reduces WER by 15–30%.

**Accent bias.** Symptom: WER is 2–3× higher for non-General American English. Large-v3 WER by accent: General American 7.5%, Indian English 16.4%, Scottish 18.2%. Cause: roughly 65% of training audio is American or British English, although those accents account for under 20% of global English speakers. Mitigation: fine-tune on accent-specific datasets (Common Voice), deploy accent-specific variants by user locale.

**Background noise failure.** Symptom: WER increases 2–4× at SNR below 15 dB; >50% WER below 5 dB. Cause: Whisper's training audio skews clean — accuracy collapses below 10 dB SNR. Mitigation: pre-process with noise suppression (RNNoise, DeepFilterNet — ~1–3 MB models). This reduces WER at 10 dB SNR from 30% to 15%. For consistently noisy environments, multi-mic beamforming improves SNR by 8–12 dB.

**Long-audio drift.** Symptom: hour 1 at 8% WER, hour 3 at 12%, hour 5 at 18%. Cause: cross-attention context extends only ~30 seconds — domain terms from earlier audio are unavailable. Mitigation: consistent 10-second VAD segments with 1-second overlap, extract domain vocabulary from the first 10 minutes as a prompt prefix, reset context every 60 minutes.

**Hallucinated text in silence.** Symptom: coherent sentences transcribed from near-silent audio. Cause: the decoder's language prior dominates when the acoustic signal is weak. Mitigation: run Silero VAD before Whisper, require speech probability >= 0.5, and drop segments whose average log-probability falls below -1.0; genuine speech typically scores between 0 and -0.3, while hallucinated segments score noticeably lower.
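A minimal sketch of the silence-hallucination mitigation above, assuming faster-whisper (whose built-in vad_filter runs Silero VAD). The probability and log-probability cutoffs are starting points to tune, not fixed constants.

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# vad_filter runs Silero VAD before decoding, so near-silent stretches never reach Whisper
segments, _ = model.transcribe(
    "noisy_call.wav",                    # placeholder path
    vad_filter=True,
    vad_parameters={"threshold": 0.5},   # minimum speech probability for VAD
)

kept = []
for seg in segments:
    # Drop likely hallucinations: weak acoustic evidence or an implausible decode score
    # (0.6 and -1.0 are illustrative cutoffs; tune on your own audio)
    if seg.no_speech_prob > 0.6 or seg.avg_logprob < -1.0:
        continue
    kept.append(seg)

print("\n".join(f"[{s.start:.1f}s] {s.text.strip()}" for s in kept))
```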

Hardware guidance

Whisper transcription is CPU-viable for all model sizes. The question is throughput, not capability.

**CPU-only tier ($0).** Large-v3 on a modern desktop CPU (Apple M4, Ryzen 9 7950X, Core Ultra 9) at 5–10× real-time — 1 hour of audio in 6–12 minutes. Medium.en: 20–30× real-time. Small.en: 50–100× real-time. CPU-only is practical below roughly 10 hours of audio per day. [Apple M4 Pro](/hardware/apple-m4-pro) and [MacBook Pro 16" M4 Max](/hardware/macbook-pro-16-m4-max) benefit from Neural Engine acceleration — 15–20× real-time, ~2× faster than x86.

**Entry GPU tier ($300–600).** Any GPU with 4 GB+ VRAM accelerates 5–10× over CPU. [RTX 3060 12GB](/hardware/rtx-3060-12gb) runs large-v3 at 50–80× real-time — 1 hour of audio in 45–75 seconds. [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb): 60–100× real-time. [Intel Arc B580](/hardware/intel-arc-b580) via SYCL: 30–50× real-time.

**SMB tier ($1,500–2,500).** [RTX 4090](/hardware/rtx-4090) at 24 GB: 100–150× real-time — 1 hour of audio in 25–35 seconds. [RTX 5070](/hardware/rtx-5070) at 12 GB: the $/throughput pick at 80–120× real-time. [Apple M3 Ultra](/hardware/apple-m3-ultra): 5–8 concurrent large-v3 streams.

**Enterprise tier ($8,000+).** Enterprise GPUs are over-provisioned for ASR. For 100+ concurrent streams, deploy many small GPUs: 8× [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) (~$4,000) gives eight independent workers versus a single [H100 PCIe](/hardware/nvidia-h100-pcie) (~$25,000). Horizontal beats vertical.

**Memory scaling.** Model weights: tiny = 150 MB, medium = 3.1 GB, large-v3 = 6.2 GB (FP32; FP16 halves these). Any 8 GB+ GPU runs the largest Whisper model. Apple Neural Engine via CoreML reduces large-v3 to ~4 GB — any M-series Mac runs it on-device in near-real-time.

Runtime guidance

**whisper.cpp vs faster-whisper vs WhisperX — speed and feature tradeoffs.** whisper.cpp is a C/C++ port with no Python, no PyTorch, no CUDA toolkit. It runs on CPU with SIMD (AVX2/NEON) and offers GPU offload via CUDA/Metal/Vulkan/SYCL/ROCm through one binary. Performance: large-v3 on [RTX 4090](/hardware/rtx-4090) = ~100× real-time; on CPU = ~8× real-time. Tradeoffs: no streaming API, no diarization, no VAD — raw transcription only. Use it for batch processing with minimal setup.

faster-whisper wraps CTranslate2 for INT8-quantized inference, outperforming whisper.cpp by 15–25% on GPU and 30–50% on CPU. Large-v3 on [RTX 4090](/hardware/rtx-4090) = ~150× real-time. It supports streaming via chunked input with word-level timestamps for real-time captioning. Tradeoffs: Python dependency, a 5–10 minute install, and 50–100 ms/request overhead vs whisper.cpp's sub-10 ms startup.

WhisperX extends faster-whisper with forced phoneme alignment (wav2vec2) for word-level timestamps and optional speaker diarization (pyannote.audio). It outputs each word with start/end time and speaker label — the complete production data structure. On [RTX 4090](/hardware/rtx-4090): ~80× real-time for the full pipeline. Tradeoffs: phoneme alignment adds 300–500 MB VRAM and 20–30% processing time; diarization adds ~100 MB and 15–25%.

**Decision tree.** Batch transcription: whisper.cpp → medium.en ggml → `./main -m models/ggml-medium.en.bin -f audio.wav`. Live/streaming: faster-whisper → chunked WebSocket input → word timestamps → real-time captions. Full meeting transcripts: WhisperX → transcription + diarization → per-speaker segments. Batch throughput: whisper.cpp with full GPU offload behind a Redis job queue.

All three output standard formats (SRT/VTT/JSON): whisper.cpp via `--output-srt`, faster-whisper via Python JSON serialization, WhisperX via JSONL with word timestamps and speaker labels. Choose based on downstream format requirements.
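As an example of the format point, here is a small sketch that serializes faster-whisper segments to SRT. The timestamp helper is hand-rolled for illustration, not part of the library, and the file names are placeholders.

```python
from faster_whisper import WhisperModel

def srt_time(seconds: float) -> str:
    # SRT timestamps use HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

model = WhisperModel("medium.en", device="cuda", compute_type="float16")
segments, _ = model.transcribe("podcast.mp3")  # placeholder file

with open("podcast.srt", "w", encoding="utf-8") as out:
    for i, seg in enumerate(segments, start=1):
        out.write(f"{i}\n{srt_time(seg.start)} --> {srt_time(seg.end)}\n{seg.text.strip()}\n\n")
```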

Setup walkthrough

  1. pip install faster-whisper (CTranslate2-based Whisper — 4× faster than openai-whisper).
  2. Download a model: faster-whisper-large-v3 auto-downloads on first use (~3 GB).
  3. Create a script transcribe.py:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("meeting.mp3")
for seg in segments:
    print(f"[{seg.start:.1f}s-{seg.end:.1f}s] {seg.text}")
```

  4. Run: python transcribe.py. The first segments appear within seconds.
  5. CPU-only: use device="cpu", compute_type="int8" — 2–3× slower but works on any laptop.
  6. For forced-alignment word-level timestamps and speaker labels, swap to WhisperX (pip install whisperx); see the sketch below.
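If you take the WhisperX route in step 6, the sketch below follows its documented transcribe-then-align flow; the batch size, model name, and file path are illustrative choices rather than requirements.

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("meeting.mp3")  # placeholder file

# 1. Transcribe with the batched faster-whisper backend
model = whisperx.load_model("large-v3", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Forced phoneme alignment for word-level timestamps
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in aligned["segments"]:
    for word in segment["words"]:
        # Some words can come back without timestamps; guard accordingly
        if "start" in word:
            print(f'{word["start"]:7.2f}s  {word["word"]}')
```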

The cheap setup

Whisper large-v3 runs on CPU (Intel i5/Ryzen 5) at 0.2-0.5× real-time — a 1-hour meeting transcribes in 2-5 hours, fine for overnight batch jobs. A used GTX 1060 6 GB ($60) speeds this to 3-5× real-time via CUDA — a 1-hour meeting in 12-20 minutes. Any $300 laptop + a cheap USB microphone handles this. For real-time transcription, add a used GTX 1660 Super 6 GB ($100) for ~10× real-time on Whisper medium.en.

The serious setup

Used [RTX 3060 12GB](/hardware/rtx-3060-12gb) (~$200-250). Runs Whisper large-v3 at 15-20× real-time — a 1-hour meeting in ~3-4 minutes. WhisperX with diarization (speaker labels) at ~8-10× real-time. Can handle 8+ concurrent transcription streams. Pair with Ryzen 7 7700X + 32 GB DDR5 + 2TB NVMe. Total: ~$900-1,100. STT is compute-light — even an RTX 3060 is overkill for most users; prioritize storage for audio archives.

Common beginner mistake

The mistake: Using the original openai-whisper package with default settings and wondering why transcription takes 10× longer than expected. Why it fails: openai-whisper runs on PyTorch with no optimization — it's the reference implementation, not production-grade. The fix: install faster-whisper (CTranslate2 backend) — identical model weights, 4× faster, half the VRAM. Always set compute_type="int8" for CPU or "float16" for GPU. Model choice matters too: tiny.en or base.en handle clean English audio at 10-50× real-time with 90%+ accuracy. Reserve large-v3 for noisy or multilingual audio.

Recommended setup for Speech-to-Text (STT)


Reality check

Audio models are surprisingly forgiving on hardware. Whisper (via the original package, whisper.cpp, or faster-whisper) and Coqui STT all run well on 8-12 GB GPUs. The bottleneck is rarely the GPU; it's audio preprocessing and disk I/O for batch transcription.

Common mistakes

  • Overspending on GPU for audio-only workflows (8-12 GB is enough for Whisper)
  • Running audio + LLM concurrently without budgeting VRAM
  • Using fp32 weights when fp16 / int8 give 2-3x speedup with no quality loss
  • Forgetting audio preprocessing eats CPU cycles — a fast SSD helps more than expected



Hardware buying guidance for Speech-to-Text (STT)

Transcription and audio workloads are unusually low on VRAM — most buyers overspend here. The guides below frame the right tier honestly.

Specialized buyer guides