Automatic Speech Recognition (ASR)

Automatic Speech Recognition (ASR) converts spoken audio into text. Operators encounter ASR when running models such as Whisper locally to transcribe meetings, voice notes, or live audio. ASR models take audio waveforms or spectrograms as input and emit text tokens. Latency and VRAM usage depend on model size (e.g., Whisper tiny vs. large), numeric precision, and whether the model runs on GPU or CPU. Real-time transcription requires low latency; a target of under about 300 ms per utterance is typical for interactive use.
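To make the link between model size and VRAM concrete, the sketch below estimates the memory taken by the weights alone. The parameter counts are the approximate figures published for the Whisper family; the function name and the precision table are illustrative. Real usage is noticeably higher than this floor because activations, decoding caches, and runtime overhead are not counted, which is why even the tiny model is usually quoted at around 1 GB.

```python
# Approximate Whisper parameter counts, in millions (published model sizes).
WHISPER_PARAMS_M = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large": 1550}

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(model: str, precision: str = "fp16") -> float:
    """Memory for the weights only, in GB (10**9 bytes). A lower bound on VRAM."""
    params = WHISPER_PARAMS_M[model] * 1_000_000
    return params * BYTES_PER_PARAM[precision] / 1e9


for name in WHISPER_PARAMS_M:
    print(f"{name:>6}: {weight_memory_gb(name):.2f} GB of FP16 weights")
```

Under these assumptions the large model needs about 3.1 GB for FP16 weights before any working memory is added, so halving the precision roughly halves that floor.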

Deeper dive

Traditional ASR systems consist of an acoustic model, a language model, and a decoder. The acoustic model maps audio features to phonemes or subword units; the language model scores candidate text sequences; the decoder combines the two to produce the final transcription. Modern end-to-end models such as OpenAI Whisper replace this pipeline with a single encoder-decoder transformer that maps log-Mel spectrograms directly to text tokens. Whisper supports multiple languages and produces punctuated output.

Operators can run Whisper via whisper.cpp (the Whisper counterpart to llama.cpp, which itself targets text LLMs), faster-whisper, or Hugging Face Transformers. Key parameters include model size (tiny: ~1 GB VRAM; large-v3: roughly 6 to 10 GB VRAM depending on runtime and precision), beam size (a larger beam usually improves accuracy but increases latency), and language (auto-detected or set explicitly). For real-time use, smaller models with greedy decoding are preferred. ASR quality degrades with background noise, overlapping speech, and accents that are underrepresented in the training data.
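The trade-off between greedy decoding and beam search can be seen in a toy example. The sketch below is not Whisper's decoder: TOY_MODEL is a made-up table of next-token probabilities, and beam_search is a minimal illustration of the general technique. A beam size of 1 is equivalent to greedy decoding.

```python
import math

# Hypothetical next-token probabilities, keyed by the tokens decoded so far.
TOY_MODEL = {
    (): {"wreck": 0.55, "recognize": 0.45},
    ("wreck",): {"a": 0.4, "it": 0.3, "the": 0.3},
    ("recognize",): {"speech": 0.9, "beach": 0.1},
}


def beam_search(next_probs, steps, beam_size):
    """Return the best (tokens, log_prob) found after a fixed number of steps."""
    beams = [((), 0.0)]  # each beam is (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            # Extend every surviving prefix by every possible next token.
            for token, p in next_probs[prefix].items():
                candidates.append((prefix + (token,), score + math.log(p)))
        # Keep only the beam_size highest-scoring prefixes.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]


for size in (1, 2):
    tokens, score = beam_search(TOY_MODEL, steps=2, beam_size=size)
    print(f"beam_size={size}: {' '.join(tokens)} (p={math.exp(score):.3f})")
```

With a beam of 1 the search commits to "wreck" because it is the best first token, and ends with "wreck a" at probability 0.22. With a beam of 2 it keeps "recognize" alive and finds "recognize speech" at 0.405. The cost is that each step scores beam_size times as many continuations, which is where the extra latency comes from.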

Practical example

An operator with an RTX 3060 (12 GB VRAM) can run Whisper large-v3 at FP16 (~6 GB VRAM) with a beam size of 5 and reach roughly 2x real-time speed, transcribing 1 minute of audio in about 30 seconds. Whisper tiny (FP16, ~1 GB VRAM) on the same GPU reaches roughly 10x real-time speed, or about 6 seconds per minute of audio, at the cost of lower accuracy, particularly on accented speech. These figures are approximate and vary with the runtime and the audio itself.
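The speed figures above translate into wall-clock time with simple arithmetic. The sketch below uses the 2x and 10x speeds quoted in this example; the function names are illustrative. The real-time factor (RTF) is processing time divided by audio duration, so values below 1.0 mean the model keeps up with live audio.

```python
def processing_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Wall-clock time to transcribe audio at a given multiple of real time."""
    return audio_seconds / speed_factor


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    return processing_s / audio_s


AUDIO_S = 60.0  # one minute of audio
for label, speed in (("large-v3", 2.0), ("tiny", 10.0)):
    took = processing_seconds(AUDIO_S, speed)
    print(f"{label}: {took:.0f} s for 60 s of audio, RTF {real_time_factor(took, AUDIO_S):.2f}")
```

To measure the speed factor on a given machine, time one transcription run and divide the audio duration by the elapsed time.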

Workflow example

LM Studio and Ollama are built around text-generation models. To our knowledge neither ships a Whisper model, and a command such as ollama run whisper-large-v3 is not expected to work, so check each tool's current model library before planning around it. A dedicated ASR runtime is the safer route: whisper.cpp provides a command-line tool for audio files and a streaming example for microphone input, and faster-whisper exposes a Python API. In Python with Hugging Face Transformers: from transformers import pipeline; transcriber = pipeline('automatic-speech-recognition', model='openai/whisper-large-v3'); result = transcriber('audio.mp3'). The transcription is then available as result['text'].
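The Transformers one-liner can be expanded into a small script. This is a sketch under a few assumptions: the transformers package and a backend such as PyTorch are installed, ffmpeg is available for decoding compressed audio, and the model downloads on first use. The helper names transcribe and format_chunks are illustrative and not part of the library.

```python
def transcribe(path, model_id="openai/whisper-large-v3"):
    """Transcribe an audio file and return the pipeline's result dict."""
    # Imported lazily so format_chunks below works without transformers installed.
    from transformers import pipeline

    transcriber = pipeline("automatic-speech-recognition", model=model_id)
    # return_timestamps=True requests per-segment timestamps; the result is
    # expected to hold "text" plus a "chunks" list of timestamped segments.
    return transcriber(path, return_timestamps=True)


def _fmt(seconds):
    """Format a timestamp, tolerating a missing value."""
    return f"{seconds:.1f}" if seconds is not None else "?"


def format_chunks(result):
    """Render timestamped chunks as '[start-end] text' lines."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{_fmt(start)}-{_fmt(end)}] {chunk['text'].strip()}")
    return "\n".join(lines)
```

A typical call would be result = transcribe("audio.mp3") followed by print(format_chunks(result)). Passing a smaller model_id such as "openai/whisper-tiny" trades accuracy for speed and memory.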

Reviewed by Fredoline Eruo. See our editorial policy.

When it doesn't work