Whisper Model Selection — Voice AI with Local Models (Chapter 3)

Whisper's original release offers five model sizes, listed here with their approximate VRAM requirements: tiny (~1 GB), base (~1 GB), small (~2 GB), medium (~5 GB), and large (~10 GB). Larger models are more accurate but slower and need more memory.

Model selection hinges on deployment context. Desktop applications with a dedicated GPU can use the large model for the highest accuracy. Mobile or embedded deployments usually need tiny or base to reach acceptable latency.
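The selection logic above can be sketched as a small lookup. This is only an illustration: the helper name pick_model_size and the headroom_gb parameter are hypothetical, and the VRAM figures are the approximate values quoted in this section, not measurements from your hardware.

```python
from typing import Optional

# Approximate VRAM needs in GB, taken from the figures quoted in this section.
APPROX_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}


def pick_model_size(available_vram_gb: float, headroom_gb: float = 0.5) -> Optional[str]:
    """Return the largest size that fits the budget, or None if nothing fits.

    headroom_gb is a hypothetical safety margin for activations and other
    processes sharing the device.
    """
    budget = available_vram_gb - headroom_gb
    # Walk from the largest model down to the smallest.
    for size in reversed(list(APPROX_VRAM_GB)):
        if APPROX_VRAM_GB[size] <= budget:
            return size
    return None


if __name__ == "__main__":
    for vram in (12, 6, 1.5, 0.5):
        print(f"{vram} GB -> {pick_model_size(vram)}")
```

Treat the result as a starting candidate and confirm it with a benchmark on the target device.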

Benchmark transcription speed on target hardware:

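import time

import torch
import whisper

audio_path = "sample.wav"
# Fall back to CPU when no CUDA device is present.
device = "cuda" if torch.cuda.is_available() else "cpu"

for model_size in ["tiny", "base", "small", "medium"]:
    model = whisper.load_model(model_size, device=device)

    # Time only the transcription call, not model loading.
    start = time.perf_counter()
    result = model.transcribe(audio_path)
    elapsed = time.perf_counter() - start

    print(f"{model_size}: {elapsed:.2f}s, text: {result['text'][:50]}")

    # Release the model before loading the next size.
    del model
    if device == "cuda":
        torch.cuda.empty_cache()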
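Elapsed time alone depends on clip length, so it can help to normalize it by audio duration. The sketch below computes a real-time factor (processing time divided by audio length). The helper name real_time_factor is hypothetical and the numbers in the usage lines are made up for illustration.

```python
def real_time_factor(elapsed_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return elapsed_seconds / audio_seconds


if __name__ == "__main__":
    # Illustrative values only: a 60 s clip transcribed in 12 s.
    # With openai-whisper installed, the duration could be derived as
    # len(whisper.load_audio(path)) / 16000, since load_audio resamples to 16 kHz.
    print(f"RTF: {real_time_factor(12.0, 60.0):.2f}")
```

Comparing real-time factors across model sizes makes results from clips of different lengths comparable.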

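Whisper ships English-only checkpoints (tiny.en, base.en, small.en, medium.en) alongside the multilingual ones; the large model is multilingual only. The two families use different tokenizers, but their vocabularies are nearly the same size, so choosing a multilingual checkpoint does not by itself add meaningful VRAM. Memory use is driven by model size.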
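In the openai-whisper package, English-only checkpoints are named with a ".en" suffix and exist for the tiny, base, small, and medium sizes. A minimal sketch of resolving the name to pass to whisper.load_model follows. The helper checkpoint_name is hypothetical.

```python
# Sizes that have an English-only ".en" checkpoint in openai-whisper.
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}


def checkpoint_name(size: str, english_only: bool = False) -> str:
    """Build the model name string for whisper.load_model (hypothetical helper)."""
    if not english_only:
        return size
    if size not in ENGLISH_ONLY_SIZES:
        raise ValueError(f"no English-only checkpoint for size {size!r}")
    return f"{size}.en"


if __name__ == "__main__":
    print(checkpoint_name("base", english_only=True))
    print(checkpoint_name("large"))
```

The returned string can then be used as the first argument of whisper.load_model.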

Mixed-language audio requires a multilingual checkpoint. All multilingual sizes share the same tokenizer, so tiny and base are limited by model capacity rather than vocabulary, and they may produce poor transcriptions in code-switching scenarios. Small and larger models tend to handle such audio better.

Fine-tuning considerations also affect model selection. Adapting a model to a domain requires training on domain-specific audio with matching transcripts, and the cost grows with model size. Checkpoints published by community projects can serve as starting points for adaptation.

Memory-constrained environments can employ quantization. The whisper.cpp project supports integer-quantized Whisper models, but accuracy degradation varies by audio domain. Evaluate transcription quality against an unquantized baseline before deployment.
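One common way to compare a quantized model against a baseline is word error rate (WER). The sketch below is a plain-Python version that uses word-level edit distance. It assumes simple lowercasing and whitespace tokenization, which is cruder than the text normalization a production evaluation would use. The function name word_error_rate is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    baseline = "the quick brown fox jumps"
    quantized = "the quick brown box jumps"
    print(f"WER: {word_error_rate(baseline, quantized):.2f}")
```

Here the reference can be a human transcript or, when none exists, the output of the unquantized model on the same audio.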

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
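A small helper can capture the environment details for that record. This is a sketch using only the standard library. The function name describe_environment is hypothetical, and the default package names (openai-whisper, torch) are assumptions about what is installed. A missing package is reported as None.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages=("openai-whisper", "torch")):
    """Collect the Python version, platform, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    print(json.dumps(describe_environment(), indent=2))
```

Save the printed JSON next to the observed output, together with the data path and the model size used.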