RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo

© 2026 runlocalai.co · Independently operated
Voice AI with Local Models

03. Whisper Model Selection

Chapter 3 of 22 · 15 min
KEY INSIGHT

Model size choice depends on deployment hardware and acceptable word error rate for the specific use case.

Whisper offers five model sizes: tiny, base, small, medium, and large. Approximate VRAM requirements are about 1 GB for tiny and base, 2 GB for small, 5 GB for medium, and 10 GB for large. Each step up in size buys accuracy at the cost of inference speed and memory.

Model selection hinges on deployment context. Desktop applications with a dedicated GPU can use the large model for the highest accuracy. Mobile or embedded deployments usually need tiny or base to reach acceptable latency.
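As a rough illustration of matching a model to hardware, the sketch below picks the largest size whose approximate VRAM need fits a given budget. The VRAM figures are the approximate ones listed earlier in this chapter; the function name and the 1 GB headroom default are assumptions made for this example, not part of the Whisper package.

```python
# Approximate VRAM needs in GB, from the size list earlier in this chapter.
APPROX_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}


def largest_model_that_fits(available_vram_gb, headroom_gb=1.0):
    """Return the largest model size that fits the VRAM budget, or None.

    headroom_gb is a hypothetical safety margin for other GPU users
    (display, audio buffers, other processes); tune it for your system.
    """
    budget = available_vram_gb - headroom_gb
    best = None
    # Dicts preserve insertion order, so this walks from smallest to largest.
    for name, need_gb in APPROX_VRAM_GB.items():
        if need_gb <= budget:
            best = name
    return best


print(largest_model_that_fits(8))    # medium
print(largest_model_that_fits(4))    # small
print(largest_model_that_fits(1.5))  # None
```

Treat the result as a starting point only. Measured speed and accuracy on the target hardware should decide the final choice.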

Benchmark transcription speed on target hardware:

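import time

import torch
import whisper

audio_path = "sample.wav"

# Fall back to CPU when no CUDA GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

for model_size in ["tiny", "base", "small", "medium"]:
    # Load outside the timed region so only transcription is measured.
    model = whisper.load_model(model_size, device=device)

    start = time.perf_counter()
    result = model.transcribe(audio_path)
    elapsed = time.perf_counter() - start

    print(f"{model_size}: {elapsed:.2f}s, text: {result['text'][:50]}")

    # Release the model before loading the next size.
    del model
    if device == "cuda":
        torch.cuda.empty_cache()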

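The tiny, base, small, and medium sizes each come in two variants: English-only (names ending in .en, such as base.en) and multilingual. The large model is multilingual only. The two variants use different tokenizers, but their vocabularies are nearly the same size, so choosing the multilingual variant does not meaningfully change VRAM requirements.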
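In the openai-whisper package, English-only checkpoints carry a .en suffix (tiny.en, base.en, small.en, medium.en), and large has no English-only counterpart. A minimal sketch of that naming rule follows; the helper name checkpoint_name is hypothetical and exists only for this example.

```python
ALL_SIZES = ("tiny", "base", "small", "medium", "large")
ENGLISH_ONLY_SIZES = ("tiny", "base", "small", "medium")


def checkpoint_name(size, english_only=False):
    """Build the name to pass to whisper.load_model for a size and variant."""
    if size not in ALL_SIZES:
        raise ValueError(f"unknown model size: {size!r}")
    if not english_only:
        return size
    if size not in ENGLISH_ONLY_SIZES:
        raise ValueError(f"{size!r} has no English-only variant")
    return f"{size}.en"


print(checkpoint_name("base", english_only=True))  # base.en
print(checkpoint_name("large"))                    # large
```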

Mixed-language audio requires a multilingual checkpoint; the English-only variants (names ending in .en) cannot transcribe other languages. All multilingual sizes share the same tokenizer, but tiny and base have far less capacity and may produce poor transcriptions in code-switching scenarios.

Fine-tuning also affects model selection. Adapting Whisper to a domain requires training on domain-specific audio with matching transcripts, and larger models need more GPU memory and training time. Fine-tuned checkpoints published by community projects can serve as starting points for further adaptation.

Memory-constrained environments can employ quantization. The whisper.cpp project (from the author of llama.cpp) supports integer-quantized Whisper models, including 8-bit, but accuracy degradation varies by audio domain. Evaluate transcription quality against an unquantized baseline before deployment.
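One common way to compare a quantized model against a baseline is word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal pure-Python version. It assumes whitespace tokenization and lowercasing only; real evaluations usually also normalize punctuation and numerals.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("turn on the lights", "turn of the light"))  # 0.5
```

Running the same reference transcripts through the baseline and the quantized model, then comparing the two WER values, shows how much accuracy the quantization costs on your audio.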

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Time transcription of the same 60-second audio file with tiny, base, small, and medium models. Calculate real-time factor (transcription time / audio duration) for each. (15 minutes)
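A small helper for the real-time factor calculation might look like the sketch below. It assumes the audio is an uncompressed WAV file readable by Python's standard wave module; the function names are made up for this example.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(transcription_seconds, audio_seconds):
    """Transcription time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return transcription_seconds / audio_seconds


# A 60-second clip transcribed in 6 seconds:
print(real_time_factor(6.0, 60.0))  # 0.1
```

A real-time factor below 1.0 means the model transcribes faster than the audio plays, which live or streaming use generally requires.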
