RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /Courses
  5. /Voice AI with Local Models
  6. /Ch. 4
Voice AI with Local Models

04. STT Accuracy Tuning

Chapter 4 of 22 · 20 min
KEY INSIGHT

Audio preprocessing and language specification provide accuracy gains equal to model size upgrades at zero computational cost.

Transcription accuracy depends on audio preprocessing, language specification, and decoding parameters. Whisper's transcription function accepts configuration options that significantly affect output quality.

Audio sample rate matters. Whisper models expect 16kHz mono input. Higher sample rates consume more memory without accuracy benefit. Convert audio to the expected format:

import librosa

def preprocess_audio(audio_path, target_sr=16000):
    audio, sr = librosa.load(audio_path, sr=target_sr, mono=True)
    return audio.astype(np.float32)

Specifying the language improves accuracy by constraining the vocabulary:

result = model.transcribe(
    audio,
    language="english",
    initial_prompt="This is a technical conversation about AI systems.",
    temperature=0.0,
    condition_on_previous_text=True
)

The initial_prompt parameter provides context for the model to calibrate its internal language model. This helps with domain-specific terminology and improves consistency across long recordings.

Voice Activity Detection (VAD) separates speech segments from silence and noise. Whisper includes a simple energy-based VAD, but specialized VAD models improve performance:

import torch
from functools import cached_property

class SileroVAD:
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self._model = None
    
    @cached_property
    def model(self):
        model = torch.jit.load("silero-vad.jit")
        model.eval()
        return model
    
    def get_speech_timestamps(self, audio, sample_rate=16000):
        speech = self.model(
            torch.from_numpy(audio).float(),
            sample_rate,
            threshold=self.threshold
        )
        return speech[0]["timestamp"]

Common accuracy issues include homophones (there/their/they're), proper nouns, and technical terms. Address homophones through context-aware post-processing. Handle proper nouns by adding them to a custom vocabulary and using beam search with a temperature of zero.

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Compare transcription accuracy with and without language specification and initial_prompt on a domain-specific audio clip. Count and categorize errors. (10 minutes)

← Chapter 3
Whisper Model Selection
Chapter 5 →
Voice Activity Detection