
Voice Activity Detection

Voice activity detection (VAD) separates speech from silence in an audio stream. Silero VAD is the usual open-weight choice: small, fast, and accurate.

Setup walkthrough

  1. pip install silero-vad pyaudio. Silero VAD is a widely used open-weight voice activity detection model (~2 MB); PyAudio provides the microphone capture used in the next step.
  2. Python script for real-time VAD:
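import pyaudio
import torch

# Downloads the model and its helper functions from the Silero repository on first run.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad", trust_repo=True)

SAMPLE_RATE = 16000
CHUNK = 512  # samples per read: 32 ms at 16 kHz

audio = pyaudio.PyAudio()
stream = audio.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE, input=True, frames_per_buffer=CHUNK)
try:
    while True:
        # Do not raise if the input buffer overflowed while the model was busy.
        chunk = stream.read(CHUNK, exception_on_overflow=False)
        # Convert 16-bit PCM bytes to float32 samples in [-1, 1).
        tensor = torch.frombuffer(bytearray(chunk), dtype=torch.int16).float() / 32768.0
        speech_prob = model(tensor, SAMPLE_RATE).item()
        if speech_prob > 0.5:
            print(f"Speech detected: {speech_prob:.2f}")
except KeyboardInterrupt:
    pass  # Ctrl+C ends the loop
finally:
    stream.stop_stream()
    stream.close()
    audio.terminate()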
  3. First detection arrives in under 10 ms. Silero VAD keeps up with real-time audio on a CPU and runs far faster than real time in batch mode.
  4. For file-based VAD: pip install silero-vad, then from silero_vad import load_silero_vad, read_audio, get_speech_timestamps. get_speech_timestamps returns the start and end of every speech segment in an audio file.
  5. Use cases: audio pre-processing before STT (skip silence), real-time voice activity for chatbots, meeting recording segmentation, audio stream monitoring.
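
The file-based flow from the steps above can be sketched as follows. This is a minimal sketch, assuming the silero-vad package exposes load_silero_vad, read_audio and get_speech_timestamps as shown in its README, and that timestamps come back as a list of dicts with start and end given in samples. The helper names (detect_speech, to_seconds, speech_ratio) and the sample segments are illustrative, not part of the library.

```python
def detect_speech(path, sampling_rate=16000):
    """Return speech segments for an audio file as dicts of sample offsets."""
    # Imported lazily so the helpers below work without the model installed.
    from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

    model = load_silero_vad()
    wav = read_audio(path, sampling_rate=sampling_rate)
    return get_speech_timestamps(wav, model, sampling_rate=sampling_rate)


def to_seconds(segments, sampling_rate=16000):
    """Convert [{'start': samples, 'end': samples}, ...] to (start_s, end_s) tuples."""
    return [(s["start"] / sampling_rate, s["end"] / sampling_rate) for s in segments]


def speech_ratio(segments, total_samples):
    """Fraction of the file that contains speech, i.e. what an STT stage still has to process."""
    voiced = sum(s["end"] - s["start"] for s in segments)
    return voiced / total_samples if total_samples else 0.0


if __name__ == "__main__":
    # Hand-written segments standing in for the output of detect_speech(...).
    example = [{"start": 16000, "end": 48000}, {"start": 80000, "end": 96000}]
    print(to_seconds(example))
    print(speech_ratio(example, total_samples=160000))
```

Keeping the conversion helpers separate from the model call makes it easy to hand only the voiced spans to a transcription step and skip the silence.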

The cheap setup

VAD is the lightest AI task on this list. Silero VAD (~2 MB) runs at 500-1000× real time on a typical laptop CPU, so a 1-hour audio file is processed in roughly 4-7 seconds. No GPU is needed. Even a Raspberry Pi 4 ($35) runs Silero VAD in real time, and a $200 Chromebook can handle production VAD. Because the compute cost is so low, hardware is rarely the bottleneck; code integration is the main challenge. If you can run Python, you can run VAD. Total cost to add VAD to a project: $0 in extra hardware and a few lines of Python.
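
Throughput figures like these vary by machine, so it may be worth measuring the real-time factor on your own hardware. The sketch below times an arbitrary per-chunk function and converts a real-time factor into batch processing time. The function names are hypothetical, and the workload in the demo is a stand-in, not the Silero model; swapping in a call to the model is left as an assumption in the comment.

```python
import time


def real_time_factor(process_chunk, chunks, chunk_seconds):
    """Return seconds of audio processed per second of wall-clock time."""
    start = time.perf_counter()
    for chunk in chunks:
        process_chunk(chunk)
    elapsed = time.perf_counter() - start
    audio_seconds = len(chunks) * chunk_seconds
    return audio_seconds / elapsed if elapsed > 0 else float("inf")


def batch_seconds(audio_seconds, rtf):
    """Wall-clock time needed to process audio_seconds at a given real-time factor."""
    return audio_seconds / rtf


if __name__ == "__main__":
    # Stand-in workload; to time the real model, pass something like
    # lambda c: model(c, 16000) with 512-sample tensors instead.
    fake_chunks = [[0.0] * 512 for _ in range(1000)]
    rtf = real_time_factor(sum, fake_chunks, chunk_seconds=512 / 16000)
    print(f"measured real-time factor: {rtf:.0f}x")
    # One hour of audio at the 500x and 1000x figures quoted above.
    print(batch_seconds(3600, 500), batch_seconds(3600, 1000))
```

At 500× and 1000× real time, one hour of audio works out to 7.2 and 3.6 seconds respectively.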

The serious setup

VAD doesn't need "serious" hardware. The model is 2 MB and runs on CPU at 500-1000× real-time. For production audio pipelines processing 1000+ concurrent streams (call center monitoring, broadcast VAD), any modern server CPU (AMD EPYC, Intel Xeon) handles it trivially. The scaling challenge is I/O and audio streaming infrastructure, not compute. Budget for audio routing (Kafka streams, WebRTC) and storage, not GPU. A $500 refurbished server with 32 cores handles 10,000+ simultaneous VAD streams. VAD is the poster child for "you don't need a GPU for AI."
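
The stream-count claim can be sanity-checked with simple arithmetic. The sketch below assumes the 500-1000× figure applies per core and reserves a fraction of the CPU for I/O and audio decoding; both the function name and the 50% headroom value are illustrative assumptions, not measurements.

```python
def max_streams(cores, rtf_per_core, headroom=0.5):
    """Estimate how many real-time audio streams a CPU can score concurrently.

    headroom is the fraction of CPU kept free for I/O, decoding and bursts
    (an assumed value, not a measured one).
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    return int(cores * rtf_per_core * (1 - headroom))


if __name__ == "__main__":
    for rtf in (500, 1000):
        print(rtf, max_streams(cores=32, rtf_per_core=rtf))
```

With 32 cores and half the CPU held in reserve, the estimate lands between 8,000 and 16,000 streams, which brackets the 10,000 figure and suggests compute is not the limiting factor.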

Common beginner mistake

The mistake: Using VAD with the default threshold (0.5) and wondering why half the speech is cut off or why background noise triggers false positives.

Why it fails: VAD thresholds are environment-dependent. In a quiet studio, a threshold as low as 0.1 can reliably separate speech from silence. In a noisy café, 0.7 might be the minimum needed to avoid false triggers on background chatter. The 0.5 default is a reasonable starting point for clean speech, not a value tuned to your environment.

The fix: Tune the threshold for your environment. Record 1 minute of ambient noise and 1 minute of target speech, then plot speech_prob over time for both. Set the threshold above the highest noise probability and below the lowest speech probability. For dynamic environments (varying noise levels), re-estimate the threshold periodically from recent non-speech audio or pair the VAD with a noise estimator; smoothing parameters such as min_silence_duration_ms in get_speech_timestamps also reduce flicker. A one-size threshold fits no environment.
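
The tuning procedure above can be sketched in a few lines. The code assumes you have already collected per-chunk speech_prob values from the two recordings; pick_threshold and percentile are hypothetical helpers, and the percentile arguments are an added option for ignoring outliers such as pauses inside the speech recording.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in [0, 100]."""
    ordered = sorted(values)
    rank = round(pct / 100 * (len(ordered) - 1))
    return ordered[rank]


def pick_threshold(noise_probs, speech_probs, noise_pct=100, speech_pct=0):
    """Choose a threshold midway between the noise and speech probabilities.

    The defaults use the highest noise value and the lowest speech value.
    Returns (threshold, separable); separable is False when the two ranges
    overlap, meaning no single threshold splits them cleanly.
    """
    if not noise_probs or not speech_probs:
        raise ValueError("need at least one probability from each recording")
    noise_hi = percentile(noise_probs, noise_pct)
    speech_lo = percentile(speech_probs, speech_pct)
    return (noise_hi + speech_lo) / 2, noise_hi < speech_lo


if __name__ == "__main__":
    # Made-up probabilities standing in for values logged from real recordings.
    noise = [0.02, 0.10, 0.30]
    speech = [0.60, 0.85, 0.95]
    print(pick_threshold(noise, speech))
```

If separable comes back False, loosening the percentiles (for example noise_pct=99, speech_pct=5) shows whether a few outliers cause the overlap or whether the environment needs noise suppression before VAD.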

Reality check

Audio models are surprisingly forgiving on hardware. Whisper, Coqui TTS, and whisper.cpp all run well on 8-12 GB GPUs. The bottleneck is rarely the GPU; it is usually audio preprocessing and disk I/O for batch transcription.

Common mistakes

  • Overspending on GPU for audio-only workflows (8-12 GB is enough for Whisper)
  • Running audio + LLM concurrently without budgeting VRAM
  • Using fp32 weights when fp16 / int8 typically give a 2-3x speedup with little to no quality loss
  • Forgetting audio preprocessing eats CPU cycles — a fast SSD helps more than expected
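
A rough VRAM budget for the weights alone can be worked out from parameter count and precision. The sketch below is an approximation under stated assumptions: the function names are hypothetical, the 7-billion-parameter LLM is an example rather than a specific model, the roughly 1.55 billion parameters are those of Whisper large, and overhead_gb is a placeholder for activations, caches and runtime buffers that should be measured in practice.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_gb(n_params, precision):
    """Memory for model weights in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


def fits(budget_gb, models, overhead_gb=2.0):
    """Check whether several models fit in a VRAM budget.

    models is a list of (n_params, precision) pairs; overhead_gb is an
    assumed allowance for everything that is not weights.
    Returns (fits, needed_gb).
    """
    needed = sum(weight_gb(n, p) for n, p in models) + overhead_gb
    return needed <= budget_gb, needed


if __name__ == "__main__":
    whisper_large = 1.55e9   # approximate parameter count
    example_llm = 7e9        # hypothetical LLM size
    print(weight_gb(whisper_large, "fp32"), weight_gb(whisper_large, "fp16"))
    print(fits(12, [(whisper_large, "fp16"), (example_llm, "int8")]))
```

Halving the precision halves the weight footprint, which is often the difference between an audio model and an LLM sharing one 12 GB card or not.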

Before you buy

Verify your specific hardware can handle voice activity detection before committing money.
