Voice Activity Detection — Voice AI with Local Models (Chapter 5)

Voice Activity Detection (VAD) identifies the segments of an audio stream that contain human speech. Streaming voice AI uses VAD as a gate so that heavier downstream stages, such as speech recognition, run only while someone is speaking instead of on every chunk.

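Silero VAD is a small neural model that is accurate while adding little computational overhead. It operates on short audio chunks and returns a speech probability between 0 and 1 for each one. Comparing that probability with a threshold controls sensitivity: a higher threshold rejects more borderline audio.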
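
The thresholding step can be sketched without the model itself. The example below assumes a list of per-chunk speech probabilities is already available; the function name `speech_segments` and the sample values are illustrative and not part of Silero VAD.

```python
def speech_segments(probs, threshold=0.5):
    """Return (start, end) chunk-index pairs where prob >= threshold.

    `end` is exclusive, so a segment covers chunks start .. end-1.
    """
    segments = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                      # speech begins at this chunk
        elif p < threshold and start is not None:
            segments.append((start, i))    # speech ended before this chunk
            start = None
    if start is not None:                  # input ended while still in speech
        segments.append((start, len(probs)))
    return segments


probs = [0.1, 0.2, 0.7, 0.9, 0.6, 0.3, 0.1, 0.8, 0.9]
print(speech_segments(probs, threshold=0.5))   # [(2, 5), (7, 9)]
print(speech_segments(probs, threshold=0.85))  # [(3, 4), (8, 9)]
```

Raising the threshold shrinks both segments, which is the sensitivity trade-off in miniature.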

Create a VAD-enabled audio streamer:

import torch
import numpy as np
import pyaudio
from queue import Queue

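class VADAudio:
    # Illustrative mapping from aggressiveness (0-3) to a speech-probability
    # threshold. These values are assumptions to tune, not Silero defaults.
    THRESHOLDS = (0.35, 0.5, 0.65, 0.8)

    def __init__(self, aggressiveness=3):
        # torch.hub.load returns the model plus a tuple of helper functions.
        self.model, utils = torch.hub.load(
            "snakers4/silero-vad",
            "silero_vad"
        )
        self.get_speech_timestamps = utils[0]

        self.sample_rate = 16000
        self.chunk_samples = 512  # 32 ms at 16 kHz (1536 samples would be 96 ms)
        self.aggressiveness = aggressiveness
        self.threshold = self.THRESHOLDS[aggressiveness]
        self.audio_queue = Queue()  # chunks classified as speech
        self.buffer = np.zeros(int(self.sample_rate * 0.5), dtype=np.float32)

    def start(self):
        # Keep the PyAudio instance so stop() can release it.
        self.pa = pyaudio.PyAudio()
        self.stream = self.pa.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunk_samples,
            stream_callback=self._callback
        )
        self.stream.start_stream()

    def _callback(self, input_data, frame_count, time_info, status):
        # Convert 16-bit PCM to float32 in [-1, 1).
        audio = np.frombuffer(input_data, dtype=np.int16)
        audio = audio.astype(np.float32) / 32768.0

        # Score the chunk; gradients are not needed for inference.
        with torch.no_grad():
            speech_prob = self.model(
                torch.from_numpy(audio).unsqueeze(0),
                self.sample_rate
            ).item()

        # Hand chunks that look like speech to the consumer.
        if speech_prob >= self.threshold:
            self.audio_queue.put(audio)

        # Keep a rolling window of the most recent 5 seconds.
        self.buffer = np.append(self.buffer, audio)
        if len(self.buffer) > self.sample_rate * 5:
            self.buffer = self.buffer[-self.sample_rate * 5:]

        # Input-only stream: there is no output data to return.
        return (None, pyaudio.paContinue)

    def stop(self):
        self.stream.stop_stream()
        self.stream.close()
        self.pa.terminate()

vad = VADAudio(aggressiveness=3)
vad.start()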
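
Audio callbacks run on a separate thread and are expected to return quickly, so one common arrangement is to have the callback only enqueue raw bytes and let a worker thread do the scoring. The sketch below shows that shape using standard-library pieces only. `score_chunk` is a hypothetical stand-in (a simple energy measure) for the model call, `on_audio` plays the role of the stream callback, and the 0.1 threshold is arbitrary.

```python
import array
import threading
from queue import Queue

raw_chunks = Queue()      # filled by the audio callback
speech_chunks = Queue()   # filled by the worker
STOP = object()           # sentinel that tells the worker to exit


def score_chunk(pcm_bytes):
    """Hypothetical stand-in for the VAD model: mean absolute amplitude in [0, 1]."""
    samples = array.array("h", pcm_bytes)  # 16-bit signed PCM, native byte order
    if not samples:
        return 0.0
    return sum(abs(s) for s in samples) / (len(samples) * 32768.0)


def on_audio(pcm_bytes):
    """What the stream callback would do: enqueue and return immediately."""
    raw_chunks.put(pcm_bytes)


def worker(threshold=0.1):
    """Score queued chunks off the audio thread and keep the speech-like ones."""
    while True:
        chunk = raw_chunks.get()
        if chunk is STOP:
            break
        if score_chunk(chunk) >= threshold:
            speech_chunks.put(chunk)


t = threading.Thread(target=worker, daemon=True)
t.start()

# Simulate four callbacks: two silent chunks and two loud ones.
quiet = array.array("h", [0] * 512).tobytes()
loud = array.array("h", [16384] * 512).tobytes()
for chunk in (quiet, loud, quiet, loud):
    on_audio(chunk)
raw_chunks.put(STOP)
t.join()

print(speech_chunks.qsize())  # 2
```

With a real model, only `score_chunk` would change; the callback stays trivial regardless of how long inference takes.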

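The aggressiveness parameter (0-3) sets how readily frames are rejected as non-speech: higher values reject more. Silero VAD itself reports a speech probability per chunk rather than a discrete mode, so a level only takes effect once it is translated into a probability threshold. Experiment to find the setting that matches your background noise level.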
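
One way to run that experiment is to score a recording that contains only background noise and pick the lowest level that keeps false triggers rare. The sketch below assumes such per-chunk probabilities have already been collected. The level-to-threshold table, the 2% target, and the sample scores are all illustrative assumptions.

```python
# Hypothetical mapping from aggressiveness level (0-3) to probability threshold.
LEVEL_THRESHOLDS = {0: 0.35, 1: 0.5, 2: 0.65, 3: 0.8}


def false_trigger_rate(noise_probs, threshold):
    """Fraction of noise-only chunks that would be accepted as speech."""
    hits = sum(1 for p in noise_probs if p >= threshold)
    return hits / len(noise_probs)


def pick_level(noise_probs, max_rate=0.02):
    """Lowest level whose false-trigger rate is at or below max_rate, else 3."""
    for level in sorted(LEVEL_THRESHOLDS):
        if false_trigger_rate(noise_probs, LEVEL_THRESHOLDS[level]) <= max_rate:
            return level
    return 3


# Made-up scores from a noisy room: mostly low, with a few spikes.
noise_probs = [0.05] * 90 + [0.4] * 6 + [0.55] * 3 + [0.7] * 1

for level, thr in LEVEL_THRESHOLDS.items():
    print(level, thr, false_trigger_rate(noise_probs, thr))
print("chosen level:", pick_level(noise_probs))  # chosen level: 2
```

Choosing the lowest acceptable level, rather than always the highest, avoids rejecting quiet speech unnecessarily.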

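Chunk-based processing introduces latency, because a chunk cannot be scored until all of its samples have arrived. Longer chunks give the model more context but delay the response to speech onset: at 16 kHz, 512 samples span 32 ms and 1536 samples span 96 ms. A 1536-sample frame is a reasonable balance for conversational applications, provided the installed Silero VAD release accepts that size; the stream above requests 512-sample buffers.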
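
The latency cost is simple arithmetic: the time to collect one chunk is its sample count divided by the sample rate. A small helper, with an illustrative name, makes the trade-off concrete at 16 kHz:

```python
def chunk_duration_ms(samples, sample_rate=16000):
    """Time needed to capture one chunk, i.e. the minimum delay it adds."""
    return 1000.0 * samples / sample_rate


for samples in (512, 1024, 1536):
    print(samples, chunk_duration_ms(samples))
# 512 32.0
# 1024 64.0
# 1536 96.0
```

Model inference time comes on top of these figures, so measured onset latency will be somewhat higher.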

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
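
A short script can capture most of that record automatically. This is a minimal sketch using only the standard library; the function name and the default package list are assumptions, so adjust them to whatever the exercise actually imports.

```python
import json
import platform
import sys
from importlib import metadata


def environment_record(packages=("torch", "numpy", "pyaudio")):
    """Collect interpreter, OS, and package versions for a reproducibility note."""
    record = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": {},
    }
    for name in packages:
        try:
            record["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Recorded explicitly so a missing dependency is visible in the note.
            record["packages"][name] = "not installed"
    return record


print(json.dumps(environment_record(), indent=2))
```

Paste the printed JSON beside the observed output, and add the data path and CPU/GPU backend by hand.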