05. Voice Activity Detection
Voice Activity Detection (VAD) identifies the segments of an audio stream that contain human speech. Streaming voice AI uses VAD to trigger downstream processing only when someone speaks, rather than running the full pipeline on silence and background noise.
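Silero VAD provides high accuracy with minimal computational overhead. The model operates on short audio chunks and produces a speech probability score between 0 and 1 for each chunk. A threshold on that score controls sensitivity: a lower threshold catches more speech but admits more noise, and a higher threshold does the opposite.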
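The thresholding step can be sketched without the model at all. The helper below is hypothetical (the name speech_segments, the 0.5 default, and the 32 ms chunk length are assumptions for illustration, not part of Silero VAD), and the scores are made-up values rather than real model output. It groups consecutive chunks whose score meets the threshold into time spans.

```python
def speech_segments(probs, threshold=0.5, chunk_ms=32):
    """Group consecutive chunks scoring >= threshold into (start_ms, end_ms) spans.

    chunk_ms=32 assumes 512-sample chunks at 16 kHz.
    """
    segments = []
    start = None  # index of the first chunk in the current speech run
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # speech run begins
        elif p < threshold and start is not None:
            segments.append((start * chunk_ms, i * chunk_ms))  # run ends
            start = None
    if start is not None:
        # The stream ended while speech was still active.
        segments.append((start * chunk_ms, len(probs) * chunk_ms))
    return segments


# Illustrative, made-up scores for eight consecutive chunks.
scores = [0.02, 0.10, 0.85, 0.93, 0.40, 0.75, 0.05, 0.03]
print(speech_segments(scores, threshold=0.5))  # [(64, 128), (160, 192)]
print(speech_segments(scores, threshold=0.3))  # [(64, 192)]
```

Lowering the threshold from 0.5 to 0.3 merges the two short spans into one, which shows how the setting trades missed speech against false triggers.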
Create a VAD-enabled audio streamer:
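import numpy as np
import pyaudio
import torch
from queue import Queue


class VADAudio:
    def __init__(self, aggressiveness=3, threshold=0.5):
        # torch.hub.load returns the model plus a tuple of helper functions.
        self.model, utils = torch.hub.load(
            "snakers4/silero-vad",
            "silero_vad"
        )
        self.get_speech_timestamps = utils[0]
        self.sample_rate = 16000
        self.frame_duration = 1536  # samples, ~96ms at 16kHz (not used by the stream below)
        self.chunk_samples = 512  # samples per callback, 32ms at 16kHz
        self.aggressiveness = aggressiveness  # stored only; never passed to the model
        # Assumption: chunks scoring at or above this probability count as speech.
        self.threshold = threshold
        self.audio_queue = Queue()  # receives chunks classified as speech
        self.buffer = np.zeros(int(self.sample_rate * 0.5), dtype=np.float32)

    def start(self):
        # Keep a reference to the PyAudio instance so stop() can release it.
        self.pa = pyaudio.PyAudio()
        self.stream = self.pa.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunk_samples,
            stream_callback=self._callback
        )
        self.stream.start_stream()

    def _callback(self, input_data, frame_count, time_info, status):
        # Convert 16-bit PCM to float32 in [-1.0, 1.0).
        audio = np.frombuffer(input_data, dtype=np.int16)
        audio = audio.astype(np.float32) / 32768.0
        with torch.no_grad():
            speech_prob = self.model(
                torch.from_numpy(audio).unsqueeze(0),
                self.sample_rate
            ).item()
        # Hand speech chunks to consumers; non-speech chunks are only buffered.
        if speech_prob >= self.threshold:
            self.audio_queue.put(audio)
        # Keep a rolling buffer of the most recent 5 seconds of audio.
        self.buffer = np.append(self.buffer, audio)
        if len(self.buffer) > self.sample_rate * 5:
            self.buffer = self.buffer[-self.sample_rate * 5:]
        return (input_data, pyaudio.paContinue)

    def stop(self):
        self.stream.stop_stream()
        self.stream.close()
        self.pa.terminate()


vad = VADAudio(aggressiveness=3)
vad.start()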
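The aggressiveness parameter (0-3) follows the convention of WebRTC-style detectors, where higher values reject more non-speech frames. Silero VAD has no such setting: it returns a speech probability for each chunk, and sensitivity is set by the threshold applied to that probability. In the class above, the aggressiveness value is stored but never passed to the model. Experiment with the threshold to find the setting that matches your background noise level.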
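Chunk-based processing introduces latency: the model cannot score a chunk until all of its samples have arrived. Longer chunks give the model more context and can improve accuracy, but they delay the response to speech onset. A 1536-sample frame spans 96 ms at 16 kHz, a reasonable balance for conversational applications. The stream above, however, requests 512-sample buffers (32 ms), so that is the chunk size the model actually receives. Recent Silero VAD releases are understood to accept only 512-sample chunks at 16 kHz, so check the installed version before changing it.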
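The latency cost of a chunk size is simple arithmetic: samples divided by sample rate. The short sketch below uses a hypothetical helper named chunk_latency_ms to print the buffering delay for three chunk sizes at 16 kHz. It covers only the time spent waiting for a chunk to fill, not model inference or audio driver delay.

```python
def chunk_latency_ms(chunk_samples, sample_rate=16000):
    """Return the time in milliseconds needed to collect one chunk of audio."""
    return 1000.0 * chunk_samples / sample_rate


for n in (512, 1024, 1536):
    print(f"{n} samples -> {chunk_latency_ms(n):.0f} ms")
# 512 samples -> 32 ms
# 1024 samples -> 64 ms
# 1536 samples -> 96 ms
```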
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Exercise: Implement logging to record the speech probability scores, then plot them over time while alternating between speaking and remaining silent. Use the plot to identify a threshold that cleanly separates speech from silence. (15 minutes)