Speaker Diarization
Identifying who-spoke-when in multi-speaker audio. PyAnnote is the open-weight default.
Setup walkthrough
- pip install pyannote.audio (PyAnnote, the standard open-weight speaker diarization library).
- Accept PyAnnote's license on HuggingFace (huggingface.co/pyannote/speaker-diarization-3.1, plus the pyannote/segmentation-3.0 model it depends on) and generate an access token.
- Python script:
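from pyannote.audio import Pipeline
# Download the pretrained pipeline (needs the HuggingFace token from the previous step)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN")
# Run diarization on a local audio file
diarization = pipeline("meeting.wav")
# Each track is one speech turn with a cluster label such as SPEAKER_00
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"Speaker {speaker}: {turn.start:.1f}s - {turn.end:.1f}s")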
- First diarization result in 10-30 seconds for a 10-minute meeting on CPU. The model processes audio at ~20-50× real-time.
- For the complete pipeline (who said what): combine PyAnnote diarization + WhisperX transcription → WhisperX aligns transcript words to timestamps → match timestamps to PyAnnote speaker segments → labeled transcript.
- WhisperX integrates both steps. Install it with pip install whisperx, then run: whisperx meeting.wav --model large-v3 --diarize --hf_token YOUR_TOKEN
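The matching step in the full pipeline (word timestamps to speaker segments) can be sketched in plain Python. This is an illustrative sketch, not WhisperX's or PyAnnote's own code: the Segment and Word types, the function names, and the sample data are all made up for the example, and it assumes each word should go to the speaker whose segment overlaps it the most.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start: float   # seconds
    end: float
    speaker: str   # diarizer label, e.g. "SPEAKER_00"


class Word(NamedTuple):
    start: float   # seconds
    end: float
    text: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, segments):
    """Label each word with the speaker whose segment overlaps it most."""
    labeled = []
    for w in words:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for s in segments:
            o = overlap(w.start, w.end, s.start, s.end)
            if o > best_overlap:
                best_speaker, best_overlap = s.speaker, o
        labeled.append((best_speaker, w.text))
    return labeled


def merge_turns(labeled):
    """Join consecutive words from the same speaker into one transcript line."""
    turns = []
    for speaker, text in labeled:
        if turns and turns[-1][0] == speaker:
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns


# Hypothetical diarization output and word-level timestamps
segments = [Segment(0.0, 2.0, "SPEAKER_00"), Segment(2.0, 4.0, "SPEAKER_01")]
words = [
    Word(0.2, 0.6, "Hello"),
    Word(0.7, 1.1, "everyone"),
    Word(1.9, 2.4, "Thanks"),   # straddles the boundary, mostly in the second segment
    Word(3.0, 3.3, "Alice"),
]

transcript = merge_turns(assign_speakers(words, segments))
for speaker, text in transcript:
    print(f"{speaker}: {text}")
```

Words that overlap no segment keep the placeholder label UNKNOWN, which makes gaps in the diarization visible instead of silently attaching them to a neighbor.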
The cheap setup
PyAnnote diarization runs entirely on CPU at 20-50× real-time, so any $300 laptop diarizes a 1-hour meeting in about 1-3 minutes. The full pipeline (WhisperX + diarization) adds the STT step, which is the slow part: Whisper large-v3 runs far slower on CPU than diarization does. For GPU-accelerated STT, a used GTX 1060 6 GB ($60) brings WhisperX to 3-5× real-time, so a 1-hour meeting transcribes and diarizes in ~15 minutes. Total build: ~$360. Diarization is CPU-friendly; the STT stage benefits from GPU.
The serious setup
Used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb) handles the full WhisperX + diarization pipeline. A 1-hour meeting transcribes + diarizes in ~5-8 minutes (Whisper large-v3 at 15-20× real-time + diarization at 20-50×). For meeting-intelligence platforms processing 100+ hours/day: batch pipeline on multiple GPUs. Total build: ~$700-900. For enterprise meeting transcription (Zoom/Teams integration), the compute is light — budget for audio storage and the integration layer. Diarization accuracy plateaus at ~90-95% regardless of GPU — model improvements, not hardware, are the bottleneck.
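The timing figures for both setups follow from dividing audio length by each stage's real-time factor. A minimal sketch of that arithmetic, assuming the STT and diarization stages run one after the other (the function names are illustrative, not part of any library):

```python
def stage_minutes(audio_minutes, speedup):
    """Processing time for one stage running at `speedup` times real-time."""
    return audio_minutes / speedup


def pipeline_minutes(audio_minutes, stt_speedup, diarization_speedup):
    """Total time when STT and diarization run sequentially."""
    return (stage_minutes(audio_minutes, stt_speedup)
            + stage_minutes(audio_minutes, diarization_speedup))


# 1-hour meeting, using the RTX 3060 ranges quoted above
slow = pipeline_minutes(60, stt_speedup=15, diarization_speedup=20)
fast = pipeline_minutes(60, stt_speedup=20, diarization_speedup=50)
print(f"Estimated range: {fast:.1f} to {slow:.1f} minutes")
```

The estimate covers compute only; model loading and audio decoding add time on top, which is consistent with the slightly higher ~5-8 minute figure.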
Common beginner mistake
The mistake: Running diarization on a meeting recording, getting "Speaker 1, Speaker 2, Speaker 3" labels, and presenting it as "automated meeting notes with speaker identification."
Why it fails: PyAnnote identifies speech segments and clusters them by voice similarity. It labels them SPEAKER_00, SPEAKER_01, not "Alice, Bob, Charlie." The diarizer doesn't know names. If you present generic labels as identification, you haven't identified anyone.
The fix: Add a speaker enrollment step. Record 30 seconds of each participant speaking (or extract from previous meetings). Enroll these voice prints. When diarizing, compare each speaker cluster to the enrolled voice prints → map SPEAKER_00 → "Alice." Without enrollment, you have speaker separation (who spoke when), not speaker identification (who is who). Separation is useful; identification requires enrollment.
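The enrollment comparison can be sketched as a nearest-match lookup over voice-print vectors. This is a simplified illustration, not PyAnnote's API: it assumes you already have one embedding per enrolled person and one per diarized cluster (in practice these would come from a speaker-embedding model), and the 3-dimensional vectors, the identify function, and the 0.7 threshold are hypothetical values chosen for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(cluster_embeddings, enrolled, threshold=0.7):
    """Map each diarizer label to the best-matching enrolled name.

    Clusters whose best match scores below `threshold` keep their
    generic label, since nobody enrolled sounds enough like them.
    """
    mapping = {}
    for label, embedding in cluster_embeddings.items():
        best_name, best_score = None, -1.0
        for name, reference in enrolled.items():
            score = cosine_similarity(embedding, reference)
            if score > best_score:
                best_name, best_score = name, score
        mapping[label] = best_name if best_score >= threshold else label
    return mapping


# Toy voice prints (real embeddings have hundreds of dimensions)
enrolled = {"Alice": [1.0, 0.0, 0.0], "Bob": [0.0, 1.0, 0.0]}
clusters = {
    "SPEAKER_00": [0.9, 0.1, 0.0],
    "SPEAKER_01": [0.1, 0.95, 0.05],
    "SPEAKER_02": [0.0, 0.1, 1.0],   # an unenrolled guest
}

names = identify(clusters, enrolled)
print(names)
```

Keeping a threshold matters: without it, an unenrolled guest would be forced onto whichever enrolled voice happens to be closest.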
Reality check
Audio models are surprisingly forgiving on hardware. Whisper, Coqui, and whisper.cpp all run well on 8-12 GB GPUs. The bottleneck is rarely the GPU; it's audio preprocessing and disk I/O for batch transcription.
Common mistakes
- Overspending on GPU for audio-only workflows (8-12 GB is enough for Whisper)
- Running audio + LLM concurrently without budgeting VRAM
- Using fp32 weights when fp16 / int8 give a 2-3x speedup with little to no quality loss
- Forgetting audio preprocessing eats CPU cycles — a fast SSD helps more than expected
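The VRAM-budgeting and precision points can be checked with rough arithmetic before loading anything. A back-of-the-envelope sketch, assuming weights dominate memory use: Whisper large-v3 has roughly 1.55 billion parameters, while the 7-billion-parameter LLM and the 2 GB allowance for activations and runtime overhead are hypothetical figures for the example.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weights_gb(num_params, precision):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def budget(vram_gb, models, overhead_gb=2.0):
    """Return (fits, needed_gb) for a list of (num_params, precision) pairs."""
    needed = sum(weights_gb(n, p) for n, p in models) + overhead_gb
    return needed <= vram_gb, needed


WHISPER_LARGE_V3 = 1.55e9   # approximate parameter count
LLM_7B = 7e9                # hypothetical companion LLM

# Whisper in fp16 next to an int8 LLM on a 12 GB card
fits_fp16, needed_fp16 = budget(12, [(WHISPER_LARGE_V3, "fp16"), (LLM_7B, "int8")])
# Same pair with Whisper also quantized to int8
fits_int8, needed_int8 = budget(12, [(WHISPER_LARGE_V3, "int8"), (LLM_7B, "int8")])

print(f"fp16 Whisper: {needed_fp16:.2f} GB needed, fits={fits_fp16}")
print(f"int8 Whisper: {needed_int8:.2f} GB needed, fits={fits_int8}")
```

The estimate ignores KV cache growth and batch size, so treat a narrow pass as a warning rather than a guarantee.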
Before you buy
Verify your specific hardware can handle speaker diarization before committing money.