Speaker Diarization

Determining who spoke when in multi-speaker audio. PyAnnote is the open-weight default.

Setup walkthrough

  1. pip install pyannote.audio (PyAnnote, the standard open-weight speaker diarization library).
  2. Accept PyAnnote's license on HuggingFace (huggingface.co/pyannote/speaker-diarization-3.1, plus pyannote/segmentation-3.0, which the pipeline depends on) and generate an access token.
  3. Python script:
       from pyannote.audio import Pipeline
       # Load the gated pipeline with the access token generated in step 2.
       pipeline = Pipeline.from_pretrained(
           "pyannote/speaker-diarization-3.1",
           use_auth_token="YOUR_HF_TOKEN")
       # Run diarization on a local audio file.
       diarization = pipeline("meeting.wav")
       # Each track yields a time segment, a track id, and a speaker label.
       for turn, _, speaker in diarization.itertracks(yield_label=True):
           print(f"Speaker {speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
  4. Expect the first diarization result in 10-30 seconds for a 10-minute meeting on CPU; the model processes audio at ~20-50× real-time.
  5. For the complete pipeline (who said what): combine PyAnnote diarization + WhisperX transcription → WhisperX aligns transcript words to timestamps → match word timestamps to PyAnnote speaker segments → labeled transcript.
  6. One-command alternative: pip install whisperx integrates both steps. Run: whisperx meeting.wav --model large-v3 --diarize --hf_token YOUR_TOKEN.
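
The matching step in the complete pipeline (word timestamps to speaker segments) can be sketched in plain Python. This is a minimal illustration, not WhisperX's actual implementation: the tuple shapes, the function names, and the "UNKNOWN" fallback are assumptions made for the example, and the real PyAnnote and WhisperX objects look different. Each word goes to the speaker segment it overlaps most in time.

```python
from typing import List, Tuple

# Hypothetical data shapes for illustration only.
Segment = Tuple[float, float, str]  # (start_s, end_s, speaker_label)
Word = Tuple[float, float, str]     # (start_s, end_s, text)


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words: List[Word], segments: List[Segment],
                unknown: str = "UNKNOWN") -> List[Tuple[str, str]]:
    """Assign each word to the speaker segment it overlaps the most."""
    labeled = []
    for w_start, w_end, text in words:
        best_label, best_overlap = unknown, 0.0
        for s_start, s_end, speaker in segments:
            ov = overlap(w_start, w_end, s_start, s_end)
            if ov > best_overlap:
                best_label, best_overlap = speaker, ov
        labeled.append((best_label, text))
    return labeled


def merge_turns(labeled: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Join consecutive words from the same speaker into one turn."""
    turns: List[Tuple[str, str]] = []
    for speaker, text in labeled:
        if turns and turns[-1][0] == speaker:
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns


# Toy inputs standing in for diarization segments and aligned words.
segments = [(0.0, 2.0, "SPEAKER_00"), (2.0, 4.0, "SPEAKER_01")]
words = [(0.1, 0.5, "Hello"), (0.6, 1.0, "everyone"),
         (2.2, 2.6, "Hi"), (2.7, 3.1, "there")]

transcript = merge_turns(label_words(words, segments))
for speaker, text in transcript:
    print(f"{speaker}: {text}")
```

Maximum overlap is a simple rule that copes with words straddling a speaker change; words that fall in a gap between segments keep the fallback label rather than being guessed.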

The cheap setup

PyAnnote diarization runs entirely on CPU at 20-50× real-time, so any $300 laptop diarizes a 1-hour meeting in roughly 1-3 minutes. The full pipeline (WhisperX + diarization) adds the STT step, and Whisper large-v3 is much slower on CPU than the diarizer, so transcription dominates the run time. For GPU-accelerated STT, a used GTX 1060 6 GB ($60) brings WhisperX to 3-5× real-time: a 1-hour meeting transcribes + diarizes in ~15 minutes. Total build: ~$360. Diarization is CPU-friendly; the STT stage benefits from a GPU.
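
The timing figures above follow from dividing audio length by the real-time speed factor. A small sketch of that arithmetic, where the function name is made up for illustration and the speed factors are the ones quoted in this section rather than measurements:

```python
def processing_minutes(audio_minutes: float, speed_factor: float) -> float:
    """Wall-clock minutes to process audio at speed_factor times real-time."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_minutes / speed_factor


meeting = 60.0  # a 1-hour meeting, in minutes

# Diarization on CPU at 20-50x real-time.
diar_slow = processing_minutes(meeting, 20)
diar_fast = processing_minutes(meeting, 50)

# STT on a GTX 1060 at 3-5x real-time.
stt_slow = processing_minutes(meeting, 3)
stt_fast = processing_minutes(meeting, 5)

print(f"Diarization: {diar_fast:.1f}-{diar_slow:.1f} min")
print(f"STT on GPU:  {stt_fast:.1f}-{stt_slow:.1f} min")
```

The STT range brackets the ~15 minute figure, and the gap between the two stages is why the GPU budget goes to transcription rather than diarization.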

The serious setup

A used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb) handles the full WhisperX + diarization pipeline. A 1-hour meeting transcribes + diarizes in ~5-8 minutes (Whisper large-v3 at 15-20× real-time + diarization at 20-50×). Total build: ~$700-900. For meeting-intelligence platforms processing 100+ hours/day, run a batch pipeline across multiple GPUs. For enterprise meeting transcription (Zoom/Teams integration), the compute is light; budget for audio storage and the integration layer instead. Diarization accuracy plateaus at ~90-95% regardless of GPU: model improvements, not hardware, are the bottleneck.
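
For capacity planning, the two stages can be combined into one effective speed factor, assuming they run one after the other on the same machine. The sketch below is a rough estimate under that assumption; the function names and the 80% utilization figure are hypothetical, and the speed factors are the ones quoted above.

```python
import math


def pipeline_speed_factor(*stage_factors: float) -> float:
    """Effective real-time factor when stages run sequentially.

    Each stage takes 1/factor of the audio duration, so the times add up.
    """
    return 1.0 / sum(1.0 / f for f in stage_factors)


def gpus_needed(audio_hours_per_day: float, speed_factor: float,
                utilization: float = 0.8) -> int:
    """GPUs required to keep up with a daily audio volume.

    utilization is a hypothetical allowance for idle time and job overhead.
    """
    gpu_hours = audio_hours_per_day / speed_factor
    return math.ceil(gpu_hours / (24.0 * utilization))


slow = pipeline_speed_factor(15, 20)  # Whisper at 15x, diarization at 20x
fast = pipeline_speed_factor(20, 50)  # Whisper at 20x, diarization at 50x

print(f"1-hour meeting: {60 / fast:.1f}-{60 / slow:.1f} minutes")
print(f"GPUs for 100 h/day:  {gpus_needed(100, slow)}")
print(f"GPUs for 1000 h/day: {gpus_needed(1000, slow)}")
```

Under these assumptions a single card covers 100 hours of audio per day, which is consistent with the point that compute is the light part of the budget.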

Common beginner mistake

The mistake: Running diarization on a meeting recording, getting "Speaker 1, Speaker 2, Speaker 3" labels, and presenting it as "automated meeting notes with speaker identification."

Why it fails: PyAnnote identifies speech segments and clusters them by voice similarity. It labels them SPEAKER_00, SPEAKER_01, not "Alice, Bob, Charlie." The diarizer doesn't know names. If you present generic labels as identification, you haven't identified anyone.

The fix: Add a speaker enrollment step. Record 30 seconds of each participant speaking (or extract from previous meetings). Enroll these voice prints. When diarizing, compare each speaker cluster to the enrolled voice prints → map SPEAKER_00 → "Alice." Without enrollment, you have speaker separation (who spoke when), not speaker identification (who is who). Separation is useful; identification requires enrollment.
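
The comparison step in the fix can be sketched as a nearest-match lookup over embedding vectors. This is a minimal illustration under assumptions: it presumes you already have one embedding per enrolled speaker and one per diarized cluster (extracting them needs a speaker-embedding model, not shown here), the 3-dimensional vectors are toy data, and the function names and the 0.7 threshold are hypothetical values that would need tuning on real recordings.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speakers(clusters: Dict[str, List[float]],
                      enrolled: Dict[str, List[float]],
                      threshold: float = 0.7) -> Dict[str, str]:
    """Map each cluster label to the best-matching enrolled name.

    Clusters with no match at or above the threshold keep their generic label.
    """
    names = {}
    for label, embedding in clusters.items():
        best_name, best_score = label, threshold
        for name, voice_print in enrolled.items():
            score = cosine_similarity(embedding, voice_print)
            if score >= best_score:
                best_name, best_score = name, score
        names[label] = best_name
    return names


# Toy voice prints and cluster embeddings (real ones have hundreds of dimensions).
enrolled = {"Alice": [1.0, 0.0, 0.0], "Bob": [0.0, 1.0, 0.0]}
clusters = {
    "SPEAKER_00": [0.9, 0.1, 0.0],
    "SPEAKER_01": [0.1, 0.95, 0.05],
    "SPEAKER_02": [0.0, 0.1, 1.0],  # a guest who was never enrolled
}

mapping = identify_speakers(clusters, enrolled)
for label, name in mapping.items():
    print(f"{label} -> {name}")
```

Keeping the generic label for unmatched clusters avoids attributing a guest's words to an enrolled participant. This sketch does not stop two clusters from mapping to the same name; a real system would need to resolve that case.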

Reality check

Audio models are surprisingly forgiving on hardware. Whisper, Coqui TTS, and whisper.cpp all run well on 8-12 GB GPUs. The bottleneck is rarely the GPU; it's audio preprocessing and disk I/O for batch transcription.

Common mistakes

  • Overspending on GPU for audio-only workflows (8-12 GB is enough for Whisper)
  • Running audio + LLM concurrently without budgeting VRAM
  • Using fp32 weights when fp16 / int8 give a 2-3x speedup with little to no quality loss
  • Forgetting that audio preprocessing eats CPU cycles and batch jobs are disk-bound; a fast SSD helps more than expected
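
The VRAM-budgeting and precision points can be made concrete with weights-only arithmetic. The sketch below is a rough estimate under assumptions: it counts model weights only (activations, caches, and framework overhead are lumped into a hypothetical headroom value), uses an approximate parameter count of 1.55 billion for Whisper large, and uses a made-up 8 GiB footprint for the LLM.

```python
from typing import List

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}
GIB = 1024 ** 3


def weight_gib(num_params: float, precision: str) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / GIB


def fits_in_vram(vram_gib: float, footprints_gib: List[float],
                 headroom_gib: float = 1.5) -> bool:
    """True if all footprints plus a safety headroom fit on the card."""
    return sum(footprints_gib) + headroom_gib <= vram_gib


WHISPER_LARGE_PARAMS = 1.55e9  # approximate parameter count
LLM_GIB = 8.0                  # hypothetical footprint of a co-resident LLM

for precision in ("fp32", "fp16", "int8"):
    size = weight_gib(WHISPER_LARGE_PARAMS, precision)
    print(f"Whisper large weights in {precision}: {size:.2f} GiB")

whisper_fp16 = weight_gib(WHISPER_LARGE_PARAMS, "fp16")
print("Whisper alone on 12 GiB:", fits_in_vram(12, [whisper_fp16]))
print("Whisper + LLM on 12 GiB:", fits_in_vram(12, [whisper_fp16, LLM_GIB]))
```

Halving the precision halves the weight footprint, which is why an audio-only workload fits comfortably on an 8-12 GB card while an audio + LLM combination needs an explicit budget.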

What breaks first

The errors most operators hit when running speaker diarization locally. Each links to a diagnose+fix walkthrough.

Before you buy

Verify your specific hardware can handle speaker diarization before committing money.
