Whisper Installation — Voice AI with Local Models (Chapter 2)

Running Whisper locally requires a Python environment, the PyTorch backend, and the model weights. Most installation trouble comes from keeping three things compatible: the NVIDIA driver, the CUDA version that a given PyTorch build targets, and the PyTorch version itself.

Begin by confirming that the NVIDIA driver is installed and can see the GPU:

nvidia-smi

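Check the CUDA version in the top-right corner of the output. This is the highest CUDA version the installed driver supports, not the version of any installed toolkit, so choose a PyTorch build that targets the same or an earlier CUDA version. The PyTorch pip wheels ship their own CUDA runtime, so a separate toolkit install is usually unnecessary. Consult the PyTorch installation page for the currently published builds and the matching install command; PyTorch 2.x releases have shipped builds for CUDA 11.8 and 12.1, among others.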

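Install PyTorch with CUDA support. The cu121 suffix in the index URL selects the CUDA 12.1 build; change it to match the build chosen above: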

pip install torch --index-url https://download.pytorch.org/whl/cu121
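
Once the install finishes, it can be worth confirming from Python that the PyTorch build in use is a CUDA build and that it can see the GPU. The following is a minimal sketch rather than an official check; the function name describe_torch_cuda and the report keys are illustrative choices, and the function reports torch as missing instead of raising when it is not installed.

```python
import importlib.util


def describe_torch_cuda():
    """Return a dict describing whether PyTorch is installed and can see a GPU."""
    report = {
        "torch_installed": False,
        "torch_version": None,
        "cuda_build": None,      # CUDA version the wheel was built against
        "cuda_available": False,
        "device_name": None,
    }
    # Avoid an ImportError so the check also works before PyTorch is installed.
    if importlib.util.find_spec("torch") is None:
        return report

    import torch

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    if torch.cuda.is_available():
        report["cuda_available"] = True
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in describe_torch_cuda().items():
        print(f"{key}: {value}")
```

If cuda_build prints None, pip most likely installed a CPU-only wheel, and the install command should be rerun with the CUDA index URL.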

Install Whisper and the audio processing libraries:

pip install openai-whisper pyaudio numpy

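The openai-whisper package provides the model code and a command-line tool. It does not bundle the weights; they are downloaded on the first call to whisper.load_model and cached locally. Transcription and translation into English both go through model.transcribe, with translation selected by passing task="translate". The pyaudio library is a binding for PortAudio and handles microphone input streams.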

Test basic transcription functionality:

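import whisper

# Downloads the "base" weights on first use, then loads them from the local cache.
model = whisper.load_model("base")
# Replace test_audio.wav with the path to any short recording.
result = model.transcribe("test_audio.wav")
print(result["text"])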

A common failure mode is a missing FFmpeg binary. Whisper calls the ffmpeg executable to decode audio files, so it must be on the PATH. Install it with a package manager:

# Debian/Ubuntu
sudo apt update && sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Windows (via scoop)
scoop install ffmpeg
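
Before retrying a failed transcription, it can help to confirm from Python that the ffmpeg executable is actually visible on the PATH of the environment running Whisper. The sketch below uses only the standard library; the helper names find_ffmpeg and ffmpeg_version_line are illustrative and not part of Whisper.

```python
import shutil
import subprocess


def find_ffmpeg():
    """Return the full path of the ffmpeg executable on PATH, or None if absent."""
    return shutil.which("ffmpeg")


def ffmpeg_version_line(path):
    """Return the first line printed by `ffmpeg -version` for the given executable."""
    completed = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=True
    )
    return completed.stdout.splitlines()[0]


if __name__ == "__main__":
    path = find_ffmpeg()
    if path is None:
        print("ffmpeg not found on PATH; install it before calling model.transcribe()")
    else:
        print(f"found {path}")
        print(ffmpeg_version_line(path))
```

Running this from the same virtual environment and shell used for Whisper matters, since a terminal opened before the install may still hold the old PATH.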

Another failure mode is loading a model that does not fit in GPU memory. The medium and large models need roughly 5 GB and 10 GB of VRAM respectively. Fall back to the tiny or base model while debugging memory issues.
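
The fallback rule can be sketched as a small helper that picks the largest model expected to fit in the free VRAM. This is an illustration under stated assumptions: the medium and large figures come from the text above, the tiny, base, and small figures are the approximate values listed in the openai-whisper README, and the 1 GB headroom default and the names pick_model and free_vram_gb are hypothetical choices made for this example.

```python
import importlib.util

# Approximate VRAM needs in GB, largest first (README figures; treat as rough guides).
APPROX_VRAM_GB = [("large", 10), ("medium", 5), ("small", 2), ("base", 1), ("tiny", 1)]


def pick_model(free_gb, headroom_gb=1.0):
    """Return the largest model name whose approximate need fits in free_gb.

    headroom_gb is an assumed safety margin for activations and other processes.
    """
    usable = free_gb - headroom_gb
    for name, need_gb in APPROX_VRAM_GB:
        if need_gb <= usable:
            return name
    return "tiny"  # Nothing fits comfortably; start with the smallest model.


def free_vram_gb():
    """Return free memory on the current CUDA device in GB, or None without a GPU."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    if not torch.cuda.is_available():
        return None
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes / 1024**3


if __name__ == "__main__":
    free = free_vram_gb()
    if free is None:
        print("No CUDA device detected; 'tiny' or 'base' on CPU is a reasonable start")
    else:
        print(f"{free:.1f} GB free, try '{pick_model(free)}'")
```

With the default headroom, pick_model(6.0) returns "medium" and pick_model(2.0) returns "base". The returned name can be passed directly to whisper.load_model.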

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package versions, runtime, audio file path, and observed output. If the result depends on model size, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
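
One way to capture that record is a short script that collects the versions and paths into a JSON document. The sketch below uses only the standard library; the field names and the environment_record helper are illustrative, and packages that are not installed are recorded as null rather than raising an error.

```python
import json
import platform
import sys
from importlib import metadata

# Distribution names as installed with pip in this chapter.
PACKAGES = ("openai-whisper", "torch", "numpy", "PyAudio")


def package_version(name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def environment_record(model_name, audio_path):
    """Collect the details worth noting beside a transcription exercise."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in PACKAGES},
        "model": model_name,
        "audio_path": audio_path,
    }


if __name__ == "__main__":
    record = environment_record("base", "test_audio.wav")
    print(json.dumps(record, indent=2))
```

Saving the printed JSON next to the transcript makes it easier to explain later why two runs of the same exercise produced different text.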