02. Whisper Installation
Whisper audio processing requires Python dependencies, the PyTorch backend, and model weights. Installation complexity stems from CUDA compatibility requirements between PyTorch, the NVIDIA driver, and CUDA toolkit versions.
Begin by verifying CUDA availability:
nvidia-smi
Check CUDA version in the output's top-right corner. Consult the PyTorch compatibility matrix to select the correct installation command. PyTorch 2.x generally supports CUDA 11.8 and 12.1.
Install PyTorch with CUDA support:
pip install torch --index-url https://download.pytorch.org/whl/cu121
Install whisper and audio processing libraries:
pip install openai-whisper pyaudio numpy
The openai-whisper package bundles the transcription model and provides both transcribe and translate functions. The pyaudio library handles microphone input streams.
Test basic transcription functionality:
import whisper
model = whisper.load_model("base")
result = model.transcribe("test_audio.wav")
print(result["text"])
A common failure mode involves FFmpeg absence. Whisper requires FFmpeg for audio format conversion. Install via package manager:
# Debian/Ubuntu
sudo apt update && sudo apt install ffmpeg
# macOS
brew install ffmpeg
# Windows (via scoop)
scoop install ffmpeg
Another failure mode involves loading models on insufficient GPU memory. The medium and large models require 5GB and 10GB of VRAM respectively. Fall back to the tiny or base models when debugging memory issues.
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Install Whisper and dependencies. Transcribe a 30-second audio clip. Verify GPU acceleration is active by monitoring nvidia-smi during transcription. (15 minutes)