RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Specialized domains

Speech Processing

Speech processing refers to the analysis, synthesis, and manipulation of human speech by AI models. Operators encounter it when running automatic speech recognition (ASR) to transcribe audio, text-to-speech (TTS) to generate spoken output, or voice activity detection (VAD) to segment audio streams. On local hardware, these tasks are typically handled by specialized models such as Whisper (ASR) or Coqui TTS, which must fit within available VRAM and, for interactive use, keep up with real-time audio. Speech processing pipelines often combine multiple models (e.g., VAD to isolate speech, ASR to transcribe, then a language model to interpret), and each stage adds latency and memory overhead.
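To make the "VAD to segment audio streams" step concrete, here is a minimal sketch of a toy energy-threshold detector. It is not how neural VADs such as Silero work; it is a deliberately simple stand-in. The function names, the 160-sample frame size, and the 0.02 threshold are all assumptions chosen for illustration.

```python
import math

def frame_rms(samples):
    """Root-mean-square amplitude of one frame of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def detect_speech_segments(samples, frame_size=160, threshold=0.02):
    """Return (start, end) sample indices for runs of frames whose RMS exceeds threshold.

    Toy energy-based VAD: frame_size and threshold are illustrative assumptions.
    Only full frames are examined; a trailing partial frame is ignored.
    """
    segments = []
    start = None
    for offset in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[offset:offset + frame_size]
        is_speech = frame_rms(frame) > threshold
        if is_speech and start is None:
            start = offset                      # a loud run begins
        elif not is_speech and start is not None:
            segments.append((start, offset))    # the loud run ended at this frame
            start = None
    if start is not None:
        # Audio ended while still loud: close the run at the last full frame.
        segments.append((start, (len(samples) // frame_size) * frame_size))
    return segments

# Synthetic test signal at 16 kHz: 0.1 s silence, 0.2 s of a 440 Hz tone, 0.1 s silence.
SAMPLE_RATE = 16000
silence = [0.0] * 1600
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(3200)]
audio = silence + tone + silence

segments = detect_speech_segments(audio)
print(segments)
```

A real pipeline would pass only the detected segments to the ASR model, which is where the latency and compute savings come from.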

Deeper dive

Speech processing encompasses several sub-tasks: automatic speech recognition (ASR) converts audio waveforms into text; text-to-speech (TTS) generates audio from text; voice activity detection (VAD) identifies speech segments; and speaker diarization assigns speech to different speakers. Modern local approaches use neural models such as Whisper (ASR) or VITS (TTS), which run on GPU or CPU through runtimes such as whisper.cpp, Coqui TTS, or Hugging Face Transformers. Operators must consider model size (Whisper large-v3 needs roughly 3 GB of VRAM for its weights at FP16) and real-time factor (RTF), the ratio of processing time to audio duration, where RTF < 1 means faster than real time. Quantization (e.g., Q4_0) reduces memory but may degrade accuracy. Latency is critical for interactive use; batch processing (e.g., transcribing a recording) can tolerate higher latency. Speech pipelines often chain VAD + ASR + optional LLM for command understanding, with each step adding latency and memory pressure.
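The model-size figure above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes Whisper large-v3 has about 1.55 billion parameters and that Q4_0 costs about 4.5 bits per weight (4-bit values plus one FP16 scale per block of 32 weights). The helper name is hypothetical, and the result covers weights only, not activations or runtime buffers.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumption: Whisper large-v3 has roughly 1.55 billion parameters.
LARGE_V3_PARAMS = 1.55e9

fp16_gb = weight_memory_gb(LARGE_V3_PARAMS, 16)   # 16 bits per weight
q4_0_gb = weight_memory_gb(LARGE_V3_PARAMS, 4.5)  # Q4_0: 4-bit values + one FP16 scale per 32 weights

print(f"FP16 weights: ~{fp16_gb:.1f} GB")
print(f"Q4_0 weights: ~{q4_0_gb:.1f} GB")
```

Treat the quantized figure as a lower bound: quantized model files commonly keep some tensors at higher precision, and the runtime needs additional memory on top of the weights.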

Practical example

An operator runs Whisper large-v3 via whisper.cpp on an RTX 3060 (12 GB VRAM). The model takes ~3 GB at FP16 (less when quantized), leaving ample room for the 30-second audio window Whisper processes at a time. Transcribing a 1-hour lecture at Q4 quantization takes ~30 minutes of wall-clock time (RTF ~0.5). If the operator switches to Whisper medium (1.5 GB), RTF drops to ~0.2, about 12 minutes for the same lecture, but word error rate on accented speech rises from 5% to 10%.
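The RTF arithmetic in this example can be reproduced in a few lines. This is a sketch using the figures quoted above; the function names are hypothetical.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration. Values below 1 are faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds

def wall_clock_minutes(rtf: float, audio_minutes: float) -> float:
    """Expected processing time for a recording, given a measured RTF."""
    return rtf * audio_minutes

lecture_minutes = 60

# large-v3: 30 minutes of processing for a 60-minute lecture.
large_rtf = real_time_factor(30 * 60, lecture_minutes * 60)

# medium: an RTF of ~0.2 applied to the same lecture.
medium_minutes = wall_clock_minutes(0.2, lecture_minutes)

print(f"large-v3 RTF: {large_rtf:.2f}")
print(f"medium wall-clock: {medium_minutes:.0f} min")
```

Measuring RTF on a short, representative clip before committing to a long batch job gives a quick estimate of total run time for a given model and quantization level.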

Workflow example

In a local voice assistant pipeline, the operator uses Silero VAD (via the silero-vad Python library) to detect speech segments, then pipes audio chunks to whisper.cpp for transcription. The transcribed text is sent to a local LLM (e.g., Llama 3.1 8B via Ollama) for intent parsing. The LLM response is fed to Coqui TTS for spoken output. Each step is monitored for latency: VAD < 50 ms, ASR ~200 ms per 5-second chunk, LLM ~1 s, TTS ~300 ms, for a total of roughly 1.55 s. If total latency exceeds 2 s, the operator may switch to a smaller ASR model or quantize the LLM to Q4_K_M.
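The latency budget in this workflow can be tracked with a small helper. The sketch below uses the per-stage figures quoted above as fixed inputs; in practice they would come from timing each stage. The `Stage` class, function names, and constants are hypothetical and not part of any of the tools mentioned.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float

# End-to-end budget from the workflow above: 2 seconds.
BUDGET_MS = 2000

# Per-stage latencies as quoted in the text (upper bound used for VAD).
PIPELINE = [
    Stage("VAD", 50),
    Stage("ASR", 200),
    Stage("LLM", 1000),
    Stage("TTS", 300),
]

def total_latency_ms(stages):
    """Sum of stage latencies, assuming the stages run strictly one after another."""
    return sum(stage.latency_ms for stage in stages)

def slowest_stage(stages):
    """The stage contributing the most latency, i.e. the first candidate to optimize."""
    return max(stages, key=lambda stage: stage.latency_ms)

def over_budget(stages, budget_ms=BUDGET_MS):
    """True when the pipeline's total latency exceeds the budget."""
    return total_latency_ms(stages) > budget_ms

total = total_latency_ms(PIPELINE)
bottleneck = slowest_stage(PIPELINE)
print(f"total: {total} ms, over budget: {over_budget(PIPELINE)}, slowest: {bottleneck.name}")
```

With these figures the LLM stage dominates, which is why quantizing the LLM is one of the two remedies the workflow suggests when the budget is exceeded.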

Reviewed by Fredoline Eruo. See our editorial policy.
