Audio Generation
Audio generation is the process of creating audio content (speech, music, or sound effects) with machine learning models. In local AI, operators run models such as Bark, MusicGen, or Stable Audio on their own hardware. These models generate audio from text prompts or other conditioning inputs. The main operator concern is VRAM usage: generating a few seconds of audio can require 4-8 GB of VRAM for smaller models, while larger models may need 12 GB or more. Latency also matters, because audio generation is typically slower than text generation and often takes tens of seconds to produce a short clip.
Deeper dive
Audio generation models typically use a two-stage pipeline: first, a language model or diffusion model generates a compressed audio representation (e.g., discrete tokens from a neural audio codec like EnCodec or SoundStream, or a continuous latent), then a decoder reconstructs the waveform. Popular local models include Meta's MusicGen (for music), Suno's Bark (for speech and sound effects), and Stability AI's Stable Audio (for music and sound). Operators running these models on consumer GPUs must consider reduced precision or quantization (e.g., FP16, 8-bit, or 4-bit weights to fit in VRAM) and prompt engineering to control output quality. Generation speed varies with model size and hardware: on an RTX 3090, MusicGen typically needs tens of seconds to produce 10 seconds of audio. MusicGen and Bark are both autoregressive, but Bark tends to be slower per second of audio because it chains several transformer stages before decoding. For real-time speech applications, lightweight text-to-speech models, such as those available through the Coqui TTS toolkit, are preferred.
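To make the first stage of the pipeline concrete, the sketch below estimates how many codec frames and tokens a clip requires. The default values (32 kHz sample rate, 50 frames per second, 4 codebooks) are assumptions based on the EnCodec configuration commonly paired with MusicGen; other codecs and models use different numbers, and the function name is illustrative only.

```python
def codec_token_budget(seconds, sample_rate_hz=32_000, frame_rate_hz=50, num_codebooks=4):
    """Estimate the sequence sizes involved in codec-based audio generation.

    All defaults are assumptions (EnCodec as commonly used with MusicGen).
    """
    samples = round(seconds * sample_rate_hz)   # raw waveform samples the decoder must output
    frames = round(seconds * frame_rate_hz)     # time steps the generator model must produce
    tokens = frames * num_codebooks             # one discrete token per codebook per frame
    return {"samples": samples, "frames": frames, "tokens": tokens}


budget = codec_token_budget(10)
print(budget)
```

Under these assumptions a 10-second clip is 320,000 waveform samples but only 500 generation steps, which is why the generator works on codec tokens and leaves waveform reconstruction to the decoder.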
Practical example
On an RTX 3090 (24 GB VRAM), running the MusicGen 'melody' model at FP16 uses ~8 GB of VRAM and generates 10 seconds of music in about 30 seconds. 4-bit quantization reduces VRAM usage to ~3 GB but may slightly degrade quality. For speech, Bark at FP16 uses ~6 GB of VRAM and generates 5 seconds of speech in ~20 seconds. These figures are approximate and vary with clip length, software versions, and generation settings.
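The figures above can be turned into two quick back-of-the-envelope numbers: a real-time factor (generation time divided by clip length) and a weights-only memory estimate. This is a rough sketch, not a measurement. The 1.5 billion parameter count for MusicGen 'melody' is an assumption taken from its published model size, and the helper names are illustrative.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """Seconds of compute per second of audio; values above 1.0 are slower than real time."""
    return generation_seconds / audio_seconds


def weight_memory_gb(num_params, bits_per_param):
    """Memory for model weights only (decimal GB). Excludes activations, caches,
    the text encoder, the audio codec, and framework overhead."""
    return num_params * bits_per_param / 8 / 1e9


MELODY_PARAMS = 1.5e9  # assumption: approximate parameter count of MusicGen 'melody'

print(real_time_factor(30, 10))             # MusicGen: 10 s of music in about 30 s
print(real_time_factor(20, 5))              # Bark: 5 s of speech in about 20 s
print(weight_memory_gb(MELODY_PARAMS, 16))  # FP16 weights
print(weight_memory_gb(MELODY_PARAMS, 4))   # 4-bit weights
```

The weights-only estimate comes out well below the total VRAM figures quoted above. The gap is a reminder that weights are only part of the footprint, so treat the estimate as a lower bound rather than a prediction.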
Workflow example
LM Studio and Ollama are built around text-generation models and do not natively support audio generation models such as MusicGen. Instead, an operator can write a short Python script using the Transformers library to load a MusicGen checkpoint (e.g., 'facebook/musicgen-medium'), pass a prompt like 'upbeat electronic dance music with bass', and save the result as an audio file for playback. VRAM usage can be monitored during generation with a tool such as nvidia-smi.
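A minimal sketch of such a script follows, assuming the transformers, torch, and scipy packages are installed and that the checkpoint can be downloaded from the Hugging Face Hub. The helper names, the output path, and the 50 frames-per-second token rate are illustrative assumptions, not a fixed recipe.

```python
FRAME_RATE_HZ = 50  # assumption: MusicGen's codec produces about 50 token frames per second


def max_new_tokens_for(seconds, frame_rate_hz=FRAME_RATE_HZ):
    """Convert a target clip length into a generation token budget."""
    return int(round(seconds * frame_rate_hz))


def generate_music(prompt, seconds=10.0, model_id="facebook/musicgen-medium",
                   out_path="musicgen_out.wav"):
    """Generate a clip from a text prompt and write it to a WAV file."""
    # Heavy imports stay inside the function so loading this file remains cheap.
    import scipy.io.wavfile
    import torch
    from transformers import AutoProcessor, MusicgenForConditionalGeneration

    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = AutoProcessor.from_pretrained(model_id)
    model = MusicgenForConditionalGeneration.from_pretrained(model_id).to(device)

    # Tokenize the text prompt and move the tensors to the model's device.
    inputs = processor(text=[prompt], padding=True, return_tensors="pt").to(device)
    with torch.no_grad():
        audio = model.generate(**inputs, do_sample=True,
                               max_new_tokens=max_new_tokens_for(seconds))

    # Output shape is (batch, channels, samples); write the first clip's first channel.
    sampling_rate = model.config.audio_encoder.sampling_rate
    scipy.io.wavfile.write(out_path, rate=sampling_rate, data=audio[0, 0].cpu().numpy())
    return out_path
```

Calling generate_music("upbeat electronic dance music with bass") downloads the checkpoint on first use and writes the clip to the given path. To load the weights in half precision, one option is to pass torch_dtype=torch.float16 to from_pretrained, though results can differ slightly from full precision.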
Reviewed by Fredoline Eruo. See our editorial policy.