RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated

Audio Generation

Audio generation is the process of creating audio content (speech, music, or sound effects) with machine learning models. In local AI, operators run models such as Bark, MusicGen, or Stable Audio on their own hardware, generating audio from text prompts or other conditioning inputs. The main operator concern is VRAM: generating a few seconds of audio can take 4-8 GB for smaller models and 12 GB or more for larger ones. Latency matters too, since audio generation is typically slower than text generation and often takes tens of seconds to produce a short clip.
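For a rough sense of where VRAM figures like these come from, the sketch below computes a weights-only lower bound (parameter count times bytes per parameter). The 1.5-billion parameter count and the `headroom` rule of thumb are illustrative assumptions, not measurements; activations, caches, the text encoder, and the audio codec all add to the real footprint.

```python
# Weights-only VRAM estimate: parameters x bytes per parameter.
# This is a lower bound, not a measurement: activations, caches,
# the text encoder and the audio codec all need memory on top.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gib(num_params: float, precision: str) -> float:
    """Approximate size of the model weights in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 1024**3


def fits(num_params: float, precision: str, vram_gib: float,
         headroom: float = 0.5) -> bool:
    """True if the weights take at most `headroom` of the card's VRAM.

    The 0.5 default is a hypothetical rule of thumb that leaves the
    other half for activations and caches.
    """
    return weights_gib(num_params, precision) <= vram_gib * headroom


# Illustrative parameter count (assumption): a 1.5-billion-parameter model.
for precision in ("fp16", "int8", "int4"):
    print(precision, round(weights_gib(1.5e9, precision), 2), "GiB")
```

The gap between this lower bound and the VRAM an operator actually observes is the runtime overhead, which is why measured numbers are the ones to trust when sizing a card.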

Deeper dive

Audio generation models typically use a two-stage pipeline: first, a language model or diffusion model generates a compressed audio representation (e.g., tokens from a neural audio codec such as EnCodec or SoundStream, or a continuous latent), then a decoder reconstructs the waveform. Popular local models include Meta's MusicGen (music), Suno's Bark (speech and sound effects), and Stability AI's Stable Audio (music and sound; the open-weights release is Stable Audio Open). Operators running these models on consumer GPUs must consider quantization (4-bit or 8-bit weights, to fit in VRAM) and prompt engineering to control output quality. Generation speed varies: on an RTX 3090, MusicGen runs slower than real time, on the order of 10 seconds of audio per 30 to 60 seconds of compute depending on model size and settings. Bark is slower still; like MusicGen it is autoregressive, but it chains several token-generation stages before the codec decoder. For real-time applications, lightweight text-to-speech models, such as those shipped with Coqui TTS, are preferred.
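To make the first stage of the pipeline concrete, the sketch below counts how many codec frames and tokens the generator has to produce for a clip of a given length. The defaults (32 kHz audio, 50 frames per second, 4 codebooks) are assumptions based on the codec configuration published for MusicGen; other models use different values, and `CodecConfig` is a hypothetical helper, not a library class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodecConfig:
    """Hypothetical description of a neural audio codec (assumed values)."""
    sample_rate_hz: int = 32_000  # waveform samples per second
    frame_rate_hz: int = 50       # codec frames per second
    num_codebooks: int = 4        # tokens emitted per frame


def frames_for(seconds: float, cfg: CodecConfig = CodecConfig()) -> int:
    """Codec frames the first stage must generate for a clip."""
    return round(seconds * cfg.frame_rate_hz)


def tokens_for(seconds: float, cfg: CodecConfig = CodecConfig()) -> int:
    """Total discrete tokens across all codebooks."""
    return frames_for(seconds, cfg) * cfg.num_codebooks


def samples_for(seconds: float, cfg: CodecConfig = CodecConfig()) -> int:
    """Waveform samples the decoder reconstructs in the second stage."""
    return round(seconds * cfg.sample_rate_hz)


clip = 10.0
print(frames_for(clip), "frames,", tokens_for(clip), "tokens,",
      samples_for(clip), "samples")
```

Under these assumptions the generator works on a sequence hundreds of times shorter than the raw waveform, which is what makes the two-stage design practical on a single consumer GPU.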

Practical example

On an RTX 3090 (24 GB VRAM), running the MusicGen 'melody' model at FP16 uses ~8 GB of VRAM and generates 10 seconds of music in about 30 seconds. 4-bit quantization reduces VRAM use to ~3 GB but may slightly degrade quality. For speech, Bark at FP16 uses ~6 GB of VRAM and generates 5 seconds of speech in ~20 seconds.
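The timings above can be expressed as a real-time factor (compute seconds per second of audio), which makes different models easier to compare. The sketch below uses only the figures quoted in this example; the linear scaling in `estimated_wait` is an assumption, since longer clips may not scale proportionally.

```python
def real_time_factor(compute_seconds: float, audio_seconds: float) -> float:
    """Compute seconds spent per second of generated audio (1.0 = real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return compute_seconds / audio_seconds


def estimated_wait(target_audio_seconds: float, rtf: float) -> float:
    """Projected wait for a longer clip, assuming time scales linearly."""
    return target_audio_seconds * rtf


# Figures quoted in the practical example (RTX 3090, FP16).
musicgen_rtf = real_time_factor(compute_seconds=30, audio_seconds=10)
bark_rtf = real_time_factor(compute_seconds=20, audio_seconds=5)

print("MusicGen RTF:", musicgen_rtf)
print("Bark RTF:", bark_rtf)
print("Projected wait for 30 s of MusicGen audio:",
      estimated_wait(30, musicgen_rtf), "s")
```

Any factor above 1.0 means the model cannot keep up with playback, which is why these models suit batch generation rather than live use.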

Workflow example

LM Studio and Ollama are built around text LLMs and, as far as we know, neither runs audio generation models such as MusicGen natively. The usual route is a custom Python script: load a checkpoint (e.g., 'facebook/musicgen-medium') with the Transformers library, pass a prompt like 'upbeat electronic dance music with bass', and save or play the resulting audio file. VRAM usage during generation can be watched with a tool such as nvidia-smi.
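A custom Transformers script of the kind mentioned above might look like the following minimal sketch. It assumes `torch`, `transformers`, and `scipy` are installed and that the checkpoint can be downloaded; `generate_clip`, `max_new_tokens_for`, the output filename, and the 50 frames-per-second figure are illustrative assumptions rather than anything this site ships.

```python
import math

FRAME_RATE_HZ = 50  # assumption: MusicGen emits about 50 codec frames per second


def max_new_tokens_for(seconds: float) -> int:
    """Convert a target clip length into a token budget for generate()."""
    return math.ceil(seconds * FRAME_RATE_HZ)


def generate_clip(prompt: str, seconds: float = 10.0,
                  model_id: str = "facebook/musicgen-medium",
                  out_path: str = "musicgen_out.wav") -> str:
    """Generate one clip from a text prompt and write it to a WAV file."""
    # Heavy imports live here so the helper above works without them installed.
    import scipy.io.wavfile
    import torch
    from transformers import AutoProcessor, MusicgenForConditionalGeneration

    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = AutoProcessor.from_pretrained(model_id)
    model = MusicgenForConditionalGeneration.from_pretrained(model_id).to(device)

    # Tokenize the prompt and move the tensors to the same device as the model.
    inputs = processor(text=[prompt], padding=True, return_tensors="pt").to(device)
    with torch.no_grad():
        audio = model.generate(**inputs, do_sample=True, guidance_scale=3.0,
                               max_new_tokens=max_new_tokens_for(seconds))

    # Output shape is (batch, channels, samples); write the first mono clip.
    sampling_rate = model.config.audio_encoder.sampling_rate
    scipy.io.wavfile.write(out_path, rate=sampling_rate,
                           data=audio[0, 0].cpu().numpy())
    return out_path


# Example call (downloads the checkpoint on first use):
# generate_clip("upbeat electronic dance music with bass", seconds=10)
```

Keeping the clip length as a parameter makes it easy to start with a few seconds while checking VRAM use, then raise it once the model is known to fit.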

Reviewed by Fredoline Eruo. See our editorial policy.
