Generative AI

Music Generation

Music generation refers to the use of AI models to produce audio or symbolic representations of music (e.g., MIDI, sheet music) from prompts or conditioning inputs. Operators encounter this through models like MusicGen (distributed as part of Meta's AudioCraft library) or Riffusion, which generate short clips (e.g., 10–30 seconds) of mostly instrumental music. These models typically run on consumer GPUs with 8–16 GB VRAM for small variants, but longer or higher-quality generation may require more memory or fall back to slower CPU offload. The output is often a WAV or MP3 file, and generation speed is usually reported in seconds per clip rather than tokens per second.
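Because speed is reported per clip, a simple wall-clock timer is enough to benchmark a setup. The sketch below is a minimal illustration, not tied to any particular model: `time_clip` and `fake_generate` are hypothetical names, and `fake_generate` only sleeps as a stand-in for a real generation call. The real-time factor is the ratio of generation time to clip length, so values above 1 mean generation is slower than playback.

```python
import time
from typing import Callable, Tuple


def time_clip(generate: Callable[[], object], clip_seconds: float) -> Tuple[float, float]:
    """Run one generation call and return (elapsed seconds, real-time factor)."""
    start = time.perf_counter()
    generate()
    elapsed = time.perf_counter() - start
    # Real-time factor: seconds of compute per second of generated audio.
    return elapsed, elapsed / clip_seconds


def fake_generate() -> None:
    # Hypothetical stand-in for a real model call; replace with actual generation.
    time.sleep(0.05)


elapsed, rtf = time_clip(fake_generate, clip_seconds=10.0)
print(f"{elapsed:.2f} s per clip, real-time factor {rtf:.3f}")
```

Swapping `fake_generate` for a real model call gives the seconds-per-clip figure for that hardware and clip length.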

Practical example

Using Meta's MusicGen 'small' model (300M parameters) on an RTX 3060 12 GB, generating a 10-second clip from the prompt 'upbeat electronic dance music' takes about 15–20 seconds. The model loads into ~4 GB VRAM, leaving room for a batch of 2–3 generations. On an Apple M1 Max with 32 GB unified memory, the same generation runs in ~10 seconds via MLX. Attempting the 'large' model (3.3B parameters) on the same RTX 3060 would likely exceed VRAM at full precision, since the weights alone come to roughly 13 GB at 4 bytes per parameter; this forces CPU offload and increases generation time to over a minute.
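The VRAM comparison above can be approximated with back-of-the-envelope arithmetic. The sketch below is a rough estimate under stated assumptions, not a measurement: it counts only the weights at a chosen number of bytes per parameter and adds a flat `overhead_gb` allowance (a hypothetical 2 GB here) for the text encoder, audio codec, and activations. The function names are illustrative.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 4) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def fits_in_vram(n_params: float, vram_gb: float,
                 bytes_per_param: int = 4, overhead_gb: float = 2.0) -> bool:
    """Rough check: weights plus an assumed fixed overhead must fit in VRAM."""
    return weight_memory_gb(n_params, bytes_per_param) + overhead_gb <= vram_gb


# Parameter counts taken from the example above; 12 GB matches the RTX 3060.
models = {"small": 300e6, "large": 3.3e9}
for name, n_params in models.items():
    gb = weight_memory_gb(n_params)
    verdict = "fits" if fits_in_vram(n_params, vram_gb=12.0) else "does not fit"
    print(f"{name}: ~{gb:.1f} GB of weights at 4 bytes/param, {verdict} in 12 GB")
```

Under these assumptions the small model fits comfortably and the large model does not. Passing `bytes_per_param=2` models half-precision weights, which changes the estimate considerably, so treat the result as a first filter rather than a guarantee.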

Workflow example

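MusicGen is normally run through Meta's AudioCraft library, which ships a Gradio demo UI, or through Hugging Face Transformers; LM Studio targets text LLMs and, to our knowledge, does not load MusicGen checkpoints or produce audio. In the demo UI, an operator selects a checkpoint (e.g., facebook/musicgen-small) and enters a text prompt like 'lo-fi hip hop beat with piano'. After the operator starts generation, the model processes the prompt and returns an audio preview. The operator can adjust parameters like 'duration' (e.g., 5–30 seconds) to set clip length and 'temperature' (e.g., 0.7–1.2) to control sampling randomness. The generated audio can be saved as a WAV file for further editing in a DAW.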
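The same prompt, duration, and temperature controls can also be scripted. The sketch below is one possible approach using the Hugging Face Transformers MusicGen classes, assuming `transformers`, `torch`, and `scipy` are installed and the `facebook/musicgen-small` checkpoint can be downloaded. The helper names `duration_to_tokens` and `generate_clip` are hypothetical, and the 50 tokens-per-second frame rate is an assumption based on the published MusicGen checkpoints.

```python
FRAME_RATE_HZ = 50   # assumed audio-token frame rate of MusicGen checkpoints
MAX_SECONDS = 30.0   # upper bound on clip length used in this sketch


def duration_to_tokens(seconds: float, frame_rate: int = FRAME_RATE_HZ) -> int:
    """Convert a clip duration into the number of audio tokens to generate."""
    if not 0 < seconds <= MAX_SECONDS:
        raise ValueError(f"duration must be in (0, {MAX_SECONDS}] seconds")
    return int(round(seconds * frame_rate))


def generate_clip(prompt: str, seconds: float = 10.0, temperature: float = 1.0,
                  out_path: str = "musicgen_out.wav",
                  model_id: str = "facebook/musicgen-small") -> str:
    """Generate one clip from a text prompt and save it as a WAV file."""
    # Heavy imports are kept local so the helper above works without them.
    import scipy.io.wavfile
    from transformers import AutoProcessor, MusicgenForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id)
    model = MusicgenForConditionalGeneration.from_pretrained(model_id)

    inputs = processor(text=[prompt], padding=True, return_tensors="pt")
    # Sampling with a temperature controls how random the output is.
    audio = model.generate(**inputs, do_sample=True, temperature=temperature,
                           max_new_tokens=duration_to_tokens(seconds))

    # Output shape is (batch, channels, samples); take the first mono track.
    sampling_rate = model.config.audio_encoder.sampling_rate
    scipy.io.wavfile.write(out_path, rate=sampling_rate,
                           data=audio[0, 0].cpu().numpy())
    return out_path
```

Calling `generate_clip("lo-fi hip hop beat with piano", seconds=10, temperature=1.0)` would download the checkpoint on first use and write a WAV file that can be opened in a DAW. This version runs on CPU by default; moving the model and inputs to a GPU is left out to keep the sketch short.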

Reviewed by Fredoline Eruo.