An update on Stable Audio 3. After my success with the Small versions, I tried Medium in its bf16 version (a small 4.6GB file, with slightly less precise output, and able to do both music and foley sound-effects). Medium can give up to six minutes of music or field-recording style sound-effects, and has better coherence than the two Small models.
It’s working well and fast, for me. So fast that it’s fun to use. The practical uses for the DAZ and Poser crowd are obvious: soundtracks and SFX for animations and games, and soundscapes for still pictures or slide-shows. It was trained on the vast Freesound SFX library, so it can do almost any sound, and they’re OK for commercial use.
The official advice is that SA3 Medium ‘requires Flash Attention 2’. I have that installed, but Comfy can’t be coaxed to use it. Even so, the Medium model still runs fine for me in ComfyUI without it. I use LCM as the sampler and Simple as the scheduler for Medium, whereas the Small models need PingPong as the sampler.
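If you want to check whether Flash Attention 2 is even visible to the Python that ComfyUI runs on, a quick test like the one below may help. It's only a minimal sketch: the helper name `has_module` is my own, and it assumes the package imports under the name `flash_attn`. It must be run with the same Python interpreter that ComfyUI uses, not your system-wide one, and it only tells you the package can be found, not that ComfyUI will actually use it.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if this Python interpreter can locate the module `name`."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised in some edge cases, e.g. a missing parent package.
        return False


# "flash_attn" is assumed to be the import name of the Flash Attention package.
for module in ("torch", "flash_attn"):
    status = "found" if has_module(module) else "not found"
    print(f"{module}: {status}")
```

If `flash_attn` shows as "not found" here, then it was probably installed into a different Python environment from the one ComfyUI is using.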
I tried restyling an existing track in Medium. A folk-music MIDI file was first converted to MP3 (Windows Media Player -> Audacity -> .MP3) and loaded. SA3 very easily veers off into ‘doing its own thing’, but having something in the prompt like “… and retain the exact sequencing and pace” seems to help a lot when restyling with different instruments (e.g. MIDI piano to choral).
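If you restyle a lot of tracks, the 'keep the structure' phrase can be tacked on automatically. Here's a tiny sketch of the idea. The function name `restyle_prompt` and the example style text are made up by me for illustration; only the "retain the exact sequencing and pace" wording comes from my own testing, and it's not any official prompt syntax.

```python
def restyle_prompt(target_style: str, keep_structure: bool = True) -> str:
    """Build a restyle prompt, optionally asking SA3 to keep the source structure."""
    # Tidy the style description so the appended clause reads cleanly.
    prompt = target_style.strip().rstrip(".,;")
    if keep_structure:
        # Phrase that seemed to reduce drifting away from the source track.
        prompt += ", and retain the exact sequencing and pace"
    return prompt


# Hypothetical example: restyling a MIDI piano piece as choral music.
print(restyle_prompt("Choral arrangement, cathedral reverb."))
```

The resulting text is then pasted into the prompt box as usual. It seems to help, but it's no guarantee that SA3 will stay on track.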
To analyse audio locally for a music description, and thus create useful SA3 prompts, I was told you need the Qwen2-Audio-7B-Instruct model. Note however that Jan.ai has only just introduced audio upload, in version 0.8.0 (22nd May 2026). So users running their GGUF models in the excellent desktop freeware Jan will first need to upgrade to 0.8.0. After that, delete any already-imported Qwen2-Audio-7B-Instruct and re-import it into Jan along with its MMPROJ, to enable the model’s audio comprehension.
As it turns out though, I find the Qwen2-Audio-7B-Instruct model useless for my purposes. Its very short generic responses are no good for converting into Stable Audio prompts (even when it’s told to assume the personality of a professional musicologist), and when asked for anything longer the model’s context runs out. Increasing the context size and restarting the model makes no difference. It’s a fairly small older model, and I suspect it just can’t handle writing 150 words about something the length of a three-minute song, even when compressed to an .MP3 smaller than 10MB. So it’s useless, for both reasons. Deleted.
So Stable Audio 3 is a keeper, partly because it produces excellent ambient and electronica. I’m keeping Stable Audio 1 around though, as it seems to handle sound-effect mixes better, e.g. a good field recording of a man walking through crispy autumn leaves, in a balanced mix with the sound of birds chirping.
Finally, note there’s now a local LoRA trainer for Stable Audio 3.
