The London-based startup Stability AI, best known for its open source image-generating model Stable Diffusion, is pushing deeper into generative audio with the launch of Stable Audio. The tool relies on a technique called latent diffusion to create high-quality music suitable for commercial use.
A year ago, Stability AI introduced Dance Diffusion, a model capable of generating songs and sound effects from text descriptions. The project lost momentum, however, after Harmonai, the research organization that built the model, stopped updating it, and the tool never gained a user-friendly interface.
Rekindling its commitment to audio, Stability AI has now unveiled Stable Audio, a tool that promises finer control over the content and duration of synthesized audio. It runs on a roughly 1.2-billion-parameter model trained on audio paired with metadata, including file durations and start times. Ed Newton-Rex, VP of audio at Stability AI, emphasized the company's mission to unlock human potential by developing foundational AI models across content modalities, including language, code and now music.
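To make the metadata point concrete, here is a minimal, hypothetical sketch of how duration and start-time metadata could be folded into a conditioning signal alongside a text embedding. The `TimingConditioner` module, the layer sizes and the simple concatenation scheme are illustrative assumptions, not a description of Stability AI's actual architecture.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: turn timing metadata (start time, duration) into a
# conditioning vector alongside a text embedding. The module name, layer sizes
# and concatenation scheme are assumptions, not Stability AI's architecture.
class TimingConditioner(nn.Module):
    def __init__(self, text_dim=512, cond_dim=256, max_seconds=300.0):
        super().__init__()
        self.max_seconds = max_seconds
        self.timing_proj = nn.Linear(2, cond_dim)    # embeds (start_time, duration)
        self.text_proj = nn.Linear(text_dim, cond_dim)

    def forward(self, text_emb, start_seconds, duration_seconds):
        # Normalize the metadata so the network sees values roughly in [0, 1].
        timing = torch.stack([start_seconds, duration_seconds], dim=-1) / self.max_seconds
        return torch.cat([self.text_proj(text_emb), self.timing_proj(timing)], dim=-1)

conditioner = TimingConditioner()
text_emb = torch.randn(1, 512)                       # stand-in for a text encoder's output
cond = conditioner(text_emb,
                   start_seconds=torch.tensor([0.0]),
                   duration_seconds=torch.tensor([90.0]))
print(cond.shape)  # torch.Size([1, 512])
```

Conditioning on these values during training is, plausibly, what lets a user request a specific track length at generation time.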
The development of Stable Audio was a collaboration between Stability's newly formed audio team and Harmonai. Users can guide generation by entering text prompts and setting desired durations, a significant upgrade over Dance Diffusion. The tool excels at beat-driven and ambient music, though it can also produce more experimental output in genres such as classical and jazz.
Despite these features, Stability has not said whether it plans to release the Stable Audio model as open source. For now, the tool is accessible through a web app, and samples shared by the company show coherent, melodic tracks across a range of genres.
Stable Audio stands out for its ability to stay coherent for up to 90 seconds, where other AI models tend to dissolve into discordant noise beyond short clips. The key is latent diffusion: generation begins with random noise in a compressed latent representation of audio, and the model removes a little of that noise at each step, nudging the result closer to the text description until a full track emerges.
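For readers who want to see what that step-by-step denoising looks like, here is a minimal sketch of a latent-diffusion sampling loop in PyTorch. The toy `LatentDenoiser`, the fake prompt encoder, the latent shape and the fixed 100-step schedule are all assumptions for illustration; Stable Audio's actual components are not public.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real components; the architecture, shapes and step
# count below are assumptions for illustration, not Stability AI's code.
class LatentDenoiser(nn.Module):
    """Predicts the noise present in a latent, given a timestep and a text condition."""
    def __init__(self, latent_dim=64, cond_dim=32):
        super().__init__()
        self.net = nn.Linear(latent_dim + cond_dim + 1, latent_dim)

    def forward(self, latent, t, cond):
        # latent: (batch, frames, latent_dim), cond: (batch, cond_dim), t: (batch, 1)
        frames = latent.shape[1]
        cond_exp = cond.unsqueeze(1).expand(-1, frames, -1)
        t_exp = t.view(-1, 1, 1).expand(-1, frames, 1)
        return self.net(torch.cat([latent, cond_exp, t_exp], dim=-1))

def encode_prompt(prompt, cond_dim=32):
    """Fake text encoder: a real system would use a learned model here."""
    torch.manual_seed(abs(hash(prompt)) % (2**31))
    return torch.randn(1, cond_dim)

denoiser = LatentDenoiser()
cond = encode_prompt("ambient pads, warm analog synths, 120 BPM")

# Sampling starts from pure noise in a compressed latent space (shape assumed),
# then each step removes a little predicted noise, steering toward the prompt.
latent = torch.randn(1, 1024, 64)
num_steps = 100
for step in reversed(range(1, num_steps + 1)):
    t = torch.full((1, 1), step / num_steps)
    predicted_noise = denoiser(latent, t, cond)
    latent = latent - predicted_noise / num_steps

# A real system would now decode `latent` back into a waveform with an
# audio autoencoder (the "latent" part of latent diffusion).
print(latent.shape)  # torch.Size([1, 1024, 64])
```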
The tool can also mimic sounds like a car passing by or a drum solo, showcasing its versatility. To train Stable Audio, Stability AI partnered with the commercial music library AudioSparx, utilizing around 800,000 songs from a catalog of largely independent artists.
Despite the promising start, the tool faces potential legal challenges, as it does not filter out prompts that could lead to copyright infringement. Newton-Rex acknowledged that the tool is limited by its training data and said the company is working to implement content-authenticity standards and watermarking so that AI-assisted content can be identified.
Stable Audio uses a tiered subscription model. The $11.99-per-month Pro tier lets users generate 500 commercializable tracks of up to 90 seconds each per month, while the free tier is capped at 20 non-commercializable tracks of up to 20 seconds per month. The terms of service reserve Stability's right to use customer data, including prompts and generated songs, for purposes that include developing future models and services.
The collaboration with AudioSparx involves revenue sharing, with artists on the platform having the option to share in the profits generated by Stable Audio. However, the exact details of the revenue-sharing agreement remain undisclosed.
Stability AI, which recently raised $25 million through a convertible note, bringing its total funding to more than $125 million, hopes Stable Audio can help turn around its fortunes amid low revenue and a high burn rate. The startup, last valued at $1 billion, aims to quadruple that valuation in the coming months, though the goal looks like a long shot given the hurdles it still has to clear.