Some experiences of AI-based tools for audio and music

The world of music has certainly not remained unaffected by the surge of AI that has been the major buzz for the past few years. Partly spurred by this development, I have been searching the web for AI-based services offered online, mainly during the fall and winter of 2023/2024. This has also been part of my preparations for a lecture in an introductory course at the Faculty of Humanities. In this post I present some of the lessons and findings from this search. One challenge has been the constant development of new services and features, which makes it very difficult to keep the information up to date. Another has been that AI and machine learning algorithms have become so ubiquitous that I have had to select a few types of tools among many. I have also focused on the tools that, in my opinion, have been the most useful and interesting, avoiding those that have come across as either unfinished or less serious. In future posts I will hopefully be able to update and expand what is presented here.

In sum, this overview is best seen as a snapshot of the state of the art in the following tool categories:


Tools using parameter/tag input

These tools generate music based on a selected set of categories or “tags” that limit the scope of the music generation. These are usually tags like genre, subgenre and/or style, other categories related to feeling and/or mood, and sometimes even types of usage/setting. Often, such tags are combined with certain musical parameters that the user can set. Tempo and key are among the most common of these, but parameters related to instrumentation, intensity or density are also sometimes available. The following tools have been tested and reviewed:


AIVA

Aiva, welcome page

  • Link to Service: Website
  • Academic Paper(s): None found (some of the following information about AIVA comes from a TED-talk by its founder, Pierre Barreau, here)
  • Model Architecture: Deep neural network, symbolic
  • Training Data Set: Music scores
  • Type of Output: Tracks up to 3 min (free), 5.5 min (Pro)
  • Type(s) of User Input: Audio/MIDI file (influence), instrumentation, key, emotion, duration, generation profiles, MIDI editor
  • API: No
  • Price(s)/plans: Free, Standard ($15/m), Pro ($49/m)
  • Platform(s): Web-based + App (has more advanced features)
  • Editing possibilities: Piano roll, simple MIDI-based DAW
  • Link to music/audio examples: Link to playlist (YouTube)
  • License:
  • Download File Formats: MIDI, MP3 (free), WAV (Pro)
  • Free version limitations: 3 downloads/month
  • Data Set Size: ca. 30k tracks

My review:

This tool, which is available both online and as a more sophisticated downloadable version, offers a simple DAW-like interface that allows for editing at the note level, as well as virtual instruments, volume and effects for individual voices. It has been around since 2017, and it also has some very convincing sound examples available on YouTube, recorded with professional musicians and orchestra. In informal polls I have done at a couple of talks about AI in music, people have often assumed that AIVA’s Among the stars was composed by a human composer.

On the whole, the musical results I have been able to achieve with AIVA have varied a great deal in musicality and quality: from songs that come across as pretty convincing (the minority), to those that have interesting parts but don’t work in their entirety, to results that are completely useless. My best results have still been quite far from those in AIVA’s own sound examples on YouTube. Thus, I have speculated whether those examples are the result of heavily curated “best examples” combined with some human “stitching together” to make the whole thing cohere. Or it might be, of course, that the creators of AIVA know the secrets of their tool so well that they can tweak the results to perfection. But if such tweaking is needed, one might argue that this goes against the general intention of AI-based music generation.

One big issue I have experienced in the musical results AIVA has generated for me is the lack of structure and development in the composition. Often, the music will repeat a certain structure for way too long without introducing new elements that drive the music forward, especially in longer compositions. Nevertheless, AIVA seems to be the tool with the most control over the generation of the musical structure, often with intro, outro, bridges and multiple sections that can be repeated or varied.

An issue that AIVA shares with many of its competitors is the problem with generating convincing melodies. The melodies often come across as extensions of chord elements or just lacking in salience, contour or dramaturgy. They often have an element of repetition and variation, but in ways that seem either contrived or plainly random.

What perhaps sets AIVA most apart from its competitors is its inclusion of quite extensive editing possibilities. These are, however, only available in the downloadable app. Here, you get access to the individual notes of the composition in a piano roll format, where you can move, cut, copy and paste existing notes along with creating new ones. Additionally, you have access to a relatively wide selection of (often sample-based) instruments, which can be combined and replaced.
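
To make the piano roll operations concrete, here is a minimal sketch of how such note-level editing can be modelled. It is not AIVA’s actual data model; the `Note` class and the `move`, `cut` and `copy_paste` helpers are hypothetical names chosen for illustration, and time is assumed to be measured in beats.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Note:
    pitch: int        # MIDI note number (60 = middle C)
    start: float      # onset, in beats
    duration: float   # length, in beats


def move(note, beats=0.0, semitones=0):
    """Return a copy of the note shifted in time and/or pitch."""
    return replace(note, start=note.start + beats, pitch=note.pitch + semitones)


def cut(notes, from_beat, to_beat):
    """Remove every note whose onset lies in [from_beat, to_beat)."""
    return [n for n in notes if not (from_beat <= n.start < to_beat)]


def copy_paste(notes, from_beat, to_beat, target_beat):
    """Copy notes with onsets in [from_beat, to_beat) and paste them at target_beat."""
    offset = target_beat - from_beat
    pasted = [move(n, beats=offset) for n in notes if from_beat <= n.start < to_beat]
    return sorted(notes + pasted, key=lambda n: (n.start, n.pitch))


# A three-note motif (C, E, G) repeated once, four beats later.
motif = [Note(60, 0.0, 1.0), Note(64, 1.0, 1.0), Note(67, 2.0, 2.0)]
extended = copy_paste(motif, 0.0, 4.0, 4.0)
```

Keeping notes immutable and returning new lists means every edit can be undone simply by holding on to the previous list, which is roughly what one expects from an editor of this kind.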

What I have often found limiting is that some of the imitative synths used by the model have an unmistakably MIDI-fied sound, often producing quite unidiomatic phrases. For some instruments, however, e.g. piano, the results are pretty good.

There are also standard mixing controls (volume, pan) and a small effects section, along with simple automation, octave and dynamics settings. Compared to a standard DAW, this is obviously not impressive, but it still gives you the opportunity to tweak results that are “almost there”.

Another feature I like in AIVA is that it allows for uploading influence tracks in either audio or MIDI format, which few other tools do. However, it seems that the parameters analysed from the influence tracks mostly relate to higher-level features, resulting in compositions that bear little resemblance to their influence tracks.

AIVA has quite sophisticated ways of tailoring the output through the so-called “style designer”. Here, you can set a wide range of musical parameters, including dynamic range, melodic phrase length, tempo range and even which dataset to use for the harmony. I have to admit that I didn’t spend a lot of time investigating how this actually affects the generated music, and perhaps the secret to making good tracks in AIVA lies in a careful design of such styles. However, exploiting this to the fullest would probably take both quite a bit of time and musical know-how.

One last thing I have to mention is that AIVA allows for generating tracks based on melodies as well as harmonies uploaded as MIDI files by the user.

Conclusion: Of the tools using parameter and audio/MIDI input, this is probably the most sophisticated of the ones I have reviewed. Although the quality of the generated music is quite variable, one can tweak quite a lot to improve the results. Thus, this tool would probably work best for users who already know a little about music production.


Boomy

Boomy, welcome page

  • Link to Service: Website
  • Academic Paper(s): None found
  • Model Architecture: No info
  • Training Data Set: No info
  • Type of Output: Audio streaming and download
  • Type(s) of User Input: Genre, style, drums, mixing, ambient, tempo, density of parts. Add voice (upload/record)
  • API: No
  • Price(s)/plans: Free, Creator ($9.99/month), Pro ($29.99)
  • Platform(s): Web-based
  • Editing possibilities: Edit structure (delete, copy, reorder measures)
  • Link to music/audio examples: Link to Spotify playlist
  • License: Personal use (free), non-commercial (Creator/Pro), commercial on podcasts and social media (Pro)
  • Download File Formats: 10 MP3 downloads/month (Creator), 25 WAV downloads/month (Pro)
  • Free version limitations: No download
  • Data Set Size: No info

My review:

  • Musical results are a lot more unpredictable than comparable tools. This might be interesting for some, but also sometimes leads to musical elements that don’t go well together or sound “random”. Also, results often sound more like sketches than finished songs.
  • Relatively many editing possibilities compared to the other tools, e.g. you can adjust the density of events for each of the stems.
  • Stands out from other tools in that you can provide vocals to the tracks, either by uploading a sound file or recording in the browser. However, this feature seems quite buggy.
  • Even allows for choosing different mix parameters, including sound quality (e.g. super clean/lo-fi), and type and amount of reverb
  • The five genres/styles you can choose from to begin making a song seem a bit arbitrary (Electronic Dance, Lo-fi, Relaxing Meditation, Rap Beats, Global Groove)
  • Only makes one song at a time
  • Also allows for adding non-musical elements (“sound effects” like waves, birds or traffic sounds).
  • Interesting that some genres/styles allow for variable tempo
  • Editing of the structure of the song only allows for cutting or copying shorter sections
  • Dolby remastering is available for purchase
  • Conclusion: Perhaps a better tool for getting ideas than something you would use for a finished product. A lot of what you make may be surprising, or come across as a bit “quirky” or “off”.

Loudly

Loudly, welcome page

  • Link to Service: Website
  • Academic Paper(s): None found
  • Model Architecture: No info
  • Training Data Set: No info
  • Type of Output: Audio streaming and download
  • Type(s) of User Input: Genre, energy (lo/mid/hi), tempo, duration (max 7 min.), key
  • API: No
  • Price(s)/plans: Free, Personal ($5.99), Pro ($14.99)
  • Platform(s): Web-based
  • Editing possibilities: No
  • Link to music/audio examples: Link to examples
  • License: License covers all social media (Free), livestreams, podcasts (Personal)
  • Download File Formats: WAV, MP3, stems (paid plans)
  • Free version limitations: No download
  • Data Set Size: Unknown

My review:

  • Sounds like it’s made in the same style as Soundraw: based on samples, with repeating blocks of music and some varying elements.
  • Has four structural templates, however it’s not possible to edit the structure itself.
  • Doesn’t always manage to combine elements that actually fit together – some sounds are quite odd.
  • Has more structural variation than Soundraw – sometimes new and quite surprising parts appear. But there’s a very low degree of recognition of “common” structures (e.g., intro - verse - bridge - chorus - verse…).
  • Some sections of the songs sometimes have almost no variation. When these include vocals, this can be quite irritating.
  • Genre labels don’t always seem entirely consistent.
  • Includes vocals, but the quality and meaningfulness are very variable.
  • The tracks often seem quite aimless and have little dynamics that feel meaningful.
  • Soundraw seems a notch sharper in sound, but this one sounds a lot better after final mixing.

Soundful

Soundful, welcome page

  • Link to Service: Website
  • Academic Paper(s): None found
  • Model Architecture: Not specified
  • Training Data Set: Not specified
  • Type of Output: Tracks, loops, stems
  • Type(s) of User Input: Genre/style, key, tempo
  • API: Yes
  • Price(s)/plans: Free, Content Creator ($29.99), Music Creator Plus ($59.99)
  • Platform(s): Web-based
  • Editing possibilities: Tempo, key
  • Link to music/audio examples: Link to examples, scroll down
  • License: Can be purchased as an extra service
  • Download File Formats: WAV, MP3
  • Free version limitations: 3 downloads/month
  • Data Set Size: No info

My review:

  • The web interface first and foremost appears like a site for stock music. You have to click an icon on one of the genres to be taken to the otherwise hidden “create a track” functionality.
  • Primarily takes genre and style as user input. Has a few main genres, then you choose a style/sub-genre.
  • Seems to follow the same pattern as Loudly and Soundraw: 8 or 16-bar sections that repeat.
  • Takes quite a bit of time to generate, and only creates one track at a time. Only produces a “preview”. Proper mixing needs to be rendered and takes several minutes.
  • Like Soundraw, often very repetitive. And the chord progressions are boring.
  • Relatively few editing options.
  • Quite irritating that it keeps urging you to purchase styles. As of April 2024, very few styles are available for free, e.g. for “Lo-fi” only 2 out of 10 styles were free.

Soundraw

Soundraw, edit sounds page

  • Link to Service: Website
  • Academic Paper(s): No paper
  • Model Architecture: No info
  • Training Data Set: No info
  • Type of Output: Audio streaming; download (w/plan)
  • Type(s) of User Input: Duration (-> 5 min), tempo, key, instrumentation, genre, mood, theme
  • API: API for businesses
  • Price(s)/plans: Free, Creator ($16.99/m), Artist ($29.99/m), API ($500/m)
  • Platform(s): Web-based
  • Editing possibilities: Simple editing in “pro mode”: add/delete bars, three levels of intensity per bar per instrument + fills
  • Link to music/audio examples: Link to examples (scroll down)
  • License: Royalty-free Creator and Artist plans, royalty ownership with API plan
  • Download File Formats: Unlimited for Creator plan, 30/m for Artist plan
  • Free version limitations: No download
  • Data Set Size: No info

My review:

  • Offers a rich array of electronic genres. There’s only one “Rock” option, though.
  • Appears to be based on sampler synths.
  • Sounds pretty decent, especially in terms of arrangement and balance between elements.
  • Simple editing of structure, making it easy to extend songs and vary overall dynamics.
  • Allows for adjustment of individual instrument levels.
  • Songs adhere to a fixed structure: 8- or 16-bar sections that repeat, resulting in variations of the same theme (A – A’ – A’’). Each section has three intensity levels (absent – soft – strong).
  • Due to the structural limitations the music tends to be predictable and might not offer many surprises.
  • Quickly generates 15 tracks at a time that are fast to load and play.
  • Not particularly strong in melody – the melody often feels more like an added flavor than a central element.
  • No vocal tracks included.
  • Conclusion: I imagine this tool can be well-suited for creating background music in audiovisual productions if one aims just to “fill out” a sonic background. As for the musical results, they are quite boring and predictable.
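
The block structure described in this review (repeating sections of the same theme, with one of three intensity levels per instrument) can be sketched as a small data structure. This is only an illustration of the observed behaviour, not Soundraw’s implementation; the instrument names, the 8-bar section length and the helpers `make_section`, `labels`, `energy` and `extend` are assumptions made for the example.

```python
from enum import IntEnum


class Intensity(IntEnum):
    ABSENT = 0
    SOFT = 1
    STRONG = 2


BARS_PER_SECTION = 8  # assumed here; the review mentions 8- or 16-bar sections


def make_section(**levels):
    """Build one section as a mapping from instrument name to intensity."""
    return {name: Intensity[level.upper()] for name, level in levels.items()}


def labels(sections):
    """Label the sections as variations of one theme: A, A', A'', ..."""
    return ["A" + "'" * i for i in range(len(sections))]


def energy(section):
    """Overall intensity of a section, from 0.0 (silent) to 1.0 (everything strong)."""
    return sum(section.values()) / (len(section) * Intensity.STRONG)


def extend(sections, index):
    """Lengthen the song by duplicating the section at `index` right after itself."""
    return sections[: index + 1] + [dict(sections[index])] + sections[index + 1:]


# A hypothetical three-section song that builds up gradually.
song = [
    make_section(drums="absent", bass="soft", melody="soft"),
    make_section(drums="soft", bass="soft", melody="strong"),
    make_section(drums="strong", bass="strong", melody="strong"),
]
longer = extend(song, 1)
total_bars = len(longer) * BARS_PER_SECTION
```

With so few degrees of freedom per section, the only variation available is which instruments are present and how strongly, which is consistent with the predictability noted above.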

Tools primarily using text prompt input

Since the boom of ChatGPT in 2022, building interfaces around text prompts has seemed like the next big thing. The promise of that kind of interface is that one can specify what is to be generated in almost infinite detail. The growing number of tools that allow for text prompts, however, has so far shown that there are major challenges and issues in using verbal user input. The tools presented here all allow for some kind of text input, but most of them combine this with tags, parametric settings and/or audio input. I haven’t included any reviews of these tools, but I hope to update this very soon. These are the tools I have tested (note that Google’s tool is omitted since it is not currently available in Norway):


Udio (beta)

Udio, welcome page

  • Link to Service: Website
  • Academic Paper(s): None found
  • Model Architecture: No info
  • Training Data Set: No info
  • Type of Output: Audio streaming and download
  • Type(s) of User Input: Text prompt, custom lyrics
  • API: No
  • Price(s)/plans: As of April 2024 only exists in a free beta version
  • Platform(s): Web based
  • Editing possibilities: Extend track, remix track (including specifying the degree of similarity to the original track)
  • Link to music/audio examples: Several curated examples are presented on the main website
  • License: In the current version (April 2024) the user has the non-exclusive right and license to access and display the content and materials of the service
  • Download File Formats: .mp3
  • Free version limitations: 33 seconds per generation, 600 generations per month
  • Data Set Size: No info

Suno

Suno, welcome page

  • Academic Paper(s): None found
  • Model Architecture: According to an interview in the Rolling Stone magazine, Suno is transformer based (see the interview here)
  • Training Data Set: No info
  • Type of Output: Audio streaming and download
  • Type(s) of User Input: Text prompt input. Custom lyrics, style and title can be set in Custom mode.
  • API: No
  • Price(s)/plans: Free (10 songs daily), Pro ($8/m, 500 songs/m) and Premiere ($24/m, 2000 songs/m)
  • Platform(s): Web based
  • Editing possibilities: Extend track, remix track (including specifying the degree of similarity to the original track)
  • Link to music/audio examples: Several curated examples are presented on the main website
  • License: Free version has non-commercial licensing, paid plans have “general commercial terms”.
  • Download File Formats: .mp3
  • Free version limitations: Up to 2 minutes per generation
  • Data Set Size: No info
  • Watermarking: The Suno blog informs that inaudible watermarking is implemented in the generated songs, so that it is in principle possible to detect if a song has been made with Suno (see the blog post here)

Stable Audio

Suno, generate page


Beatoven

  • Link to Service: Website
  • Academic Paper(s): None found
  • Model Architecture: No info
  • Training Data Set: No info
  • Type of Output: Audio streaming and download
  • Type(s) of User Input: Text prompt (beta), duration, tempo, genre, emotion
  • API: Yes
  • Price(s)/plans: Free, Pro ($6-20/mo), Buy minutes ($3/min)
  • Platform(s): Web-based
  • Editing possibilities: Volume automation. Recompose/make new emotion sections, edit instruments.
  • Link to music/audio examples: Link to Spotify artist with examples
  • License: Beatoven owns copyright, but free to use
  • Download File Formats: Track or stems with paid plan
  • Free version limitations: No download
  • Data Set Size: No info

Riffusion

  • Link to Service: Website
  • Academic Paper(s): None found
  • Model Architecture: Latent diffusion-based text-to-image generation model
  • Training Data Set: No info
  • Type of Output: Audio streaming
  • Type(s) of User Input: Text prompts (lyrics and sound)
  • API: No
  • Price(s)/plans: Free
  • Platform(s): Web-based
  • Editing possibilities: No
  • Link to music/audio examples: Link to examples
  • License: No info
  • Download File Formats: Riffs (= music/lyrics snippets of 12 s.)
  • Free version limitations: No
  • Data Set Size: No info

Mubert

  • Link to Service: Website
  • Academic Paper(s): None found
  • Model Architecture: No info
  • Training Data Set: No info
  • Type of Output: Track, loop, mix, jingle. Output is watermarked when streaming
  • Type(s) of User Input: Text prompts; image files; or genre, moods or activities. Duration (->45s.)
  • API: Yes
  • Price(s)/plans: Free, Creator ($14/m), Pro ($39/m), Business ($199/m)
  • Platform(s): Web-based
  • Editing possibilities: No
  • Link to music/audio examples: Link to examples (scroll down to Staff picks)
  • License: Free: copyright credited to Mubert; otherwise depends on plan
  • Download File Formats: MP3 download, lossless quality with paid plans
  • Free version limitations: Generate up to 25 tracks/mo, MP3 download
  • Data Set Size: No info

Splash Pro

  • Link to Service: Website
  • Academic Paper(s): None found
  • Model Architecture: No info
  • Training Data Set: According to their website, the Splash AI is trained on a library containing more than 100,000 loops (see this link).
  • Type of Output: Tracks with or without vocals. Free version only has 1 rapper.
  • Type(s) of User Input: Text prompts, BPM, key, mode, duration
  • API: Yes, with Enterprise plan
  • Price(s)/plans: Free, Starter ($6/m), Max ($29/m), Enterprise (no price)
  • Platform(s): Web-based
  • Editing possibilities: No
  • Link to music/audio examples: Link to examples
  • License: Unlimited commercial license with Starter and Max plans.
  • Download File Formats: MP3 download (Free/Starter), lossless .wav with Max plan.
  • Free version limitations: Only 1 rapper vocalist, MP3 download
  • Data Set Size: 100,000 loops (see Training Data Set)

TextToSample

  • Link to Service: Website
  • Academic Paper(s): None found
  • Model Architecture: MusicGen (Meta)
  • Training Data Set: Not specified
  • Type of Output: MIDI
  • Type(s) of User Input: Text prompts, melody (file/mic), audio file
  • API: No
  • Price(s)/plans: Free
  • Platform(s): VST plugin or standalone app (Win/Mac)
  • Editing possibilities: No examples provided
  • Link to music/audio examples: No
  • License: No
  • Download File Formats: No
  • Free version limitations: 12 s. of audio
  • Data Set Size: See MusicGen

MusicGen (demo)

  • Link to Service: Website
  • Academic Paper(s): Simple and Controllable Music Generation
  • Model Architecture: Single-stage transformer LM together with efficient token interleaving patterns
  • Training Data Set: Internal dataset of high-quality music tracks + ShutterStock and Pond5 music data
  • Type of Output: 15 sec. audio file (WAV)
  • Type(s) of User Input: Text prompts, melody (audio file/mic)
  • API: Yes
  • Price(s)/plans: Free demo + open source code
  • Platform(s): Website (demo)
  • Editing possibilities: No
  • Link to music/audio examples: Link to examples
  • License: No
  • Download File Formats: No
  • Free version limitations: 15 s. of audio
  • Data Set Size: Dataset of 10k high-quality music tracks + ShutterStock and Pond5 music data
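
The “token interleaving patterns” mentioned under Model Architecture concern how several parallel token streams (codebooks) from the audio tokenizer are arranged so that a single language model can predict them. One pattern discussed in the cited paper is the delay pattern, where codebook k is shifted k steps to the right. The following is a minimal sketch of that idea only, not code from the MusicGen implementation; the function names and the use of `None` as a padding token are assumptions made for illustration.

```python
PAD = None  # placeholder for "no token at this step" (an assumption of this sketch)


def delay_interleave(codes):
    """Shift codebook k right by k steps.

    `codes` is a list of K token lists of equal length T. The result has
    K rows of length T + K - 1, padded so that all rows stay aligned.
    """
    num_codebooks = len(codes)
    return [
        [PAD] * k + list(row) + [PAD] * (num_codebooks - 1 - k)
        for k, row in enumerate(codes)
    ]


def undo_delay(delayed):
    """Invert delay_interleave and recover the original K x T token grid."""
    num_codebooks = len(delayed)
    num_steps = len(delayed[0]) - (num_codebooks - 1)
    return [row[k:k + num_steps] for k, row in enumerate(delayed)]


# Two codebooks, three time steps.
codes = [[1, 2, 3], [4, 5, 6]]
delayed = delay_interleave(codes)
```

Reading `delayed` column by column, each step presents the model with the first codebook of the current frame together with the second codebook of the previous frame, so all codebooks can be predicted in parallel at the cost of only K - 1 extra steps.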

Tools for music analysis and classification

Cyanite

  • Link to Service: Website
  • Type(s) of User Input: YouTube link or file upload
  • Music analysis parameters/classes: BPM; key; predominant voice gender; voice presence profile; category tags and means: genre, subgenre, moods, mood advanced, character, movement, classical epoch; valence/arousal (means); energy level; emotional profile; energy dynamics; emotional dynamics; instruments (tags/presence); meter; musical era
  • Time-variant analysis: Genre/subgenre, mood, emotional profile/energy level, instrumentation, voice gender/voice presence profile
  • Other features: Keyword and free text search. Playlist matching. Results from either personal library or Spotify.
  • Download option: No
  • Platform(s): Website
  • API: Yes
  • Similarity search: Yes
  • Price(s)/plans: Free version, subscription (€19.90/month: 20 analyses/month), single analysis (€1.99/song)
  • Free version limitations: Only 5 analyses per month.

Sonoteller

  • Link to Service: Website
  • Type(s) of User Input: YouTube link
  • Music analysis parameters/classes: Genres, subgenres, moods, instruments, BPM, key, vocal register
  • Lyrics analysis: Summary, moods, themes, language and explicit.
  • Time-variant analysis: Genre/subgenre, mood, emotional profile/energy level, instrumentation, voice gender/voice presence profile
  • Other features: API version has sections analysis
  • Download option: No
  • Platform(s): Website
  • Price(s)/plans: Web version only has free option.
  • API: Yes. Also has test version available in browser for free
  • Free version limitations: No.

Chordify

  • Link to Service: Website
  • Type(s) of User Input: YouTube search, file upload
  • Analysis output: Chord symbols (per beat), chord visualization for guitar, ukulele, piano and mandolin.
  • Editing possibilities: Yes
  • Other features: Simplify chords. Pro only: Countdown, volume, loop, tempo, capo, transposition
  • Time-variant analysis: Chords (1/beat)
  • Download option: Only in Pro version (MIDI)
  • Price(s)/plans: Free (with ads). Premium (16 NOK/month or 192 NOK/year)
  • API: No
  • Free version limitations: 3 songs per day
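
The transposition and capo features listed above amount to simple arithmetic on chord roots. The sketch below shows the principle on a per-beat chord list; it is not Chordify’s code, the helper names are hypothetical, and it deliberately ignores slash chords, “no chord” markers and enharmonic spelling (every result is written with sharps).

```python
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
FLATS = {"Db": "C#", "Eb": "D#", "Gb": "F#", "Ab": "G#", "Bb": "A#"}


def split_chord(symbol):
    """Split a symbol such as 'F#m7' into its root ('F#') and quality ('m7')."""
    root_len = 2 if len(symbol) > 1 and symbol[1] in "#b" else 1
    return symbol[:root_len], symbol[root_len:]


def transpose(symbol, semitones):
    """Move the chord root up (or down, if negative) by a number of semitones."""
    root, quality = split_chord(symbol)
    root = FLATS.get(root, root)  # normalize flats to their sharp equivalents
    return NOTES[(NOTES.index(root) + semitones) % 12] + quality


def shapes_for_capo(chords, fret):
    """Chord shapes to finger with a capo on `fret` so the sounding chords stay the same."""
    return [transpose(chord, -fret) for chord in chords]


# One chord symbol per beat, transposed up a whole tone.
per_beat = ["Am", "Am", "F", "F", "C", "C", "G", "G"]
up_a_tone = [transpose(chord, 2) for chord in per_beat]
```

For example, a song in B flat major (Bb, Eb, F) played with a capo on the third fret can be fingered as G, C and D, which is what `shapes_for_capo(["Bb", "Eb", "F"], 3)` returns.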

Tools for stem/source separation

Musai

Fadr