Jukebox
An innovative AI music composer by OpenAI
By: Cindy Wen
TL;DR
Jukebox by OpenAI is a neural network that generates a variety of music, including rudimentary singing, as raw audio rather than the MIDI (Musical Instrument Digital Interface) data used by earlier systems. The project pushes the boundaries of automatic music generation, but there is still a pronounced gap compared to human-made music.
The Breakdown
- Jukebox is an AI music composer by OpenAI that uses a neural network to generate novel music as raw audio in a diverse set of genres and artist styles
- Earlier AI-driven models could already generate music, such as OpenAI’s own MuseNet, but they were only capable of creating melodies in MIDI form rather than as raw audio
The Tech
- Jukebox was trained on 1.2 million songs, 600,000 of which were in English, spanning a variety of composers, musicians, and bands. Each song was paired with its corresponding lyrics and metadata from LyricWiki (the metadata included artist, genre, year of release, common moods, and associated playlist keywords)
- Jukebox works directly with raw audio: it samples new music in a compressed form and then upsamples it back to raw audio. In effect, the model “understands” how different genres, styles, and voices sound, and renders its interpretation as audio files
Image from OpenAI Blog: A convolutional neural network (CNN) encodes and compresses the raw audio, a transformer then generates novel compressed audio, and that output is upsampled back into raw audio
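To get a feel for why the compression step matters, here is a rough back-of-the-envelope sketch in Python. The 44.1 kHz sample rate and the 8x, 32x, and 128x compression factors are assumptions taken from OpenAI's blog post rather than from the summary above, and `raw_timesteps` and `compressed_lengths` are hypothetical helper names made up for this illustration.

```python
# Assumed values (from OpenAI's blog post, not from this summary).
SAMPLE_RATE_HZ = 44_100
COMPRESSION_FACTORS = {"bottom": 8, "middle": 32, "top": 128}


def raw_timesteps(seconds, sample_rate=SAMPLE_RATE_HZ):
    """Number of raw audio samples in a clip of the given length."""
    return int(seconds * sample_rate)


def compressed_lengths(seconds):
    """Approximate number of discrete codes per level for a clip."""
    n = raw_timesteps(seconds)
    return {level: n // factor for level, factor in COMPRESSION_FACTORS.items()}


four_minutes = 4 * 60
print(raw_timesteps(four_minutes))       # 10584000 raw samples
print(compressed_lengths(four_minutes))  # code counts per level
```

Under these assumptions, a 4-minute song is over 10 million raw samples, while the most compressed level is roughly 83 thousand codes, which is a far more manageable sequence length for a transformer.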
- The raw audio was first compressed into discrete codes using an autoencoder based on VQ-VAE-2 (a simplified variant of hierarchical VQ-VAE), and those codes were then modelled with autoregressive transformers (specifically, a variant of Sparse Transformers)
- VQ-VAE is a type of variational autoencoder that uses vector quantization and outputs discrete (instead of continuous) codes
- Hierarchical VQ-VAEs suffer from hierarchy collapse because they pair successive encoders with autoregressive decoders; VQ-VAE-2 avoids the issue by using feedforward encoders and decoders instead
- A transformer is a deep learning architecture, a type of neural network originally designed for machine translation and primarily applied in NLP (Natural Language Processing)
- The model learns, without supervision, to cluster similar artists and genres close together, as seen in the t-SNE below:
- Feel free to refer to OpenAI’s original blog post for a more detailed breakdown of how the transformers generate codes, artist and genre conditioning, lyric conditioning, and much more
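The vector quantization step mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration of nearest-neighbour codebook lookup, not Jukebox's actual implementation: the toy 2-D codebook, the latent values, and the function name `vector_quantize` are all assumptions made up for this example, and a real VQ-VAE learns both the encoder and the codebook during training.

```python
import numpy as np


def vector_quantize(latents, codebook):
    """Map each continuous latent vector to its nearest codebook entry.

    latents:  array of shape (T, D), one continuous vector per timestep
    codebook: array of shape (K, D), the K allowed code vectors
    Returns (codes, quantized): integer indices of shape (T,) and the
    corresponding codebook vectors of shape (T, D).
    """
    # Squared Euclidean distance from every latent to every code: (T, K)
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    codes = np.argmin(dists, axis=1)  # discrete code per timestep
    quantized = codebook[codes]       # look the code vectors back up
    return codes, quantized


# Toy example: 4 codes in 2 dimensions (hypothetical values).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
latents = np.array([[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]])

codes, quantized = vector_quantize(latents, codebook)
print(codes.tolist())  # [1, 2, 0]
```

The integer `codes` are the discrete sequence that an autoregressive model would then learn to predict, one code at a time, in place of raw audio samples.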
The Significance
- Jukebox was shared with an initial set of 10 musicians to explore possible applications and gather general feedback on the project. Unfortunately, many of them did not find Jukebox immediately applicable to their current music practices
- As similar generative modelling continues to advance in various fields, it is also necessary to research issues such as bias and intellectual property rights by engaging in productive conversations with experts in these fields
- Though Jukebox is a leap forward in musical quality, coherence, and the ability to condition on artist and genre, there is still a pronounced gap in comparison to music created by humans