With AudioCraft, we simplify the overall design of generative models for audio compared to prior work.
Both MusicGen and AudioGen consist of a single autoregressive Language Model (LM) that operates over streams of compressed discrete music representations, i.e., tokens. We introduce a simple approach to leverage the internal structure of the parallel streams of tokens and show that, with a single model and an elegant token interleaving pattern, our approach efficiently models audio sequences, capturing long-term dependencies in the audio while generating high-quality output.
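To make the interleaving idea concrete, here is a minimal sketch of a "delay"-style pattern over K parallel codebook streams. All names (`delay_interleave`, the `PAD` sentinel) are illustrative, not AudioCraft's actual API; the real models use learned special tokens and tensors rather than Python lists.

```python
def delay_interleave(streams, pad=-1):
    """Interleave K parallel token streams so codebook k is delayed
    by k steps, letting one autoregressive LM model them jointly.

    `streams` is a list of K equal-length token lists; `pad` fills
    positions where a delayed stream has not started (or has ended).
    This is a sketch of the general idea, not AudioCraft's code.
    """
    K = len(streams)
    T = len(streams[0])
    out = []
    # Output spans T + K - 1 steps; at step t, codebook k emits
    # streams[k][t - k] when that index is valid, else the pad token.
    for t in range(T + K - 1):
        frame = [streams[k][t - k] if 0 <= t - k < T else pad
                 for k in range(K)]
        out.append(frame)
    return out

# Two streams of three tokens each become four interleaved frames:
frames = delay_interleave([[1, 2, 3], [4, 5, 6]])
```

Because each frame only depends on earlier positions of the other streams, a single LM can predict all K codebooks per step while still respecting their internal structure.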
Our models leverage the EnCodec neural audio codec to learn discrete audio tokens from the raw waveform. EnCodec maps the audio signal to one or several parallel streams of discrete tokens. We then use a single autoregressive language model to recursively model these tokens. The generated tokens are fed to the EnCodec decoder, which maps them back to the audio space to produce the output waveform. Finally, different types of conditioning models can be used to control the generation, such as a pretrained text encoder for text-to-audio applications.
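The encode → model → decode pipeline above can be sketched end to end with toy stand-ins. Every function here (`encode`, `lm_generate`, `decode`) is a deliberately simplified placeholder for illustration only; the real system uses EnCodec's learned quantizers and a transformer LM, not these heuristics.

```python
def encode(waveform, codebook_size=4):
    # Toy stand-in for a neural codec: quantize each sample into a
    # discrete token (EnCodec instead uses learned residual VQ).
    return [int(abs(x) * (codebook_size - 1)) % codebook_size
            for x in waveform]

def lm_generate(prompt_tokens, n_new, codebook_size=4):
    # Toy stand-in for the autoregressive LM: each new token depends
    # on the previous one (a real LM conditions on the full context,
    # plus optional conditioning such as a text embedding).
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        tokens.append((tokens[-1] + 1) % codebook_size)
    return tokens

def decode(tokens, codebook_size=4):
    # Toy stand-in for the codec decoder: map tokens back to a
    # waveform-like signal in [0, 1].
    return [t / (codebook_size - 1) for t in tokens]

# Encode a short "waveform", extend it autoregressively, decode back.
waveform = [0.1, 0.5, 0.9, 0.3]
tokens = encode(waveform)
generated = lm_generate(tokens, n_new=4)
audio_out = decode(generated)
```

The point of the sketch is the data flow: generation happens entirely in the discrete token space, and audio is only reconstructed at the end by the decoder.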