Environmental sound generation

AudioGen: Textually-Guided Audio Generation

With AudioGen, we demonstrated that we can train AI models to perform the task of text-to-audio generation. AudioGen is an auto-regressive transformer language model model conditioned on text: Given a textual description of an acoustic scene, the model can generate the environmental sound corresponding to the description with realistic recording conditions and complex scene context.

Text-to-sound generation

AudioGen is trained for the task of text-to-sound generation. Given a text prompt, it generates 5 seconds of audio adhering to the provided text description.

Whistling with wind blowing

Sirens and a humming engine approach and pass

A duck quacking as birds chirp and a pigeon cooing

Railroad crossing signal followed by a train passing and blowing horn

Typing on a typewriter