---
license: cc-by-nc-4.0
tags:
- encodec
- audio
- music
- audiocraft
---
<a target="_blank" href="https://colab.research.google.com/drive/1JlTOjB-G0A2Hz3h8PK63vLZk4xdCI5QB?usp=sharing">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
<br>
# MultiBand Diffusion
<!-- Provide a quick summary of what the model is/does. -->
This repository contains the weights for Meta's MultiBand Diffusion models, described in this research paper: [From Discrete Tokens to High Fidelity Audio using MultiBand Diffusion][arxiv].
MultiBand Diffusion is a collection of 4 models that can decode tokens from the <a href="https://github.com/facebookresearch/encodec">EnCodec tokenizer</a> into waveform audio.
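As a mental model, MBD replaces EnCodec's built-in decoder: discrete tokens go in, a waveform comes out. Below is a minimal sketch of that round trip; the `facebook/encodec_24khz` checkpoint name and the `CompressionModel` loader are assumptions about the AudioCraft API, and the random input merely stands in for real audio.
```python
import torch
from audiocraft.models import CompressionModel, MultiBandDiffusion

codec = CompressionModel.get_pretrained('facebook/encodec_24khz')  # assumed checkpoint name
mbd = MultiBandDiffusion.get_mbd_24khz(bw=3.0)  # pick the decoder trained for this bitrate

wav = torch.randn(1, 1, 24000)  # [batch, channels, samples]: 1 s of dummy mono audio at 24 kHz
with torch.no_grad():
    codes, scale = codec.encode(wav)    # discrete EnCodec tokens, shape [B, n_codebooks, T]
    decoded = mbd.tokens_to_wav(codes)  # waveform reconstructed by the diffusion models
```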
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
- **Developed by:** Meta
- **Model type:** Diffusion Models
- **License:** The model weights in this repository are released under the CC-BY-NC 4.0 license.
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** [AudioCraft repo](https://github.com/facebookresearch/audiocraft/tree/main)
- **Paper:** [From Discrete Tokens to High Fidelity Audio using MultiBand Diffusion](https://dl.fbaipublicfiles.com/encodec/Diffusion/paper.pdf)
## Installation
Please follow the AudioCraft installation instructions from the [README](https://github.com/facebookresearch/audiocraft/blob/main/README.md).
## Usage
The [AudioCraft library](https://github.com/facebookresearch/audiocraft/tree/main) offers a number of ways to use MultiBand Diffusion:
1. The MusicGen demo includes a toggle to try the diffusion decoder. You can use the demo locally by running [`python -m demos.musicgen_app --share`](https://github.com/facebookresearch/audiocraft/tree/main/demos/musicgen_app.py), or through the [MusicGen Colab](https://colab.research.google.com/drive/1JlTOjB-G0A2Hz3h8PK63vLZk4xdCI5QB?usp=sharing).
2. You can play with MusicGen by running the Jupyter notebook at [`demos/musicgen_demo.ipynb`](https://github.com/facebookresearch/audiocraft/tree/main/demos/musicgen_demo.ipynb) locally (if you have a GPU).
## API
The [AudioCraft library](https://github.com/facebookresearch/audiocraft/tree/main) provides a simple API and pre-trained models for MusicGen and for EnCodec at 24 kHz for 3 bitrates (1.5 kbps, 3 kbps and 6 kbps).
Here is a quick example of using MultiBandDiffusion with the MusicGen API:
```python
import torchaudio
from audiocraft.models import MusicGen, MultiBandDiffusion
from audiocraft.data.audio import audio_write
model = MusicGen.get_pretrained('facebook/musicgen-melody')
mbd = MultiBandDiffusion.get_mbd_musicgen()
model.set_generation_params(duration=8) # generate 8 seconds.
wav, tokens = model.generate_unconditional(4, return_tokens=True) # generates 4 unconditional audio samples and keeps the tokens for MBD decoding
descriptions = ['happy rock', 'energetic EDM', 'sad jazz']
wav_diffusion = mbd.tokens_to_wav(tokens)
wav, tokens = model.generate(descriptions, return_tokens=True) # generates 3 samples and keeps the tokens
wav_diffusion = mbd.tokens_to_wav(tokens)
melody, sr = torchaudio.load('./assets/bach.mp3')
# Generates using the melody from the given audio and the provided descriptions, returns audio and audio tokens.
wav, tokens = model.generate_with_chroma(descriptions, melody[None].expand(3, -1, -1), sr, return_tokens=True)
wav_diffusion = mbd.tokens_to_wav(tokens)
for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav and {idx}_diffusion.wav, with loudness normalization at -14 dB LUFS for comparing the two decoders.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
    audio_write(f'{idx}_diffusion', wav_diffusion[idx].cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
```
For the compression task (and to compare with [EnCodec](https://github.com/facebookresearch/encodec)):
```python
import torch
from audiocraft.models import MultiBandDiffusion
from encodec import EncodecModel
from audiocraft.data.audio import audio_read, audio_write
bandwidth = 3.0 # 1.5, 3.0, 6.0
mbd = MultiBandDiffusion.get_mbd_24khz(bw=bandwidth)
encodec = EncodecModel.encodec_model_24khz()
encodec.set_target_bandwidth(bandwidth)  # compress at the same bitrate as MBD
somepath = ''
wav, sr = audio_read(somepath)
wav = wav.unsqueeze(0)  # both models expect a batch dimension: [batch, channels, samples]
with torch.no_grad():
    compressed_encodec = encodec(wav)
    compressed_diffusion = mbd.regenerate(wav, sample_rate=sr)
audio_write('sample_encodec', compressed_encodec.squeeze(0).cpu(), mbd.sample_rate, strategy="loudness", loudness_compressor=True)
audio_write('sample_diffusion', compressed_diffusion.squeeze(0).cpu(), mbd.sample_rate, strategy="loudness", loudness_compressor=True)
```
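Beyond listening, the two reconstructions can be scored against the input with a simple objective metric. The SI-SNR helper below is not part of AudioCraft; it is a standard definition sketched here for illustration, continuing from the variables of the snippet above and assuming all three signals share the same sample rate.
```python
import torch

def si_snr(estimate: torch.Tensor, reference: torch.Tensor) -> float:
    """Scale-invariant SNR in dB between two 1-D waveforms of equal length."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to factor out gain differences.
    projection = (torch.dot(estimate, reference) / torch.dot(reference, reference)) * reference
    noise = estimate - projection
    return (10 * torch.log10(projection.pow(2).sum() / noise.pow(2).sum())).item()

# Trim to a common length and flatten to mono; higher SI-SNR is better.
n = min(wav.numel(), compressed_encodec.numel(), compressed_diffusion.numel())
print('encodec:  ', si_snr(compressed_encodec.flatten()[:n], wav.flatten()[:n]))
print('diffusion:', si_snr(compressed_diffusion.flatten()[:n], wav.flatten()[:n]))
```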
## Training
A [DiffusionSolver](https://github.com/facebookresearch/audiocraft/tree/main/audiocraft/solvers/diffusion.py) implements Meta's diffusion training pipeline.
It generates waveform audio conditioned on the embeddings extracted from a pre-trained EnCodec model
(see the [EnCodec documentation from the AudioCraft library](https://github.com/facebookresearch/audiocraft/blob/main/docs/ENCODEC.md) for more details on how to train such a model).
Note that **the library does NOT provide any of the datasets** used for training our diffusion models.
We provide a dummy dataset containing just a few examples for illustrative purposes.
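To make the pipeline concrete, here is a toy, self-contained version of the training objective it implements: a denoiser is trained to predict the noise injected into a waveform, conditioned on embeddings from a frozen codec. The tiny model, shapes, and noise level below are illustrative assumptions, not AudioCraft's actual DiffusionSolver.
```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """A stand-in for the real diffusion network, kept to a single conv layer."""
    def __init__(self, cond_dim: int = 128):
        super().__init__()
        self.net = nn.Conv1d(1 + cond_dim, 1, kernel_size=5, padding=2)

    def forward(self, noisy_wav, cond):
        # Upsample the conditioning to the waveform rate and concatenate.
        cond = nn.functional.interpolate(cond, size=noisy_wav.shape[-1])
        return self.net(torch.cat([noisy_wav, cond], dim=1))

denoiser = ToyDenoiser()
wav = torch.randn(4, 1, 24000)  # a batch of dummy 1 s waveforms at 24 kHz
cond = torch.randn(4, 128, 75)  # stand-in for frozen EnCodec embeddings (75 frames/s)
sigma = 0.5                     # one fixed noise level; real training samples it per step
noise = torch.randn_like(wav)
loss = nn.functional.mse_loss(denoiser(wav + sigma * noise, cond), noise)
loss.backward()                 # one illustrative gradient step
```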
### Example configurations and grids
One can train diffusion models as described in the paper by using this [dora grid](https://github.com/facebookresearch/audiocraft/tree/main/audiocraft/grids/diffusion/4_bands_base_32khz.py).
```shell
# 4 bands MBD training
dora grid diffusion.4_bands_base_32khz
```
### Learn more
Learn more about AudioCraft training pipelines in the [dedicated section](https://github.com/facebookresearch/audiocraft/blob/main/docs/TRAINING.md).
## Citation
```bibtex
@article{sanroman2023fromdi,
  title={From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion},
  author={San Roman, Robin and Adi, Yossi and Deleforge, Antoine and Serizel, Romain and Synnaeve, Gabriel and Défossez, Alexandre},
  journal={arXiv preprint arXiv:},
  year={2023}
}
```
[arxiv]: https://dl.fbaipublicfiles.com/encodec/Diffusion/paper.pdf