EnCodec: High Fidelity Neural Audio Compression

AudioCraft provides the training code for EnCodec, a state-of-the-art deep-learning-based audio codec supporting both mono and stereo audio, presented in the High Fidelity Neural Audio Compression paper. Check out our sample page.

Original EnCodec models

The EnCodec models presented in High Fidelity Neural Audio Compression can be accessed and used with the EnCodec repository.

Note: We do not guarantee compatibility between the AudioCraft and EnCodec codebases and released checkpoints at this stage.

Installation

Please follow the AudioCraft installation instructions from the README.

Training

The CompressionSolver implements the audio reconstruction task to train an EnCodec model. Specifically, it trains an encoder-decoder with a quantization bottleneck (for EnCodec, a SEANet encoder-decoder with a Residual Vector Quantization bottleneck) using a combination of objective and perceptual losses, the latter in the form of discriminators.
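
To make the Residual Vector Quantization bottleneck more concrete, here is a minimal NumPy sketch of the general technique: each codebook quantizes the residual left by the previous one. The function names, the random codebooks and their sizes are illustrative assumptions; this is not AudioCraft's quantizer, which also learns its codebooks during training.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize vectors x [N, D] with a list of codebooks, each [K, D].

    Returns integer codes of shape [n_q, N], one row per codebook.
    """
    residual = x.astype(np.float64).copy()
    codes = []
    for codebook in codebooks:
        # Squared distance from every residual vector to every codeword: [N, K].
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        # The next codebook only sees what this one failed to represent.
        residual = residual - codebook[idx]
    return np.stack(codes)

def rvq_decode(codes, codebooks):
    """Sum the selected codewords of every codebook to rebuild the vectors."""
    return sum(codebook[idx] for codebook, idx in zip(codebooks, codes))

# Hypothetical data: 16 latent vectors of dimension 4, 4 random codebooks of 32 entries.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4))
codebooks = [rng.normal(size=(32, 4)) * 0.5 ** i for i in range(4)]

codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
```

Decoding with only the first few rows of `codes` gives a coarser reconstruction, which is how an RVQ codec can trade quality for bitrate.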

The default configuration matches a causal EnCodec training at a single bandwidth.
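
As a rough illustration of what "a single bandwidth" means for an RVQ codec, the bitrate follows from the frame rate, the number of codebooks and the codebook size. The helper name and the numeric values below (a total stride of 320 samples, 8 codebooks of 1024 entries) are assumptions chosen for the example, not values read from the AudioCraft configuration.

```python
import math

def bandwidth_kbps(sample_rate, hop_length, n_q, codebook_size):
    """Bitrate of an RVQ codec in kilobits per second."""
    frame_rate = sample_rate / hop_length          # latent frames per second
    bits_per_frame = n_q * math.log2(codebook_size)  # each codebook emits one index per frame
    return frame_rate * bits_per_frame / 1000

# Assumed example: 24 kHz audio, stride 320, 8 codebooks with 1024 entries each.
kbps = bandwidth_kbps(24_000, 320, 8, 1024)
```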

Example configuration and grids

We provide sample configurations and grids for training EnCodec models.

The compression configurations are defined in config/solver/compression.

The example grids are available at audiocraft/grids/compression.

# base causal encodec on monophonic audio sampled at 24 khz
dora grid compression.encodec_base_24khz
# encodec model used for MusicGen on monophonic audio sampled at 32 khz
dora grid compression.encodec_musicgen_32khz

Training and valid stages

The model is trained using a combination of objective and perceptual losses. More specifically, EnCodec is trained with the MS-STFT discriminator along with objective losses, and a loss balancer weights the different losses in an intuitive manner.
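
The idea behind a loss balancer can be sketched as follows: rather than weighting raw loss values, each loss's gradient with respect to the model output is normalized and then scaled by its share of the total weight, so a weight reads as a fraction of the overall gradient. This is a simplified sketch under that assumption; the function name and inputs are hypothetical, and AudioCraft's own balancer may differ, for example by averaging gradient norms over time.

```python
import numpy as np

def balance_gradients(grads, weights, total_norm=1.0, eps=1e-12):
    """Combine per-loss gradients so each contributes its weight's share.

    grads:   dict mapping loss name -> gradient w.r.t. the model output
    weights: dict mapping loss name -> non-negative weight
    """
    total_weight = sum(weights.values())
    combined = np.zeros_like(next(iter(grads.values())), dtype=np.float64)
    for name, grad in grads.items():
        norm = np.linalg.norm(grad)
        share = weights[name] / total_weight
        # Normalizing removes the raw scale of the loss from the picture.
        combined += total_norm * share * grad / (norm + eps)
    return combined

# Hypothetical gradients: the adversarial gradient is 40x larger in norm,
# yet it ends up contributing exactly 75% of the combined gradient.
grads = {"l1": np.array([3.0, 4.0]), "adv": np.array([0.0, 200.0])}
weights = {"l1": 1.0, "adv": 3.0}
combined = balance_gradients(grads, weights)
```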

Evaluation stage

Evaluation metrics for audio generation:

  • SI-SNR: Scale-Invariant Signal-to-Noise Ratio.
  • ViSQOL: Virtual Speech Quality Objective Listener.

Note: The path to the ViSQOL binary (compiled with Bazel) must be provided in order to run the ViSQOL metric on the reference and degraded signals. The metric is disabled by default. Please refer to the metrics documentation to learn more.
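
For reference, SI-SNR from the list above can be computed on a single mono signal as in the sketch below. It follows the standard definition (project the estimate onto the reference, then compare the energy of the projection with the energy of the remainder); the function name and test signals are illustrative, and AudioCraft's own metric implementation may differ in details such as batching.

```python
import numpy as np

def si_snr(reference, estimate, eps=1e-8):
    """Scale-Invariant Signal-to-Noise Ratio in dB for 1-D signals."""
    # Remove the mean so a constant offset does not count as signal or noise.
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to find the best-matching scale.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

# Hypothetical signals: a 5 Hz sine, lightly and heavily perturbed.
rng = np.random.default_rng(0)
t = np.arange(1000) / 1000
reference = np.sin(2 * np.pi * 5 * t)
perturbation = rng.normal(size=t.shape)
lightly_degraded = reference + 0.1 * perturbation
heavily_degraded = reference + 0.5 * perturbation
```

Because of the projection step, rescaling the estimate leaves the score essentially unchanged, which is what "scale-invariant" refers to.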

Generation stage

The generation stage generates the reconstructed audio from samples with the current model. The number of samples generated and the batch size used are controlled by the dataset.generate configuration. The output path and audio formats are defined in the generate stage configuration.

# generate samples every 5 epochs
dora run solver=compression/encodec_base_24khz generate.every=5
# write the generated samples to a different path
dora run solver=compression/encodec_base_24khz generate.path=<PATH_IN_DORA_XP_FOLDER>
# limit the number of samples or use a different batch size
dora run solver=compression/encodec_base_24khz dataset.generate.num_samples=10 dataset.generate.batch_size=4

Playing with the model

Once you have a model trained, it is possible to get the entire solver, or just the trained model with the following functions:

from audiocraft.solvers import CompressionSolver

# If you trained a custom model with signature SIG.
model = CompressionSolver.model_from_checkpoint('//sig/SIG')
# If you want to get one of the pretrained models with the `//pretrained/` prefix.
model = CompressionSolver.model_from_checkpoint('//pretrained/facebook/encodec_32khz')
# Or load from a custom checkpoint path
model = CompressionSolver.model_from_checkpoint('/my_checkpoints/foo/bar/checkpoint.th')


# If you only want to use a pretrained model, you can also directly get it
# from the CompressionModel base model class.
from audiocraft.models import CompressionModel

# Here do not put the `//pretrained/` prefix!
model = CompressionModel.get_pretrained('facebook/encodec_32khz')
model = CompressionModel.get_pretrained('dac_44khz')

# Finally, you can also retrieve the full Solver object, with its dataloader etc.
from audiocraft import train
from pathlib import Path
import logging
import os
import sys

# Optional: detailed logs when loading a Solver (comment out the next line to silence them).
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
# You must always run the following function from the root directory.
os.chdir(Path(train.__file__).parent.parent)


# You can also get the full solver (only for your own experiments).
# You can provide some overrides to the parameters to make things more convenient.
solver = train.get_solver_from_sig('SIG', {'device': 'cpu', 'dataset': {'batch_size': 8}})
solver.model
solver.dataloaders
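
Once a model is loaded, a typical first check is an encode/decode round trip. The helper below is a small sketch assuming the `encode` and `decode` methods of the CompressionModel interface, where `encode` returns the discrete codes together with an optional scale. The name `reconstruct` is hypothetical, and the commented usage requires AudioCraft and PyTorch to be installed.

```python
def reconstruct(model, wav):
    """Encode then decode a batch of audio shaped [batch, channels, time].

    Returns the discrete codes and the reconstructed waveform.
    """
    # `scale` is an optional rescaling factor; it may be None.
    codes, scale = model.encode(wav)
    return codes, model.decode(codes, scale)

# Usage with a real model (not executed here):
#   import torch
#   from audiocraft.models import CompressionModel
#   model = CompressionModel.get_pretrained('facebook/encodec_32khz')
#   wav = torch.zeros(1, model.channels, model.sample_rate)  # one second of silence
#   with torch.no_grad():
#       codes, wav_hat = reconstruct(model, wav)
```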

Importing / Exporting models

At the moment we do not have a definitive workflow for exporting EnCodec models, for instance to Hugging Face (HF). We are working on supporting automatic conversion between the AudioCraft and Hugging Face implementations.

We still have some support for fine-tuning an EnCodec model coming from HF in AudioCraft, using for instance continue_from=//pretrained/facebook/encodec_32khz.

An AudioCraft checkpoint can be exported in a more compact format (excluding the optimizer etc.) using audiocraft.utils.export.export_encodec. For instance, you could run:

from audiocraft.utils import export
from audiocraft import train
xp = train.main.get_xp_from_sig('SIG')
export.export_encodec(
    xp.folder / 'checkpoint.th',
    '/checkpoints/my_audio_lm/compression_state_dict.bin')


from audiocraft.models import CompressionModel
model = CompressionModel.get_pretrained('/checkpoints/my_audio_lm/compression_state_dict.bin')

from audiocraft.solvers import CompressionSolver
# The two are strictly equivalent, but this function also supports loading models that have not been exported yet.
model = CompressionSolver.model_from_checkpoint('//pretrained//checkpoints/my_audio_lm/compression_state_dict.bin')

The MusicGen documentation shows how to use this model as a tokenizer for MusicGen/AudioGen.

Learn more

Learn more about AudioCraft training pipelines in the dedicated section.

Citation

@article{defossez2022highfi,
  title={High Fidelity Neural Audio Compression},
  author={Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  journal={arXiv preprint arXiv:2210.13438},
  year={2022}
}

License

See license information in the README.