Model Summary

DAC auto-encoder models provide compact discrete tokenization of speech and audio signals. The tokens facilitate signal generation by cascaded generative AI models (e.g. multi-modal generative AI models) while still allowing high-quality reconstruction of the original signals. The current finetuned models improve upon the original DAC models by providing a more compact representation of wide-band speech signals while keeping reconstruction quality high. At a rate of 150-300 tokens per second (1500-3000 bps), the reconstructed speech is nearly indistinguishable from the original PCM signal. The evaluation used comprehensive English speech data covering different recording conditions, including studio settings.

Model | Speech Sample Rate | Codebooks | Bit Rate | Token Rate | Version
weights_24khz_3.0kbps_v1.0.pth | 24 kHz | 4 | 3 kbps | 300 Hz | 1.0
weights_24khz_1.5kbps_v1.0.pth | 24 kHz | 2 | 1.5 kbps | 150 Hz | 1.0
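
The token rates and bit rates in the table are linked by simple arithmetic, sketched below. The frame rate of 75 frames per second and the 1024-entry codebook (10 bits per token) are not stated in this card. They are assumptions inferred from the table figures: 300 tokens/s spread over 4 codebooks, and 3000 bps divided by 300 tokens/s.

```python
import math

# Assumed values, inferred from the table rather than stated in it.
FRAME_RATE_HZ = 75      # 300 tokens/s divided by 4 codebooks
CODEBOOK_SIZE = 1024    # 3000 bps / 300 tokens/s = 10 bits per token


def token_rate(n_codebooks: int) -> int:
    """Tokens per second: one token per codebook per frame."""
    return FRAME_RATE_HZ * n_codebooks


def bit_rate(n_codebooks: int) -> int:
    """Bits per second: each token carries log2(codebook size) bits."""
    bits_per_token = int(math.log2(CODEBOOK_SIZE))
    return token_rate(n_codebooks) * bits_per_token


for n in (2, 4):
    print(f"{n} codebooks: {token_rate(n)} tokens/s, {bit_rate(n)} bps")
```

Under these assumptions, halving the number of codebooks halves both the token rate and the bit rate, which matches the two rows of the table.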

Usage

  • Follow the DAC installation instructions

  • Clone the current repo

git clone https://huggingface.co/ibm/DAC.speech.v1.0
cd DAC.speech.v1.0

Compress audio

python3 -m dac encode /path/to/input --output /path/to/output/codes --weights_path weights_24khz_3.0kbps_v1.0.pth

This command creates .dac files with the same names as the input files. It also preserves the directory structure relative to the input root and re-creates it in the output directory. Run python3 -m dac encode --help for more options.
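
The directory mirroring described above can be illustrated with a short sketch. The helper `mirrored_output_path` and the example paths are hypothetical and are not part of the dac package. The sketch only shows the mapping that the description implies, not how the tool implements it.

```python
from pathlib import Path


def mirrored_output_path(input_file: str, input_root: str, output_root: str,
                         suffix: str = ".dac") -> Path:
    """Map an input file to the same relative location under output_root.

    Hypothetical helper for illustration only; not a dac API.
    """
    # Path of the file relative to the input root, e.g. spk1/utt1.wav
    relative = Path(input_file).relative_to(input_root)
    # Same relative path under the output root, with the new extension
    return (Path(output_root) / relative).with_suffix(suffix)


print(mirrored_output_path("/data/in/spk1/utt1.wav", "/data/in", "/data/out"))
```

Passing `suffix=".wav"` gives the corresponding mapping for the decode direction.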

Reconstruct audio from compressed codes

python3 -m dac decode /path/to/output/codes --output /path/to/reconstructed_input --weights_path weights_24khz_3.0kbps_v1.0.pth

This command creates .wav files with the same names as the input .dac files. It also preserves the directory structure relative to the input root and re-creates it in the output directory. Run python3 -m dac decode --help for more options.

Programmatic Usage

import dac
from audiotools import AudioSignal

# Load the finetuned model from a local weights file
model_path = 'weights_24khz_3.0kbps_v1.0.pth'
model = dac.DAC.load(model_path)

# Move the model to the GPU (use 'cpu' if CUDA is unavailable)
model.to('cuda')

# Load audio signal file
signal = AudioSignal('input.wav')

# Encode audio signal as one long file
# (may run out of GPU memory on long files)
signal.to(model.device)

x = model.preprocess(signal.audio_data, signal.sample_rate)
z, codes, latents, _, _ = model.encode(x)

# Decode audio signal
y = model.decode(z)

# Alternatively, use the `compress` and `decompress` functions
# to compress long files.

signal = signal.cpu()
x = model.compress(signal)

# Save and load to and from disk
x.save("compressed.dac")
x = dac.DACFile.load("compressed.dac")

# Decompress it back to an AudioSignal
y = model.decompress(x)

# Write to file
y.write('output.wav')
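
Both models in this repo operate on 24 kHz speech, so it can be useful to check an input file's sample rate before encoding. The sketch below uses only the Python standard library. The helper names are hypothetical, and the `wave` module reads uncompressed PCM WAV files only.

```python
import wave

MODEL_SAMPLE_RATE = 24000  # taken from the "24khz" in the weight file names


def wav_sample_rate(path: str) -> int:
    """Return the sample rate stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str) -> bool:
    """True if the file's sample rate differs from the model's sample rate."""
    return wav_sample_rate(path) != MODEL_SAMPLE_RATE
```

If `needs_resampling` returns True, the signal can be resampled before encoding, for example with the `resample` method of `AudioSignal`.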

Citing & Authors

If you find this model helpful, feel free to cite our publication Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer:

@inproceedings{shechtman24_interspeech,
  title     = {Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer},
  author    = {Slava Shechtman and Avihu Dekel},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {4174--4178},
  doi       = {10.21437/Interspeech.2024-2366},
  issn      = {2958-1796},
}

Model tree for ibm/DAC.speech.v1.0

Base model: descript/dac_24khz (this model is a finetuned version)
