---
license: mit
tags:
- DAC
- Descript Audio Codec
- PyTorch
---
# Descript Audio Codec (DAC)
DAC is a state-of-the-art neural audio codec (tokenizer) that improves upon earlier codecs such as SoundStream and EnCodec.
This model card provides an easy-to-use API for a *pretrained DAC* [1] for 16 kHz audio; the backbone and pretrained weights come from [the original repository](https://github.com/descriptinc/descript-audio-codec). With this API, you can encode and decode audio with a single line of code, on either CPU or GPU. Furthermore, it supports chunk-based processing for memory-efficient encoding and decoding, which is especially important when running on a GPU.
### Model variations
There are three model variants, depending on the sampling rate of the input audio.
| Model | Input audio sampling rate |
| ------------------ | ----------------- |
| [`hance-ai/descript-audio-codec-44khz`](https://huggingface.co/hance-ai/descript-audio-codec-44khz) | 44.1 kHz |
| [`hance-ai/descript-audio-codec-24khz`](https://huggingface.co/hance-ai/descript-audio-codec-24khz) | 24 kHz |
| [`hance-ai/descript-audio-codec-16khz`](https://huggingface.co/hance-ai/descript-audio-codec-16khz) | 16 kHz |
# Usage
### Load
```python
from transformers import AutoModel
# device setting
device = 'cpu' # or 'cuda:0'
# load
model = AutoModel.from_pretrained('hance-ai/descript-audio-codec-16khz', trust_remote_code=True)
model.to(device)
```
### Encode
```python
audio_filename = 'path/example_audio.wav'
zq, s = model.encode(audio_filename)
```
`zq` contains the discrete (quantized) embeddings with shape (1, num_RVQ_codebooks, token_length), and `s` is the corresponding token sequence with shape (1, num_RVQ_codebooks, token_length).
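As a quick sanity check, you can inspect the returned tensors. This is only a sketch, assuming both outputs are `torch.Tensor`s; the exact `token_length` depends on the duration of the input audio.
```python
print(zq.shape)  # (1, num_RVQ_codebooks, token_length) - quantized embeddings
print(s.shape)   # (1, num_RVQ_codebooks, token_length) - integer token indices
```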
### Decode
```python
# decoding from `zq`
waveform = model.decode(zq=zq) # (1, 1, audio_length); the output has a mono channel.
# decoding from `s`
waveform = model.decode(s=s) # (1, 1, audio_length); the output has a mono channel.
```
### Save a waveform as an audio file
```python
model.waveform_to_audiofile(waveform, 'out.wav')
```
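If you prefer to handle file writing yourself, the same tensor can also be written with `torchaudio`. This is only a sketch, assuming `waveform` is a `torch.Tensor` of shape (1, 1, audio_length) as described above and that you are using this 16 kHz variant; the output filename is a placeholder.
```python
import torchaudio

# drop the batch dimension -> (1, audio_length), move to CPU, and write a mono 16 kHz file
torchaudio.save('out_torchaudio.wav', waveform.squeeze(0).detach().cpu(), sample_rate=16000)
```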
### Save and load tokens
```python
model.save_tensor(s, 'tokens.pt')
loaded_s = model.load_tensor('tokens.pt')
```
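Putting the pieces together, a typical round trip looks like the sketch below. It reuses only the calls shown above; the file paths are placeholders.
```python
# encode an audio file and persist the token sequence
zq, s = model.encode('path/example_audio.wav')
model.save_tensor(s, 'tokens.pt')

# later: reload the tokens, decode them, and write the reconstructed audio to disk
loaded_s = model.load_tensor('tokens.pt')
waveform = model.decode(s=loaded_s)
model.waveform_to_audiofile(waveform, 'reconstructed.wav')
```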
# References
[1] Kumar, Rithesh, et al. "High-Fidelity Audio Compression with Improved RVQGAN." Advances in Neural Information Processing Systems 36 (2024).
<!-- contributions
- chunk processing
- add device parameter in the test notebook
-->