---

license: mit
tags:
- DAC
- Descript Audio Codec
- PyTorch
---


# Descript Audio Codec (DAC)
DAC (Descript Audio Codec) is a state-of-the-art neural audio tokenizer that improves upon previous tokenizers such as SoundStream and EnCodec.

This model card provides an easy-to-use API for a *pretrained DAC* [1] for 16 kHz audio, whose backbone and pretrained weights come from [the original repository](https://github.com/descriptinc/descript-audio-codec). With this API, you can encode and decode audio with a single line of code, on either CPU or GPU. Furthermore, it supports chunk-based processing for memory efficiency, which is especially important when running on a GPU.
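The idea behind chunk-based processing can be illustrated with a minimal, hypothetical sketch (this is not the model's actual implementation; `encode_in_chunks` and the toy encoder below are stand-ins): the waveform is split along the time axis, each chunk is encoded separately, and the resulting token sequences are concatenated, so only one chunk needs to be resident at a time.

```python
import torch

def encode_in_chunks(encode_fn, waveform, chunk_size):
    """Encode a (1, 1, T) waveform chunk by chunk, then concatenate the
    per-chunk outputs along the time axis. `encode_fn` stands in for the
    model's encoder."""
    chunks = torch.split(waveform, chunk_size, dim=-1)
    return torch.cat([encode_fn(c) for c in chunks], dim=-1)

# Toy "encoder" that just strides by a hop size of 320, for shape checking.
toy_encode = lambda x: x[..., ::320]

wav = torch.zeros(1, 1, 16000)  # 1 second of 16 kHz audio
tokens = encode_in_chunks(toy_encode, wav, chunk_size=3200)
print(tokens.shape)  # torch.Size([1, 1, 50])
```

Because each chunk is encoded independently, peak memory scales with `chunk_size` rather than the full audio length.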

### Model variations
There are three model variants, one per input audio sampling rate.

| Model | Input audio sampling rate |
| ------------------ | ----------------- |
| [`hance-ai/descript-audio-codec-44khz`](https://huggingface.co/hance-ai/descript-audio-codec-44khz) | 44.1 kHz |
| [`hance-ai/descript-audio-codec-24khz`](https://huggingface.co/hance-ai/descript-audio-codec-24khz) | 24 kHz |
| [`hance-ai/descript-audio-codec-16khz`](https://huggingface.co/hance-ai/descript-audio-codec-16khz) | 16 kHz |
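To pick a variant programmatically, a small lookup keyed on the audio's sampling rate may help (the `checkpoint_for` helper below is hypothetical, not part of the API):

```python
# Hypothetical mapping from sampling rate (Hz) to the matching checkpoint.
DAC_CHECKPOINTS = {
    44100: 'hance-ai/descript-audio-codec-44khz',
    24000: 'hance-ai/descript-audio-codec-24khz',
    16000: 'hance-ai/descript-audio-codec-16khz',
}

def checkpoint_for(sample_rate: int) -> str:
    """Return the checkpoint name for a given sampling rate, or raise if
    no variant matches (in which case the audio should be resampled)."""
    try:
        return DAC_CHECKPOINTS[sample_rate]
    except KeyError:
        raise ValueError(f'no DAC variant for {sample_rate} Hz; resample first')

print(checkpoint_for(16000))  # hance-ai/descript-audio-codec-16khz
```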

# Usage

### Load
```python
from transformers import AutoModel

# device setting
device = 'cpu'  # or 'cuda:0'

# load
model = AutoModel.from_pretrained('hance-ai/descript-audio-codec-16khz', trust_remote_code=True)
model.to(device)
```

### Encode
```python
audio_filename = 'path/example_audio.wav'
zq, s = model.encode(audio_filename)
```
`zq` contains the discrete (quantized) embeddings with shape (1, num_RVQ_codebooks, token_length), and `s` is the corresponding token sequence with shape (1, num_RVQ_codebooks, token_length).


### Decode
```python
# decoding from `zq`
waveform = model.decode(zq=zq)  # (1, 1, audio_length); the output is mono.

# decoding from `s`
waveform = model.decode(s=s)  # (1, 1, audio_length); the output is mono.
```

### Save a waveform as an audio file
```python
model.waveform_to_audiofile(waveform, 'out.wav')
```

### Save and load tokens
```python
model.save_tensor(s, 'tokens.pt')
loaded_s = model.load_tensor('tokens.pt')
```
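These helpers presumably wrap plain PyTorch serialization; if you prefer not to depend on them, an equivalent sketch using `torch.save`/`torch.load` (assumption: the tokens are ordinary tensors) looks like this:

```python
import os
import tempfile

import torch

# Dummy token sequence with a DAC-like shape: (batch, codebooks, time).
s = torch.randint(0, 1024, (1, 12, 200))
path = os.path.join(tempfile.gettempdir(), 'tokens.pt')

torch.save(s, path)        # serialize the tensor to disk
loaded_s = torch.load(path)  # restore it

assert torch.equal(s, loaded_s)
```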

# References
[1] Kumar, Rithesh, et al. "High-fidelity audio compression with improved RVQGAN." Advances in Neural Information Processing Systems 36 (2024).



<!-- contributions 
- chunk processing 
- add device parameter in the test notebook
-->