|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
- ca |
|
datasets: |
|
- mythicinfinity/libritts_r |
|
- projecte-aina/festcat_trimmed_denoised |
|
- projecte-aina/openslr-slr69-ca-trimmed-denoised |
|
- keithito/lj_speech |
|
base_model: |
|
- facebook/encodec_24khz |
|
--- |
|
|
|
# Wavenext-encodec |
|
|
|
<!-- Provide a quick summary of what the model is/does. --> |
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
<!-- Provide a longer summary of what this model is. --> |
|
Wavenext is a modification of Vocos in which the final ISTFT layer is replaced with a trainable linear layer that directly predicts speech waveform samples.
|
|
|
This version of Wavenext uses encodec tokens as input features. It is trained using the following encodec bandwidths: 1.5, 3.0, 6.0 and 12.0 kbps.
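
For intuition, below is a minimal, illustrative sketch of the idea behind the WaveNeXt output stage: each frame of backbone features is projected by a trainable linear layer to a block of waveform samples, instead of going through an ISTFT. The `hidden_dim` and `hop_length` values here are assumptions for illustration, not the exact configuration of this checkpoint.

```python
import torch
import torch.nn as nn

class LinearWaveformHead(nn.Module):
    """Illustrative WaveNeXt-style head: project each frame of features to
    hop_length waveform samples and concatenate the frames (no ISTFT)."""

    def __init__(self, hidden_dim: int = 512, hop_length: int = 320):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hop_length)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, frames, hidden_dim) from the ConvNeXt backbone
        frames = self.proj(features)        # (batch, frames, hop_length)
        return frames.flatten(start_dim=1)  # (batch, frames * hop_length)

# 200 feature frames -> 200 * 320 = 64000 waveform samples
head = LinearWaveformHead()
audio = head(torch.randn(1, 200, 512))
```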
|
|
|
## Intended Uses and Limitations
|
|
|
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. --> |
|
The model is intended to serve as a vocoder that synthesizes audio waveforms from encodec discrete codes. It is trained to generate speech; if it is used on other audio domains, it may not produce high-quality samples.
|
|
|
## Usage |
|
### Installation |
|
|
|
To use Wavenext only in inference mode, install it using: |
|
|
|
```bash |
|
pip install git+https://github.com/langtech-bsc/wavenext_pytorch |
|
``` |
|
|
|
### Reconstruct audio from encodec tokens |
|
|
|
You need to provide a `bandwidth_id`, which is the index of the desired bandwidth in the list [1.5, 3.0, 6.0, 12.0].
|
|
|
```python |
|
import torch |
|
|
|
from vocos import Vocos |
|
|
|
vocos = Vocos.from_pretrained("BSC-LT/wavenext-encodec") |
|
|
|
audio_tokens = torch.randint(low=0, high=1024, size=(8, 200)) # 8 codebooks, 200 frames
|
features = vocos.codes_to_features(audio_tokens) |
|
bandwidth_id = torch.tensor([2]) # 6 kbps |
|
|
|
audio = vocos.decode(features, bandwidth_id=bandwidth_id) |
|
|
|
``` |
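
The example above uses random tokens. To reconstruct speech from real encodec codes, you can extract them with any encodec implementation that produces 24 kHz codes; the sketch below uses the `transformers` implementation of `facebook/encodec_24khz` (an assumption, not a requirement of this model), and the audio file name is a placeholder.

```python
import torch
import torchaudio
from transformers import AutoProcessor, EncodecModel
from vocos import Vocos

vocos = Vocos.from_pretrained("BSC-LT/wavenext-encodec")
encodec = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

# Load the audio, mix down to mono and resample to encodec's 24 kHz rate
wav, sr = torchaudio.load("speech.wav")
wav = torchaudio.functional.resample(wav.mean(dim=0), orig_freq=sr, new_freq=24000)

# Encode at 6 kbps (8 codebooks)
inputs = processor(raw_audio=wav.numpy(), sampling_rate=24000, return_tensors="pt")
encoded = encodec.encode(inputs["input_values"], inputs["padding_mask"], bandwidth=6.0)
codes = encoded.audio_codes[0, 0]  # (num_codebooks, num_frames)

# Decode with Wavenext; bandwidth_id=2 selects 6 kbps
features = vocos.codes_to_features(codes)
audio = vocos.decode(features, bandwidth_id=torch.tensor([2]))
```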
|
|
|
Copy-synthesis from a file: |
|
|
|
```python |
|
import torchaudio |
|
|
|
y, sr = torchaudio.load(YOUR_AUDIO_FILE) |
|
if y.size(0) > 1: # mix to mono |
|
y = y.mean(dim=0, keepdim=True) |
|
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000) |
|
y_hat = vocos(y, bandwidth_id=bandwidth_id) |
|
``` |
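
To listen to the result, the reconstructed waveform can be written back to disk, for example with torchaudio (the output file name is just a placeholder):

```python
torchaudio.save("reconstruction.wav", y_hat, sample_rate=24000)
```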
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. --> |
|
|
|
The model was trained on 4 speech datasets:
|
|
|
| Dataset | Language | Hours | |
|
|---------------------|----------|---------| |
|
| LibriTTS-R           | en       | 585     |
|
| LJSpeech | en | 24 | |
|
| Festcat | ca | 22 | |
|
| OpenSLR69 | ca | 5 | |
|
|
|
|
|
### Training Procedure |
|
|
|
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. --> |
|
The model was trained for 1M steps (99 epochs) with a batch size of 16 for stability. We used a cosine scheduler with an initial learning rate of 1e-4.
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
* initial_learning_rate: 1e-4 |
|
* scheduler: cosine without warmup or restarts |
|
* mel_loss_coeff: 45 |
|
* mrd_loss_coeff: 0.1 |
|
* batch_size: 16 |
|
* num_samples: 16384 |
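
As a rough illustration of the schedule described above, here is a minimal sketch of an equivalent cosine-annealing setup in PyTorch; the optimizer choice (AdamW) is an assumption, and the actual trainer configuration in the wavenext_pytorch repository may differ in details.

```python
import torch

model = torch.nn.Linear(512, 512)  # placeholder module standing in for the vocoder

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Cosine decay without warmup or restarts over the full 1M training steps
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1_000_000)

for step in range(1_000):  # training loop placeholder
    optimizer.step()
    scheduler.step()
```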
|
|
|
## Evaluation |
|
|
|
<!-- This section describes the evaluation protocols and provides the results. --> |
|
|
|
Evaluation was done using the metrics from the original Vocos repository. Note that these metrics are calculated using the codes corresponding to a bandwidth of 1.5 kbps. After 99 epochs we achieve:
|
|
|
* val_loss: 5.52 |
|
* f1_score: 0.93 |
|
* mel_loss: 0.53 |
|
* periodicity_loss: 0.14
|
* pesq_score: 2.12 |
|
* pitch_loss: 47.73 |
|
* utmos_score: 2.89 |
|
|
|
|
|
## Citation |
|
|
|
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. --> |
|
|
|
If this code contributes to your research, please cite the following works:
|
|
|
``` |
|
@INPROCEEDINGS{10389765, |
|
author={Okamoto, Takuma and Yamashita, Haruki and Ohtani, Yamato and Toda, Tomoki and Kawai, Hisashi}, |
|
booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)}, |
|
title={WaveNeXt: ConvNeXt-Based Fast Neural Vocoder Without ISTFT layer}, |
|
year={2023}, |
|
volume={}, |
|
number={}, |
|
pages={1-8}, |
|
keywords={Fourier transforms;Vocoders;Conferences;Automatic speech recognition;ConvNext;end-to-end text-to-speech;linear layer-based upsampling;neural vocoder;Vocos}, |
|
doi={10.1109/ASRU57964.2023.10389765}} |
|
|
|
@article{siuzdak2023vocos, |
|
title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis}, |
|
author={Siuzdak, Hubert}, |
|
journal={arXiv preprint arXiv:2306.00814}, |
|
year={2023} |
|
} |
|
``` |
|
|
|
## Additional information |
|
|
|
### Author |
|
The Language Technologies Unit from Barcelona Supercomputing Center. |
|
|
|
### Contact |
|
For further information, please send an email to <[email protected]>. |
|
|
|
### Copyright |
|
Copyright(c) 2024 by Language Technologies Unit, Barcelona Supercomputing Center. |
|
|
|
### License |
|
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
|
|
|
### Funding |
|
|
|
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/). |