VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
VITS is an end-to-end speech synthesis model that predicts a speech waveform conditioned on an input text sequence. It is a conditional variational autoencoder (VAE) composed of a posterior encoder, a decoder, and a conditional prior. This repository contains the weights for the official VITS checkpoint trained on the LJ Speech dataset.
VITS ISTFT: The new decoder synthesizes speech as natural as that synthesized by VITS while achieving a real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than the original VITS. This makes it suitable for real-time and edge-device applications.
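The real-time factor (RTF) quoted above is the ratio of synthesis time to the duration of the generated audio, so values below 1 mean faster than real time. The following is a minimal sketch of that calculation; the helper name `real_time_factor` and the example numbers are illustrative assumptions, not measurements from this checkpoint.

```python
def real_time_factor(synthesis_seconds: float, num_samples: int, sampling_rate: int) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if num_samples <= 0 or sampling_rate <= 0:
        raise ValueError("num_samples and sampling_rate must be positive")
    audio_seconds = num_samples / sampling_rate
    return synthesis_seconds / audio_seconds


# Illustrative numbers only (not a measurement): 10 s of audio at 22,050 Hz
# generated in 0.66 s of wall-clock time corresponds to an RTF of about 0.066.
rtf = real_time_factor(synthesis_seconds=0.66, num_samples=220_500, sampling_rate=22_050)
print(f"RTF = {rtf:.3f}")
```

To measure it on your own hardware, one option is to time the model call with `time.perf_counter()`, take the sample count from the last dimension of the output waveform, and read the sampling rate from `model.config.sampling_rate`.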
Checkpoint | Train Hours | Speakers |
---|---|---|
ljspeech_vits_ms_istft | 24 | 1 |
ljspeech_vits_mb_istft | 24 | 1 |
ljspeech_vits_istft | 24 | 1 |
Usage
To use this checkpoint, first install the latest versions of the Transformers and Accelerate libraries:
pip install --upgrade transformers accelerate
Then, run inference with the following code snippet:
from transformers import AutoModel, AutoTokenizer
import torch
import numpy as np
model = AutoModel.from_pretrained("anhnct/ljspeech_vits_ms_istft", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("anhnct/ljspeech_vits_ms_istft")
text = "Hey, it's Hugging Face on the phone"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    output = model(**inputs).waveform
The resulting waveform can be saved as a .wav file:
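import scipy.io.wavfile  # import the submodule explicitly; a bare "import scipy" may not load it

# Convert the tensor to a NumPy array and drop the batch dimension
data_np = output.numpy()
data_np_squeezed = np.squeeze(data_np)
scipy.io.wavfile.write("techno.wav", rate=model.config.sampling_rate, data=data_np_squeezed)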
Or displayed in a Jupyter Notebook / Google Colab:
from IPython.display import Audio
Audio(data_np_squeezed, rate=model.config.sampling_rate)
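The export above passes a floating-point array to SciPy, which writes a floating-point WAV file that some players may not open. If that is a problem, a 16-bit PCM file is a common alternative. The sketch below uses only the Python standard library; the helper name `write_pcm16_wav`, the output filename, and the stand-in samples are assumptions for illustration, and it assumes a mono waveform with values roughly in [-1.0, 1.0].

```python
import struct
import wave


def write_pcm16_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    # Clip to the valid range, then scale to the signed 16-bit integer range.
    pcm = [int(round(max(-1.0, min(1.0, float(s))) * 32767)) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        # "<h" is little-endian signed 16-bit, as the WAV format expects.
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return len(pcm)


# Stand-in data and sampling rate for illustration; out-of-range values are clipped.
n_written = write_pcm16_wav("example_pcm16.wav", [0.0, 0.5, -0.5, 1.2], 22050)
```

With the model output, the same helper could be called with `data_np_squeezed.tolist()` and `model.config.sampling_rate` in place of the stand-in values.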
License
The model is licensed as MIT.