Update README.md
README.md
@@ -5,3 +5,39 @@ language:
library_name: transformers.js
---
# VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
VITS is an end-to-end speech synthesis model that predicts a speech waveform conditional on an input text sequence. It is a conditional variational autoencoder (VAE) made up of a posterior encoder, a decoder, and a conditional prior.
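To see these building blocks in the released checkpoint, you can list the model's top-level sub-modules. This is a minimal inspection sketch, assuming the checkpoint loads with `VitsModel` as shown in the Usage section below:

```py
from transformers import VitsModel

model = VitsModel.from_pretrained("BricksDisplay/vits-cmn")

# Print the name and class of each top-level sub-module; the posterior
# encoder, prior/flow, and decoder described above should appear here.
for name, module in model.named_children():
    print(f"{name}: {type(module).__name__}")
```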
## Model Details
- Languages: Chinese
- Dataset: THCHS-30
- Speakers: 44
- Training Hours: 48
## Usage
To use this checkpoint with Hugging Face Transformers, first convert the input text to pinyin with `pypinyin`, then tokenize the romanized text and run the model:
```py
from transformers import VitsModel, VitsTokenizer
from pypinyin import lazy_pinyin, Style
import torch

model = VitsModel.from_pretrained("BricksDisplay/vits-cmn")
tokenizer = VitsTokenizer.from_pretrained("BricksDisplay/vits-cmn")

# Convert the Chinese text to tone-marked pinyin before tokenizing
text = "中文"
payload = ''.join(lazy_pinyin(text, style=Style.TONE, tone_sandhi=True))
inputs = tokenizer(payload, return_tensors="pt")

# speaker_id selects one of the model's 44 speakers
with torch.no_grad():
    output = model(**inputs, speaker_id=0)

# Play the generated 16 kHz waveform in a notebook
from IPython.display import Audio
Audio(output.waveform[0], rate=16000)
```
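To keep the audio rather than only playing it back, you can write the waveform to a WAV file, and you can pick a different voice by changing `speaker_id`. This is a minimal sketch, assuming `scipy` is installed, that the 44 speaker IDs run contiguously from 0 to 43, and that the example filename suits you:

```py
import scipy.io.wavfile

# Synthesize the same text with another speaker (assumes IDs 0-43 are valid)
with torch.no_grad():
    output = model(**inputs, speaker_id=1)

# Convert the waveform to a NumPy array and save it as a 16 kHz WAV file
waveform = output.waveform[0].cpu().numpy()
scipy.io.wavfile.write("vits-cmn-speaker1.wav", rate=16000, data=waveform)
```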