---
license: apache-2.0
language:
- zh
library_name: transformers.js
pipeline_tag: text-to-speech
---

# VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

VITS is an end-to-end speech synthesis model that predicts a speech waveform conditioned on an input text sequence. It is a conditional variational autoencoder (VAE) composed of a posterior encoder, a decoder, and a conditional prior.
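
As a rough illustration (a hedged sketch, assuming this checkpoint loads with the standard `VitsModel` class used in the usage example below), you can enumerate the model's top-level submodules to see how these components appear in the Transformers implementation:

```py
from transformers import VitsModel

model = VitsModel.from_pretrained("BricksDisplay/vits-cmn")

# List the top-level submodules; among them you should find the text encoder,
# the flow-based conditional prior, the posterior encoder, and the decoder.
for name, module in model.named_children():
    print(f"{name}: {type(module).__name__}")
```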

## Model Details

- Languages: Chinese
- Dataset: THCHS-30
- Speakers: 44
- Training Hours: 48

## Usage

Using this checkpoint with Hugging Face Transformers:

```py
from transformers import VitsModel, VitsTokenizer
from pypinyin import lazy_pinyin, Style
import torch

model = VitsModel.from_pretrained("BricksDisplay/vits-cmn")
tokenizer = VitsTokenizer.from_pretrained("BricksDisplay/vits-cmn")

# The checkpoint expects pinyin input, so convert Chinese characters first.
text = "中文"
payload = ''.join(lazy_pinyin(text, style=Style.TONE, tone_sandhi=True))
inputs = tokenizer(payload, return_tensors="pt")

with torch.no_grad():
    output = model(**inputs, speaker_id=0)

# The model returns the synthesized waveform at a 16 kHz sampling rate.
from IPython.display import Audio
Audio(output.waveform[0].numpy(), rate=16000)
```
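
As a follow-up sketch (the filenames are illustrative, and `scipy` is an extra dependency not required by the model itself), the 44 speakers listed above can be sampled by sweeping `speaker_id` and writing each waveform to a WAV file:

```py
import numpy as np
from scipy.io import wavfile

for speaker_id in range(3):  # the card lists 44 speakers in total (ids 0-43)
    with torch.no_grad():
        out = model(**inputs, speaker_id=speaker_id)
    # out.waveform is a (batch, num_samples) float32 tensor at 16 kHz
    audio = out.waveform[0].cpu().numpy().astype(np.float32)
    wavfile.write(f"vits-cmn_speaker{speaker_id}.wav", 16000, audio)
```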

Using this checkpoint with Transformers.js:

```js
import { pipeline } from '@xenova/transformers';
import { pinyin } from 'pinyin-pro'; // convert Chinese characters to pinyin

const synthesizer = await pipeline('text-to-audio', 'BricksDisplay/vits-cmn', { quantized: false });
console.log(await synthesizer(pinyin("中文")));
// {
//   audio: Float32Array(?) [ ... ],
//   sampling_rate: 16000
// }
```

Note: the Transformers.js (ONNX) version does not support `speaker_id`, so it is fixed to 0.