---
license: apache-2.0
language:
- zh
library_name: transformers.js
pipeline_tag: text-to-speech
---
# VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
VITS is an end-to-end speech synthesis model that predicts a speech waveform conditioned on an input text sequence. It is a conditional variational autoencoder (VAE) composed of a posterior encoder, a decoder, and a conditional prior.
## Model Details
- **Languages:** Chinese
- **Dataset:** THCHS-30
- **Speakers:** 44
- **Training hours:** 48
## Usage
Using this checkpoint from Hugging Face Transformers:
```py
from transformers import VitsModel, VitsTokenizer
from pypinyin import lazy_pinyin, Style
import torch

model = VitsModel.from_pretrained("BricksDisplay/vits-cmn")
tokenizer = VitsTokenizer.from_pretrained("BricksDisplay/vits-cmn")

# Convert Chinese characters to pinyin with tone marks before tokenizing
text = "中文"
payload = ''.join(lazy_pinyin(text, style=Style.TONE, tone_sandhi=True))

inputs = tokenizer(payload, return_tensors="pt")
with torch.no_grad():
    output = model(**inputs, speaker_id=0)

from IPython.display import Audio
Audio(output.waveform[0], rate=16000)
```
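To save the generated audio to disk instead of playing it inline, one option is to write it out as a WAV file. A minimal sketch, assuming `scipy` is installed (the file name is arbitrary):

```py
import scipy.io.wavfile

# `output.waveform` has shape (batch, samples); convert the first item to NumPy before writing
scipy.io.wavfile.write("vits_cmn.wav", rate=16000, data=output.waveform[0].numpy())
```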
Using this checkpoint from Transformers.js:
```js
import { pipeline } from '@xenova/transformers';
import { pinyin } from 'pinyin-pro'; // convert Chinese characters to pinyin first, e.g. with `pinyin-pro`

const synthesizer = await pipeline('text-to-audio', 'BricksDisplay/vits-cmn', { quantized: false });
console.log(await synthesizer(pinyin("中文")));
// {
// audio: Float32Array(?) [ ... ],
// sampling_rate: 16000
// }
```
Note: The Transformers.js (ONNX) version does not support `speaker_id`, so the speaker is fixed to 0.
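If you need a voice other than speaker 0, the Python API above does accept `speaker_id`. A minimal sketch, assuming the 44 speakers listed under Model Details are indexed 0–43 and reusing `model`, `tokenizer`, and `inputs` from the Python example:

```py
import torch

# Synthesize the same input with the first three speakers (the index range is an assumption)
waveforms = []
for speaker_id in range(3):
    with torch.no_grad():
        output = model(**inputs, speaker_id=speaker_id)
    waveforms.append(output.waveform[0])
```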