Malayalam Text-to-Speech

This repository contains the Swaram (mal) text-to-speech (TTS) model checkpoint.

Model Details

Swaram (Stochastic Waveform Adaptive Recurrent Autoencoder for Malayalam) is a speech synthesis model that generates speech waveforms conditioned on input text sequences. It is based on a conditional variational autoencoder (VAE) architecture.

Swaram's text encoder is built on top of the Wav2Vec2 decoder. A VAE is used as the decoder. A flow-based module, composed of a Transformer-based Contextualizer and cascaded dense layers, predicts spectrogram-based acoustic features. The spectrogram is then converted into a speech waveform by a stack of transposed convolutional layers. To capture the one-to-many nature of TTS, where the same text can be spoken in multiple ways, the model also includes a stochastic duration predictor, which allows the same input text to be synthesized with different speech rhythms.
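The effect of a stochastic duration predictor can be illustrated with a toy sketch. This is not Swaram's actual predictor: the function names, the Gaussian noise model, and the example log-durations are all assumptions made for illustration. The point is only that sampling durations, rather than predicting a single value, yields different rhythms for the same text.

```python
import math
import random


def sample_durations(log_durations, noise_scale=0.8, length_scale=1.0, seed=None):
    """Toy stand-in for a stochastic duration predictor (hypothetical helper).

    Adds Gaussian noise to per-token log-durations, then converts them to
    a whole number of acoustic frames per token (at least one frame each).
    """
    rng = random.Random(seed)
    durations = []
    for log_d in log_durations:
        noisy = log_d + noise_scale * rng.gauss(0.0, 1.0)
        frames = math.ceil(math.exp(noisy) * length_scale)
        durations.append(max(1, frames))
    return durations


def expand(tokens, durations):
    """Repeat each token once per frame it occupies (length regulation)."""
    return [tok for tok, d in zip(tokens, durations) for _ in range(d)]


tokens = ["ക", "ട", "ല"]               # illustrative input tokens
mean_log_durations = [1.0, 1.5, 0.5]   # made-up values, not model output

# Different seeds can give different per-token durations for the same text.
for seed in (0, 1):
    d = sample_durations(mean_log_durations, seed=seed)
    print(seed, d, len(expand(tokens, d)))
```

Setting `noise_scale=0.0` in this sketch removes the randomness, so every call returns the same durations.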

Architecture

[Figure: Swaram architecture diagram]
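As a rough sketch of how a stack of transposed convolutions turns spectrogram frames into waveform samples, the snippet below computes the output length layer by layer using the standard output-length formula for a 1-D transposed convolution. The upsampling rates and kernel sizes are assumptions that mirror the default `VitsConfig` in transformers; this checkpoint's configuration may differ.

```python
def conv_transpose1d_out_len(length, kernel_size, stride, padding=0,
                             output_padding=0, dilation=1):
    """Output length of a 1-D transposed convolution."""
    return ((length - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)


# Assumed values, NOT read from the Swaram checkpoint.
UPSAMPLE_RATES = (8, 8, 2, 2)
UPSAMPLE_KERNEL_SIZES = (16, 16, 4, 4)


def waveform_length(num_frames, rates=UPSAMPLE_RATES, kernels=UPSAMPLE_KERNEL_SIZES):
    """Number of waveform samples produced from `num_frames` spectrogram frames."""
    length = num_frames
    for stride, kernel in zip(rates, kernels):
        # With this padding, each layer multiplies the length by exactly `stride`.
        padding = (kernel - stride) // 2
        length = conv_transpose1d_out_len(length, kernel, stride, padding)
    return length


print(waveform_length(100))  # 25600, i.e. 256 samples per frame
```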

Usage

First, install the latest versions of the required libraries:

pip install --upgrade transformers accelerate

Then, run inference with the following code snippet:

from transformers import VitsModel, AutoTokenizer
import torch

model = VitsModel.from_pretrained("aoxo/swaram")
tokenizer = AutoTokenizer.from_pretrained("aoxo/swaram")

text = "കള്ളാ കടയാടി മോനെ"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    output = model(**inputs).waveform

The resulting waveform can be saved as a .wav file:

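import scipy.io.wavfile

# The waveform has shape (batch, num_samples); take the first item and convert it to a NumPy array.
scipy.io.wavfile.write("kadayadi_mone.wav", rate=model.config.sampling_rate, data=output[0].cpu().numpy())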
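Some players and tools expect 16-bit PCM rather than floating-point WAV data. As an alternative, the sketch below writes 16-bit PCM using only the Python standard library. `write_pcm16_wav` is a hypothetical helper, not part of transformers or SciPy, and it assumes mono samples in the range [-1.0, 1.0].

```python
import struct
import wave


def write_pcm16_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        # Clip out-of-range values before scaling to the int16 range.
        clipped = max(-1.0, min(1.0, float(s)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))


# Usage with the objects from the inference snippet (not executed here):
# write_pcm16_wav("kadayadi_mone.wav", output[0].tolist(), model.config.sampling_rate)
```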

Or displayed in a Jupyter Notebook / Google Colab:

from IPython.display import Audio

Audio(output[0].cpu().numpy(), rate=model.config.sampling_rate)

License

The model is licensed under CC-BY-NC 4.0.

Model size: 36.3M parameters (Safetensors, F32 tensors)