Wav2Vec2 LJSpeech Gruut

Wav2Vec2 LJSpeech Gruut is an automatic speech recognition model based on the wav2vec 2.0 architecture. It is a fine-tuned version of Wav2Vec2-Base, trained on the LJSpeech Phonemes dataset.

Instead of predicting sequences of words, this model was trained to predict sequences of phonemes, e.g. ["h", "ɛ", "l", "ˈoʊ", "w", "ˈɚ", "l", "d"]. The model's vocabulary therefore consists of the IPA phonemes found in gruut.
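
The example sequence above carries IPA stress marks on some phonemes ("ˈoʊ", "ˈɚ"). As a minimal sketch of what a stress-free variant of that sequence looks like, the snippet below strips the primary (ˈ) and secondary (ˌ) stress marks from a phoneme list. The helper name `strip_stress` is hypothetical and not part of the model or of gruut.

```python
# IPA stress marks: primary (ˈ) and secondary (ˌ).
STRESS_MARKS = ("ˈ", "ˌ")


def strip_stress(phonemes):
    """Return a copy of the phoneme list with IPA stress marks removed."""
    stripped = []
    for phoneme in phonemes:
        for mark in STRESS_MARKS:
            phoneme = phoneme.replace(mark, "")
        stripped.append(phoneme)
    return stripped


# Example sequence taken from the paragraph above.
with_stress = ["h", "ɛ", "l", "ˈoʊ", "w", "ˈɚ", "l", "d"]
without_stress = strip_stress(with_stress)
print(" ".join(without_stress))  # h ɛ l oʊ w ɚ l d
```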

This model was trained using Hugging Face's PyTorch framework. All training was done on a Google Compute Engine VM with a Tesla A100 GPU. All scripts used for training can be found in the Files and versions tab, along with the training metrics logged via TensorBoard.

Model

| Model | #params | Arch. | Training/Validation data (text) |
| --- | --- | --- | --- |
| wav2vec2-ljspeech-gruut | 94M | wav2vec 2.0 | LJSpeech Phonemes Dataset |

Evaluation Results

The model achieves the following results on evaluation:

| Dataset | PER (w/o stress) | CER (w/o stress) |
| --- | --- | --- |
| LJSpeech Phonemes Test Data | 0.99% | 0.58% |
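
The card does not spell out how PER was computed. A minimal sketch, assuming PER is the usual edit-distance error rate taken over space-separated phoneme tokens (total substitutions, insertions, and deletions divided by the total number of reference phonemes), could look like this. The function names are hypothetical, and the strings are illustrative, not taken from the test set.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def phoneme_error_rate(references, hypotheses):
    """Total edit distance divided by total reference length."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        errors += edit_distance(ref_tokens, hyp_tokens)
        total += len(ref_tokens)
    return errors / total


refs = ["h ɛ l oʊ w ɚ l d"]
hyps = ["h ɛ l oʊ w ɚ d"]  # one phoneme ("l") dropped
print(phoneme_error_rate(refs, hyps))  # 0.125
```

Applying the same function to strings joined without spaces would give a character-level rate instead, which is one plausible reading of the CER column.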

Usage

from transformers import AutoProcessor, AutoModelForCTC, Wav2Vec2Processor
import librosa
import torch
from itertools import groupby
from datasets import load_dataset

def decode_phonemes(
    ids: torch.Tensor, processor: Wav2Vec2Processor, ignore_stress: bool = False
) -> str:
    """CTC-like decoding. First removes consecutive duplicates, then removes special tokens."""
    # removes consecutive duplicates
    ids = [id_ for id_, _ in groupby(ids)]

    special_token_ids = processor.tokenizer.all_special_ids + [
        processor.tokenizer.word_delimiter_token_id
    ]
    # converts id to token, skipping special tokens
    phonemes = [processor.decode(id_) for id_ in ids if id_ not in special_token_ids]

    # joins phonemes
    prediction = " ".join(phonemes)

    # whether to ignore IPA stress marks
    if ignore_stress:
        prediction = prediction.replace("ˈ", "").replace("ˌ", "")

    return prediction

checkpoint = "bookbot/wav2vec2-ljspeech-gruut"

model = AutoModelForCTC.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)
sr = processor.feature_extractor.sampling_rate

# load dummy dataset and read soundfiles
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
audio_array = ds[0]["audio"]["array"]

# or, read a single audio file
# audio_array, _ = librosa.load("myaudio.wav", sr=sr)

inputs = processor(audio_array, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs["input_values"]).logits

predicted_ids = torch.argmax(logits, dim=-1)
prediction = decode_phonemes(predicted_ids[0], processor, ignore_stress=True)
# => should give 'b ɪ k ʌ z j u ɚ z s l i p ɪ ŋ ɪ n s t ɛ d ə v k ɔ ŋ k ɚ ɪ ŋ ð ə l ʌ v l i ɹ z p ɹ ɪ n s ə s h æ z b ɪ k ʌ m ə v f ɪ t ə l w ɪ θ n b oʊ p ɹ ə ʃ æ ɡ i s ɪ t s ð ɛ ɹ ə k u ɪ ŋ d ʌ v'
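
To make the decoding step in `decode_phonemes` easier to follow, here is a minimal sketch of the same collapse-then-filter idea on plain integers, without loading the model. The frame values and `blank_id=0` are hypothetical; in the real model the special token ids come from the tokenizer.

```python
from itertools import groupby


def ctc_collapse(ids, blank_id):
    """Merge consecutive repeats, then drop the blank token."""
    merged = [id_ for id_, _ in groupby(ids)]
    return [id_ for id_ in merged if id_ != blank_id]


# Hypothetical frame-level argmax output; 0 stands in for the CTC blank token.
frames = [0, 5, 5, 0, 0, 7, 7, 7, 0, 5, 0, 5]
print(ctc_collapse(frames, blank_id=0))  # [5, 7, 5, 5]
```

The order matters: repeats are merged before blanks are removed, so the two trailing 5s separated by a blank survive as two distinct phonemes, while the adjacent 5s at the start merge into one.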

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 16
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 1000
  • num_epochs: 30.0
  • mixed_precision_training: Native AMP
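
As a rough illustration of how these settings interact, the sketch below computes the effective batch size and the learning rate produced by a linear schedule with warmup. It assumes a single device, a decay to zero at the final step, and 10440 total optimizer steps (30 epochs of 348 steps, as logged in the training results). The function `linear_schedule_lr` is a hypothetical stand-in written for this example, not the training script's own code.

```python
def linear_schedule_lr(step, base_lr=1e-4, warmup_steps=1000, total_steps=10440):
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


# Effective batch size: per-device batch size times gradient accumulation steps
# (single device assumed).
effective_batch_size = 16 * 2
print(effective_batch_size)  # 32

# Learning rate at the start, mid-warmup, the peak, and the final step.
for step in (0, 500, 1000, 10440):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the learning rate rises from zero to its peak of 0.0001 over the first 1000 steps, then falls linearly back to zero by step 10440.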

Training results

| Training Loss | Epoch | Step | Validation Loss | Wer | Cer |
| --- | --- | --- | --- | --- | --- |
| No log | 1.0 | 348 | 2.2818 | 1.0 | 1.0 |
| 2.6692 | 2.0 | 696 | 0.2045 | 0.0527 | 0.0299 |
| 0.2225 | 3.0 | 1044 | 0.1162 | 0.0319 | 0.0189 |
| 0.2225 | 4.0 | 1392 | 0.0927 | 0.0235 | 0.0147 |
| 0.0868 | 5.0 | 1740 | 0.0797 | 0.0218 | 0.0143 |
| 0.0598 | 6.0 | 2088 | 0.0715 | 0.0197 | 0.0128 |
| 0.0598 | 7.0 | 2436 | 0.0652 | 0.0160 | 0.0103 |
| 0.0447 | 8.0 | 2784 | 0.0571 | 0.0152 | 0.0095 |
| 0.0368 | 9.0 | 3132 | 0.0608 | 0.0163 | 0.0112 |
| 0.0368 | 10.0 | 3480 | 0.0586 | 0.0137 | 0.0083 |
| 0.0303 | 11.0 | 3828 | 0.0641 | 0.0141 | 0.0085 |
| 0.0273 | 12.0 | 4176 | 0.0656 | 0.0131 | 0.0079 |
| 0.0232 | 13.0 | 4524 | 0.0690 | 0.0133 | 0.0082 |
| 0.0232 | 14.0 | 4872 | 0.0598 | 0.0128 | 0.0079 |
| 0.0189 | 15.0 | 5220 | 0.0671 | 0.0121 | 0.0074 |
| 0.017 | 16.0 | 5568 | 0.0654 | 0.0114 | 0.0069 |
| 0.017 | 17.0 | 5916 | 0.0751 | 0.0118 | 0.0073 |
| 0.0146 | 18.0 | 6264 | 0.0653 | 0.0112 | 0.0068 |
| 0.0127 | 19.0 | 6612 | 0.0682 | 0.0112 | 0.0069 |
| 0.0127 | 20.0 | 6960 | 0.0678 | 0.0114 | 0.0068 |
| 0.0114 | 21.0 | 7308 | 0.0656 | 0.0111 | 0.0066 |
| 0.0101 | 22.0 | 7656 | 0.0669 | 0.0109 | 0.0066 |
| 0.0092 | 23.0 | 8004 | 0.0677 | 0.0108 | 0.0065 |
| 0.0092 | 24.0 | 8352 | 0.0653 | 0.0104 | 0.0063 |
| 0.0088 | 25.0 | 8700 | 0.0673 | 0.0102 | 0.0063 |
| 0.0074 | 26.0 | 9048 | 0.0669 | 0.0105 | 0.0064 |
| 0.0074 | 27.0 | 9396 | 0.0707 | 0.0101 | 0.0061 |
| 0.0066 | 28.0 | 9744 | 0.0673 | 0.0100 | 0.0060 |
| 0.0058 | 29.0 | 10092 | 0.0689 | 0.0100 | 0.0059 |
| 0.0058 | 30.0 | 10440 | 0.0683 | 0.0099 | 0.0058 |

Disclaimer

Do consider the biases which came from pre-training datasets that may be carried over into the results of this model.

Authors

Wav2Vec2 LJSpeech Gruut was trained and evaluated by Wilson Wongso. All computation and development were done on Google Cloud.

Framework versions

  • Transformers 4.26.0.dev0
  • PyTorch 1.10.0
  • Datasets 2.7.1
  • Tokenizers 0.13.2
  • Gruut 2.3.4