Wav2Vec2_XLS-R-300m_Nepali_ASR
This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on:
- [Large Nepali ASR training data set from OpenSLR (SLR-54)] (https://www.openslr.org/54/)
- [Common Voice Corpus 17.0] (https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
Model description
The model is a fine-tuned version of Wav2Vec2 XLS-R 300 million parameters version fine-tuned for Nepali Automatic Speech Recognition. The reported results are on the OpenSLR test split.
- WER on OpenSLR: 16.82%
- CER on OpenSLR: 2.72%
Intended uses & limitations
- Research on Nepali ASR
- Transcriptions on Nepali audio
- Further Fine-tuning
Limitations:
- The model is trained on the OpenSLR Nepali ASR dataset which upon inspection was found to be quite noisy and inconsistent.
- Due to resources limitations, utterances longer than 5 sec have been filtered out from the dataset during training and evaluation.
- Numerals have been filtered out as well.
- The vocabulary doesn't contain all the Nepali alphabets.
- Might perform poorly on audio segments longer than 5 seconds. Or, needs some mechanism to segment audio into 5 seconds chunks which might increase processing time.
- May struggle with background noises and overlapping speech.
Training and evaluation data
Common Voice v17.0
- This model has been fine-tuned on OpenSLR-54 (Nepali ASR training dataset) and CommonVoice Corpus v17.0
- Initially, the model was trained on CommonVoice v17.0 ne-NP which consists of about 2 hours of voice data of which 1 hours have been manually validated.
- We combined the
validated
andother
split first since the dataset is very small. So, we had a total of 1337 utterances. - We have preprocessed the data by removing all punctuations and symbols.
- Then, we used 80% of the total utterances for training and 10% for evaluation.
- And, we used the
test
split consisting of 217 utterances for testing. (It might have been present in thetrain split
as well.) - It was trained for 30 epochs. The WER started fluctuating around 37% to 39%.
OpenSLR Nepali ASR training data
- Then, it was further trained on the larger OpenSLR Nepali ASR training dataset which has 157,000 utterances.
- Firstly, the numerals were removed as the utterances were inconsistent with transcriptions.
- And, segments longer than 5 seconds were removed because of resource limitations.
- Less frequently used 'alphabets' were removed to reduce the vocabulary size.
- Finally, we ended up with 136083 utterances for whole dataset. The dataset has been uploaded here.
- 80% was used for training, 10% for evaluation and 10% for testing.
Training procedure
Training on CommonVoice 17.0
The following hyperparameters were used during training:
- learning_rate: 3e-04
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 400
- num_epochs: 30
- mixed_precision_training: Native AMP
Initial Training on OpenSLR-54 for 16 epochs
The following hyperparameters were used:
- learning_rate: 3e-04
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- warmup_steps: 500
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 16
- mixed_precision_training: Native AMP
Further Training on OpenSLR-54 for further 3 epochs
We used the following:
- learning_rate: 2e-5
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 700
- num_epochs: 3
- mixed_precision_training: Native AMP
Framework versions
- Transformers 4.44.2
- Pytorch 2.4.1+cu121
- Datasets 3.0.0
- Tokenizers 0.19.1
- Downloads last month
- 68
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social
visibility and check back later, or deploy to Inference Endpoints (dedicated)
instead.
Model tree for iamTangsang/Wav2Vec2_XLS-R-300m_Nepali_ASR
Base model
facebook/wav2vec2-xls-r-300m