---
language:
- nl
tags:
- automatic-speech-recognition
- mozilla-foundation/common_voice_8_0
- robust-speech-event
- model_for_talk
- nl
- nl_NL
- nl_BE
datasets:
- mozilla-foundation/common_voice_8_0
model-index:
- name: xls-r-nl-v1-cv8-lm
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 8
      type: mozilla-foundation/common_voice_8_0
      args: nl
    metrics:
    - name: Test WER
      type: wer
      value: 3.93
    - name: Test CER
      type: cer
      value: 1.22
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Robust Speech Event - Dev Data
      type: speech-recognition-community-v2/dev_data
      args: nl
    metrics:
    - name: Test WER
      type: wer
      value: 16.35
    - name: Test CER
      type: cer
      value: 9.64
---
# XLS-R-based CTC model with 5-gram language model from Open Subtitles
This model is a version of `facebook/wav2vec2-xls-r-2b-22-to-16` fine-tuned mainly on the CGN dataset, as well as the MOZILLA-FOUNDATION/COMMON_VOICE_8_0 - NL dataset (see details below), to which a large 5-gram language model based on the Open Subtitles Dutch corpus is added. This model achieves the following results on the evaluation set of Common Voice 8.0:
- WER: 0.03931
- CER: 0.01224
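For reference, error rates of this kind can be computed with a generic library such as `jiwer`; the snippet below is a minimal sketch with made-up reference/hypothesis pairs, not the repo's own `eval.py`:

```python
import jiwer

# Hypothetical reference/hypothesis pairs, for illustration only.
references = ["dit is een test", "hallo wereld"]
hypotheses = ["dit is een test", "hallo werelt"]

print("WER:", jiwer.wer(references, hypotheses))
print("CER:", jiwer.cer(references, hypotheses))
```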
**IMPORTANT NOTE**: The `hunspell` typo fixer is not enabled on the website, which therefore returns raw CTC+LM results. Hunspell reranking is only available in the `eval.py` decoding script. For best results, please use the code in that file when running the model locally for inference.
**IMPORTANT NOTE**: Evaluating this model requires `apt install libhunspell-dev` and a pip install of `hunspell`, in addition to pip installs of `pipy-kenlm` and `pyctcdecode` (see `install_requirements.sh`). In addition, the chunking length and stride were optimized for the model as `12s` and `2s` respectively (see `eval.sh`).
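For local inference with those chunking settings, something along these lines should work; this is a sketch that assumes the checkpoint is published under the model-index name above and uses a hypothetical audio file name:

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="FremyCompany/xls-r-nl-v1-cv8-lm",  # repo id assumed from the model-index name
    chunk_length_s=12,  # chunk length optimized for this model (see eval.sh)
    stride_length_s=2,  # stride optimized for this model (see eval.sh)
)
print(asr("some_dutch_audio.wav")["text"])  # hypothetical 16 kHz audio file
```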
## Model description
The model takes 16 kHz sound input and uses a `Wav2Vec2ForCTC` decoder with 48 letters to output letter-transcription probabilities per frame.
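As a minimal sketch of obtaining those per-frame probabilities and greedily decoding them without any language model (the repo id is assumed as above):

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("FremyCompany/xls-r-nl-v1-cv8-lm")
model = Wav2Vec2ForCTC.from_pretrained("FremyCompany/xls-r-nl-v1-cv8-lm")

waveform = np.zeros(16000, dtype=np.float32)  # placeholder: 1 s of 16 kHz mono audio
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, frames, vocabulary)

# Plain greedy CTC decoding of the per-frame letter probabilities:
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))
```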
To improve accuracy, a beam-search decoder based on `pyctcdecode` is then used; it reranks the most promising alignments based on a 5-gram language model trained on the Open Subtitles Dutch corpus.
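A hedged sketch of that beam-search step, reusing `processor` and `logits` from the previous snippet (the KenLM file name is an assumption for illustration, not an actual file in this repo):

```python
from pyctcdecode import build_ctcdecoder

# Labels must be ordered by token id so they line up with the logits' vocabulary axis.
vocab = processor.tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

decoder = build_ctcdecoder(labels, kenlm_model_path="nl_opensubtitles_5gram.arpa")
print(decoder.decode(logits[0].numpy()))
```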
To further deal with typos, `hunspell` is used to propose alternative spellings for words not in the unigrams of the language model. These alternatives are then reranked based on the language model described above, plus a penalty proportional to the Levenshtein edit distance between the alternative and the recognized word. This, for example, makes it possible to correct `collegas` into `collega's` or `gogol` into `google`.
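A minimal sketch of that reranking step, assuming a Dutch hunspell dictionary and a KenLM model at hypothetical paths (the real logic lives in `eval.py`):

```python
import hunspell  # requires libhunspell-dev
import kenlm     # installed via pipy-kenlm

lm = kenlm.Model("nl_opensubtitles_5gram.arpa")  # hypothetical path
spell = hunspell.HunSpell("/usr/share/hunspell/nl_NL.dic",  # hypothetical paths
                          "/usr/share/hunspell/nl_NL.aff")

def edit_distance(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def fix_word(word: str, context: str, penalty: float = 1.0) -> str:
    """Rerank hunspell suggestions by LM score minus an edit-distance penalty."""
    if spell.spell(word):
        return word
    best, best_score = word, lm.score(f"{context} {word}")
    for candidate in spell.suggest(word):
        score = lm.score(f"{context} {candidate}") - penalty * edit_distance(word, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best

print(fix_word("collegas", "ik zag mijn"))  # may yield "collega's"
```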
## Intended uses & limitations
This model can be used to transcribe spoken Dutch, from the Netherlands or Flanders, to text (without punctuation).
## Training and evaluation data
The model was:

- initialized with the 2B parameter model from Facebook (`facebook/wav2vec2-xls-r-2b-22-to-16`).
- trained `5` epochs (6000 iterations of batch size 32) on the `cv8/nl` dataset.
- trained `1` epoch (36000 iterations of batch size 32) on the `cgn` dataset.
- trained `5` epochs (6000 iterations of batch size 32) on the `cv8/nl` dataset.
## Framework versions
- Transformers 4.16.0
- Pytorch 1.10.2+cu102
- Datasets 1.18.3
- Tokenizers 0.11.0