simonsr committed on
Commit
6534936
·
1 Parent(s): 3bcd43c

Create README.md

Created README.md (sans evaluation results)

Files changed (1)
  1. README.md +127 -0
README.md ADDED
@@ -0,0 +1,127 @@
+ ---
+ language: nl
+ datasets:
+ - common_voice
+ metrics:
+ - wer
+ tags:
+ - audio
+ - automatic-speech-recognition
+ - speech
+ - xlsr-fine-tuning-week
+ license: apache-2.0
+ model-index:
+ - name: simonsr XLSR Wav2Vec2 Large 53
+   results:
+   - task:
+       name: Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Common Voice nl
+       type: common_voice
+       args: nl
+     metrics:
+     - name: Test WER
+       type: wer
+       value: {wer_result_on_test} #TODO (IMPORTANT): replace {wer_result_on_test} with the WER you achieved on the common_voice test set, in the format XX.XX (without the % sign). **Please** fill in this value after evaluating your model so that it appears on the leaderboard. If you fill out this model card before evaluating your model, remember to edit it afterward.
+ ---
+
+ # Wav2Vec2-Large-XLSR-53-Dutch
+
+ Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Dutch using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset.
+
+ When using this model, make sure that your speech input is sampled at 16 kHz.
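
Editor's note: the 16 kHz requirement can be checked before any audio reaches the model. The following is a minimal sketch, assuming the input is an uncompressed WAV file held in memory (Common Voice itself ships compressed clips, so this applies to your own recordings). The helper names `wav_sample_rate`, `needs_resampling`, and `make_silent_wav` are hypothetical and not part of the model card; only the Python standard library is used.

```python
import io
import wave

TARGET_RATE = 16_000  # sampling rate the model card asks for


def wav_sample_rate(wav_bytes: bytes) -> int:
    """Return the sampling rate stored in the header of a WAV file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(wav_bytes: bytes, target_rate: int = TARGET_RATE) -> bool:
    """True when the audio must be resampled before it is fed to the model."""
    return wav_sample_rate(wav_bytes) != target_rate


def make_silent_wav(rate: int, seconds: float = 0.1) -> bytes:
    """Build a short mono 16-bit WAV of silence (hypothetical demo input)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * int(rate * seconds))
    return buffer.getvalue()


if __name__ == "__main__":
    for rate in (48_000, 16_000):
        clip = make_silent_wav(rate)
        print(rate, "needs resampling:", needs_resampling(clip))
```

A clip that reports any rate other than 16 000 Hz should be resampled first, for example with a resampling transform such as the `torchaudio.transforms.Resample` call used in the model card's own scripts.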
+
+ ## Usage
+
+ The model can be used directly (without a language model) as follows:
+
+ ```python
+ import torch
+ import torchaudio
+ from datasets import load_dataset
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+
+ test_dataset = load_dataset("common_voice", "nl", split="test[:2%]")
+
+ processor = Wav2Vec2Processor.from_pretrained("simonsr/wav2vec2-large-xlsr-dutch")
+ model = Wav2Vec2ForCTC.from_pretrained("simonsr/wav2vec2-large-xlsr-dutch")
+
+ # Resample from 48 kHz (the rate this script assumes for the clips) to the 16 kHz the model expects.
+ resampler = torchaudio.transforms.Resample(48_000, 16_000)
+
+ # Preprocessing the datasets.
+ # We need to read the audio files as arrays.
+ def speech_file_to_array_fn(batch):
+     speech_array, sampling_rate = torchaudio.load(batch["path"])
+     batch["speech"] = resampler(speech_array).squeeze().numpy()
+     return batch
+
+ test_dataset = test_dataset.map(speech_file_to_array_fn)
+ inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
+
+ with torch.no_grad():
+     logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
+
+ # Greedy decoding: pick the most likely token at every time step.
+ predicted_ids = torch.argmax(logits, dim=-1)
+
+ print("Prediction:", processor.batch_decode(predicted_ids))
+ print("Reference:", test_dataset["sentence"][:2])
+ ```
+
+
+ ## Evaluation
+
+ The model can be evaluated as follows on the Dutch test data of Common Voice.
+
+ ```python
+ import re
+
+ import torch
+ import torchaudio
+ import unidecode
+ from datasets import load_dataset, load_metric
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+
+ test_dataset = load_dataset("common_voice", "nl", split="test")
+ wer = load_metric("wer")
+
+ processor = Wav2Vec2Processor.from_pretrained("simonsr/wav2vec2-large-xlsr-dutch")
+ model = Wav2Vec2ForCTC.from_pretrained("simonsr/wav2vec2-large-xlsr-dutch")
+ model.to("cuda")
+
+ # Punctuation and symbols removed from the reference transcripts before scoring.
+ chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�\(\)\=\´\–\&\…\—\’]'
+ resampler = torchaudio.transforms.Resample(48_000, 16_000)
+
+ # Preprocessing the datasets.
+ # We need to normalize the transcripts and read the audio files as arrays.
+ def speech_file_to_array_fn(batch):
+     batch["sentence"] = unidecode.unidecode(batch["sentence"])
+     batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
+     speech_array, sampling_rate = torchaudio.load(batch["path"])
+     batch["speech"] = resampler(speech_array).squeeze().numpy()
+     return batch
+
+ test_dataset = test_dataset.map(speech_file_to_array_fn)
+
+ # Run batched inference on the GPU and decode the predictions to text.
+ def evaluate(batch):
+     inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
+
+     with torch.no_grad():
+         logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
+
+     pred_ids = torch.argmax(logits, dim=-1)
+     batch["pred_strings"] = processor.batch_decode(pred_ids)
+     return batch
+
+ result = test_dataset.map(evaluate, batched=True, batch_size=8)
+
+ print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
+ ```
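
Editor's note: to make the reported number easier to interpret, here is a minimal sketch of the standard word error rate definition, written in plain Python. It is not the implementation behind `load_metric("wer")`, whose internals may differ in details; the function names `edit_distance` and `word_error_rate` and the Dutch example sentences are hypothetical.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word missing from the prediction
                current[j - 1] + 1,      # extra word in the prediction
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_error_rate(references, predictions):
    """Total word edits divided by the total number of reference words."""
    total_edits = 0
    total_words = 0
    for reference, prediction in zip(references, predictions):
        ref_words = reference.split()
        total_edits += edit_distance(ref_words, prediction.split())
        total_words += len(ref_words)
    return total_edits / total_words


if __name__ == "__main__":
    references = ["de kat zit op de mat", "goede morgen"]
    predictions = ["de kat zit op mat", "goede morgen"]
    # One dropped word out of eight reference words.
    print("WER: {:.2f}".format(100 * word_error_rate(references, predictions)))  # WER: 12.50
```

Because the score counts every word-level mismatch, the transcript normalization in the evaluation script (lowercasing and stripping punctuation) directly affects the result.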
+
+ **Test Result**: XX.XX % # TODO: write the output of the print statement here. IMPORTANT: remember to also replace {wer_result_on_test} in the metadata at the top of this file with this value.
+
+
+ ## Training
+
+ The Common Voice `train`, `validation`, and ... datasets were used for training.
+
+ The script used for training can be found [here](...) # TODO: fill in a link to your training script here. If you trained your model in a Colab, simply fill in the link here. If you trained the model locally, it would be great if you could upload the training script to GitHub and paste the link here.