---
language: bn
datasets:
- OpenSLR
metrics:
- wer
tags:
- bn
- audio
- automatic-speech-recognition
- speech
license: cc-by-sa-4.0
model-index:
- name: XLSR Wav2Vec2 Bengali by Arijit
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: OpenSLR
      type: OpenSLR
      args: ben
    metrics:
    - name: Test WER
      type: wer
      value: 32.45
---
# Wav2Vec2-Large-XLSR-Bengali

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Bengali using a subset of 40,000 utterances from the [Bengali ASR training data set containing ~196K utterances](https://www.openslr.org/53/). WER was measured on ~4,200 utterances held out from training.

When using this model, make sure that your speech input is sampled at 16 kHz.

Training script: train.py

Data preparation notebook: https://colab.research.google.com/drive/1JMlZPU-DrezXjZ2t7sOVqn7CJjZhdK2q?usp=sharing

Inference notebook: https://colab.research.google.com/drive/1uKC2cK9JfUPDTUHbrNdOYqKtNozhxqgZ?usp=sharing
## Usage

The model can be used directly (without a language model) as follows:
```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("arijitx/wav2vec2-large-xlsr-bengali")
model = Wav2Vec2ForCTC.from_pretrained("arijitx/wav2vec2-large-xlsr-bengali")
# model = model.to("cuda")  # uncomment for GPU; also move inputs with .to("cuda")

def speech_file_to_array_fn(path):
    # Load the audio file and resample it to the 16 kHz rate the model expects.
    speech_array, sampling_rate = torchaudio.load(path)
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    return resampler(speech_array).squeeze().numpy()

speech_array = speech_file_to_array_fn("test_file.wav")
inputs = processor(speech_array, sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
preds = processor.batch_decode(predicted_ids)[0]
print(preds.replace("[PAD]", ""))
```
**Test Result**: WER on ~4,200 held-out utterances: 32.45%
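For reference, a WER of this kind can be computed with the `jiwer` package. Below is a minimal sketch reusing the helper from the usage example above; the file paths and reference transcripts are hypothetical placeholders, not the original evaluation set.

```python
from jiwer import wer

# Hypothetical held-out examples: (audio path, reference transcript) pairs.
test_set = [
    ("sample1.wav", "reference transcript 1"),
    ("sample2.wav", "reference transcript 2"),
]

references, hypotheses = [], []
for path, reference in test_set:
    # Transcribe each file with the model, as in the usage example.
    speech = speech_file_to_array_fn(path)
    inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    prediction = processor.batch_decode(torch.argmax(logits, dim=-1))[0]
    references.append(reference)
    hypotheses.append(prediction.replace("[PAD]", ""))

# jiwer averages the word error rate over the whole list of pairs.
print("WER: {:.2f}%".format(100 * wer(references, hypotheses)))
```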