---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_17_0
- mozilla-foundation/common_voice_13_0
language:
- hi
metrics:
- wer
base_model:
- theainerd/Wav2Vec2-large-xlsr-hindi
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# Model Improvement

This model card highlights the improvement over the base model: fine-tuning on Hindi speech data reduced the WER from 72% to 54%.

# Wav2Vec2-Large-XLSR-Hindi-Finetuned - Yash_Ratnaker

This model is a fine-tuned version of [theainerd/Wav2Vec2-large-xlsr-hindi](https://huggingface.co/theainerd/Wav2Vec2-large-xlsr-hindi) on the Common Voice 13 and 17 datasets. It is optimized for Hindi speech recognition, achieving a **Word Error Rate (WER) of 54%**, compared to the base model's WER of 72% on the same dataset.

## Model description

This Wav2Vec2 model, originally developed by Facebook AI, utilizes self-supervised learning on large unlabeled speech datasets and is then fine-tuned on labeled data. This approach enables the model to learn intricate linguistic features and transcribe speech in Hindi with high accuracy. Fine-tuning on Common Voice Hindi data allows the model to better capture the language's nuances, improving transcription quality.

## Intended uses & limitations

This model is ideal for automatic speech recognition (ASR) applications in Hindi, such as media transcription, accessibility services, and educational content transcription, where audio quality is controlled. See the limitations noted below.

## Usage

The model can be used directly (without a language model) as follows:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the Hindi Common Voice test set.
# Note: the legacy "common_voice" loading script is deprecated; with recent
# versions of `datasets` you may need the gated
# "mozilla-foundation/common_voice_17_0" dataset instead.
test_dataset = load_dataset("common_voice", "hi", split="test[:2%]")

# Load the processor and model
processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model = Wav2Vec2ForCTC.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")

# Common Voice clips are 48 kHz; the model expects 16 kHz input
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Decode each audio file into a 16 kHz float array
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

# Perform inference
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```

## Evaluation

The model can be evaluated as follows on the Hindi test data of Common Voice:

```python
import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the dataset and the WER metric
test_dataset = load_dataset("common_voice", "hi", split="test")
wer = load_metric("wer")

# Initialize processor and model
processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model = Wav2Vec2ForCTC.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model.to("cuda")

resampler = torchaudio.transforms.Resample(48_000, 16_000)
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“]'

# Preprocess: strip punctuation, lowercase the reference, resample audio to 16 kHz
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Batched inference over the test set
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```

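`load_metric` is deprecated in recent `datasets` releases. With a newer environment, the same WER metric can be loaded from the separate `evaluate` package instead:

```python
import evaluate

# Drop-in replacement for datasets.load_metric("wer")
wer = evaluate.load("wer")
```
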
### Limitations

- The model may face challenges with dialectal or regional variations within Hindi.
- Performance can degrade with noisy audio or overlapping speech.
- It is not intended for real-time transcription due to latency considerations.

## Training and evaluation data

The model was fine-tuned on the Hindi portions of the Common Voice 13 and 17 datasets, which contain speech samples from native Hindi speakers. This data captures a range of accents, pronunciations, and recording conditions, enhancing the model's ability to generalize across different speech patterns. Evaluation was performed on the Common Voice Hindi test split, as in the evaluation script above.

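A minimal sketch of how the Hindi splits from the two Common Voice releases could be combined for fine-tuning. The dataset names and split mix are assumptions based on the card metadata, not the exact preparation used; both datasets are gated and require accepting their terms on the Hub:

```python
from datasets import concatenate_datasets, load_dataset

# Gated datasets: accept their terms on the Hub and authenticate first.
cv13 = load_dataset("mozilla-foundation/common_voice_13_0", "hi", split="train+validation")
cv17 = load_dataset("mozilla-foundation/common_voice_17_0", "hi", split="train+validation")

# Combine both releases into a single fine-tuning set (assumed preparation).
train_dataset = concatenate_datasets([cv13, cv17])
```
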
## Training procedure

### Hyperparameters and setup

The following hyperparameters were used during training (a `TrainingArguments` sketch reconstructing them follows the list):

- **Learning rate**: 1e-4
- **Batch size**: 16 (per device)
- **Gradient accumulation steps**: 2
- **Evaluation strategy**: steps
- **Max steps**: 2500
- **Mixed precision**: FP16
- **Save steps**: 500
- **Evaluation steps**: 500
- **Logging steps**: 500
- **Warmup steps**: 500
- **Save total limit**: 1

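A minimal `TrainingArguments` sketch assembling the values above; `output_dir` and any argument not in the list are assumptions, not the exact configuration used:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-large-xlsr-hindi",  # assumed; not documented in the card
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    max_steps=2500,
    fp16=True,
    save_steps=500,
    eval_steps=500,
    logging_steps=500,
    warmup_steps=500,
    save_total_limit=1,
)
```
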
### Training output

- **Global step**: 2500
- **Training runtime**: approximately 1 hour 21 minutes
- **Epochs**: 5-6

### Training results

| Step | Training Loss | Validation Loss | WER    |
|------|---------------|-----------------|--------|
| 500  | 5.603000      | 0.987691        | 0.7556 |
| 1000 | 0.720300      | 0.667561        | 0.6196 |
| 1500 | 0.507000      | 0.592814        | 0.5844 |
| 2000 | 0.431100      | 0.549786        | 0.5439 |
| 2500 | 0.395600      | 0.537703        | 0.5428 |

### Framework versions

- Transformers: 4.42.4
- PyTorch: 2.3.1+cu121
- Datasets: 2.20.0
- Tokenizers: 0.19.1