yash072
/

wav2vec2-large-XLSR-Hindi-YashR

@@ -1,15 +1,158 @@
 ---
 license: apache-2.0
 datasets:
-- mozilla-foundation/common_voice_13_0
 - mozilla-foundation/common_voice_17_0
 language:
 - hi
 metrics:
 - wer
 base_model:
 - theainerd/Wav2Vec2-large-xlsr-hindi
-new_version: yash072/wav2vec2-large-xlsr-YashHindi-4
 pipeline_tag: automatic-speech-recognition
 library_name: transformers
----

 ---
 license: apache-2.0
 datasets:
 - mozilla-foundation/common_voice_17_0
+- mozilla-foundation/common_voice_13_0
 language:
 - hi
 metrics:
 - wer
 base_model:
 - theainerd/Wav2Vec2-large-xlsr-hindi
 pipeline_tag: automatic-speech-recognition
 library_name: transformers
+---
+# Model Improvement
+Fine-tuning reduced the word error rate (WER) from 72% for the base model to 54%, an absolute reduction of 18 percentage points (25% relative) on Hindi speech data.
+# Wav2Vec2-Large-XLSR-Hindi-Finetuned - Yash_Ratnaker
+This model is a fine-tuned version of [theainerd/Wav2Vec2-large-xlsr-hindi](https://huggingface.co/theainerd/Wav2Vec2-large-xlsr-hindi) on the Common Voice 13 and 17 datasets. It is specifically optimized for Hindi speech recognition, with a notable improvement in transcription accuracy, achieving a **Word Error Rate (WER) of 54%**, compared to the base model’s WER of 72% on the same dataset.
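For readers unfamiliar with the metric, WER is the number of word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The block below is a minimal pure-Python sketch of that definition, not the implementation used to produce the numbers in this card. The helper names `word_error_rate` and `relative_reduction` are illustrative and do not come from any library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate that was removed."""
    return (before - after) / before


# One deleted word and one substituted word over four reference words.
print(word_error_rate("यह एक परीक्षण है", "यह परीक्षण हैं"))
# Relative improvement implied by the 72% and 54% figures quoted above.
print(round(relative_reduction(0.72, 0.54), 4))
```

A WER of 54% therefore means that, on average, roughly one edit is needed for every two reference words.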
+## Model description
+Wav2Vec2, originally developed by Facebook AI, is pretrained with self-supervised learning on large unlabeled speech datasets and then fine-tuned on labeled data. Fine-tuning on Common Voice Hindi data adapts the model to the language and lowers the WER relative to the base checkpoint, although a WER of 54% still leaves a substantial number of transcription errors.
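The usage code in this card loads the model with `Wav2Vec2ForCTC`, takes an argmax over the logits, and decodes the resulting IDs. As a rough illustration of what greedy CTC decoding does with those per-frame IDs, here is a minimal sketch: collapse consecutive repeats, then drop the blank token. The function name `ctc_greedy_decode`, the toy vocabulary, and `blank_id=0` are assumptions made for this example; the real vocabulary and blank ID come from the model's tokenizer.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then remove CTC blanks."""
    # groupby merges runs of identical consecutive IDs into one entry.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[t] for t in collapsed if t != blank_id)


# Toy vocabulary (hypothetical): ID 0 plays the role of the CTC blank.
toy_vocab = {0: "<pad>", 1: "क", 2: "म", 3: "ल"}

# One predicted ID per audio frame, as an argmax over logits would produce.
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3]
print(ctc_greedy_decode(frames, toy_vocab))  # कमल
```

A blank between two identical IDs keeps them as separate characters, which is how CTC represents doubled letters.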
+## Intended uses & limitations
+This model is ideal for automatic speech recognition (ASR) applications in Hindi, such as media transcription, accessibility services, and educational content transcription, where audio quality is controlled.
+## Usage
+The model can be used directly (without a language model) as follows:
+```python
+import torch
+import torchaudio
+from datasets import load_dataset
+from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+# Load the Hindi Common Voice dataset
+test_dataset = load_dataset("common_voice", "hi", split="test[:2%]")
+# Load the processor and model
+processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
+model = Wav2Vec2ForCTC.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
+resampler = torchaudio.transforms.Resample(48_000, 16_000)
+# Function to process the dataset
+def speech_file_to_array_fn(batch):
+  speech_array, sampling_rate = torchaudio.load(batch["path"])
+  batch["speech"] = resampler(speech_array).squeeze().numpy()
+  return batch
+test_dataset = test_dataset.map(speech_file_to_array_fn)
+inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
+# Perform inference
+with torch.no_grad():
+  logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
+predicted_ids = torch.argmax(logits, dim=-1)
+print("Prediction:", processor.batch_decode(predicted_ids))
+print("Reference:", test_dataset["sentence"][:2])
+```
+## Evaluation
+The model can be evaluated as follows on the Hindi test data of Common Voice.
+```python
+import re
+import evaluate
+import torch
+import torchaudio
+from datasets import load_dataset
+from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+# Load the dataset and the WER metric
+# (datasets.load_metric is deprecated; the evaluate library provides the same metric)
+test_dataset = load_dataset("common_voice", "hi", split="test")
+wer = evaluate.load("wer")
+# Initialize processor and model
+processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
+model = Wav2Vec2ForCTC.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
+model.to("cuda")
+# Common Voice audio is assumed to be sampled at 48 kHz; the model expects 16 kHz
+resampler = torchaudio.transforms.Resample(48_000, 16_000)
+# Punctuation to strip from the reference sentences before scoring
+chars_to_ignore_regex = r'[,?.!\-;:"“]'
+# Function to preprocess the data
+def speech_file_to_array_fn(batch):
+  batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).lower()
+  speech_array, sampling_rate = torchaudio.load(batch["path"])
+  batch["speech"] = resampler(speech_array).squeeze().numpy()
+  return batch
+test_dataset = test_dataset.map(speech_file_to_array_fn)
+# Run the model on a batch and store the decoded predictions
+def predict(batch):
+  inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
+  with torch.no_grad():
+    logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
+  pred_ids = torch.argmax(logits, dim=-1)
+  batch["pred_strings"] = processor.batch_decode(pred_ids)
+  return batch
+result = test_dataset.map(predict, batched=True, batch_size=8)
+print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
+```
+### Limitations:
+- The model may face challenges with dialectal or regional variations within Hindi.
+- Performance can degrade with noisy audio or overlapping speech.
+- It is not intended for real-time transcription due to latency considerations.
+## Training and evaluation data
+The model was fine-tuned on the Hindi portions of the Common Voice 13 and 17 datasets, which contain speech samples from native Hindi speakers. This data captures a range of accents, pronunciations, and recording conditions, enhancing the model’s ability to generalize across different speech patterns. Evaluation was performed on a carefully curated subset, ensuring a reliable benchmark for ASR performance in Hindi.
+## Training procedure
+### Hyperparameters and setup:
+The following hyperparameters were used during training:
+- **Learning rate**: 1e-4
+- **Batch size**: 16 (per device)
+- **Gradient accumulation steps**: 2
+- **Evaluation strategy**: steps
+- **Max steps**: 2500
+- **Mixed precision**: FP16
+- **Save steps**: 500
+- **Evaluation steps**: 500
+- **Logging steps**: 500
+- **Warmup steps**: 500
+- **Save total limit**: 1
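The card lists a learning rate of 1e-4 with 500 warmup steps over 2500 total steps but does not name the scheduler. Assuming the Transformers `Trainer` default (linear warmup followed by linear decay to zero), the learning rate at each logged checkpoint can be sketched as follows. The function `linear_warmup_decay_lr` is a hypothetical helper written for this illustration, and the schedule shape is an assumption rather than something the card confirms.

```python
def linear_warmup_decay_lr(step, base_lr=1e-4, warmup_steps=500, max_steps=2500):
    """Learning rate under linear warmup, then linear decay to zero.

    Assumption: the run used a linear schedule; the card does not say so.
    """
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup period.
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr at the end of warmup to 0 at max_steps.
    remaining = max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))
    return base_lr * remaining


for step in (250, 500, 1000, 1500, 2000, 2500):
    print(step, f"{linear_warmup_decay_lr(step):.2e}")
```

Under this assumption, the first evaluation at step 500 coincides with the peak learning rate, and the rate reaches zero at the final step.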
+### Training output
+- **Global step**: 2500
+- **Training runtime**: Approximately 1 hour 21 minutes
+- **Epochs**: 5-6
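The batch size, gradient accumulation, step count, and epoch range above can be cross-checked with a little arithmetic. The sketch below assumes a single training device (`num_devices = 1`), which the card does not state; with more devices the totals scale accordingly.

```python
per_device_batch_size = 16
gradient_accumulation_steps = 2
max_steps = 2500
num_devices = 1  # assumption: the card does not state the device count

# Number of samples contributing to each optimizer update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

# Total samples processed over the whole run.
samples_processed = effective_batch_size * max_steps

print("Effective batch size:", effective_batch_size)
print("Samples processed:", samples_processed)

# Training-set size implied by the reported 5-6 epochs.
for epochs in (5, 6):
    print(f"Implied training-set size at {epochs} epochs: {samples_processed // epochs}")
```

Under the single-device assumption this implies a training set of roughly 13,000 to 16,000 utterances.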
+### Training results
+| Step | Training Loss | Validation Loss | WER    |
+|------|---------------|-----------------|--------|
+| 500  | 5.603000      | 0.987691       | 0.7556 |
+| 1000 | 0.720300      | 0.667561       | 0.6196 |
+| 1500 | 0.507000      | 0.592814       | 0.5844 |
+| 2000 | 0.431100      | 0.549786       | 0.5439 |
+| 2500 | 0.395600      | 0.537703       | 0.5428 |
+### Framework versions
+- Transformers: 4.42.4
+- PyTorch: 2.3.1+cu121
+- Datasets: 2.20.0
+- Tokenizers: 0.19.1