hash2004/parakeet-fine-tuned-urdu

Automatic Speech Recognition


hash2004 committed on Nov 30, 2024

Commit 220e6bc · verified · 1 Parent(s): 52bb174

Update README.md

Files changed (1)

README.md +92 -3

README.md CHANGED

@@ -1,3 +1,92 @@
----
-license: mit
----

+---
+language:
+- ur
+library_name: nemo
+datasets:
+- mozilla-foundation/common_voice_12_0
+thumbnail: null
+tags:
+- automatic-speech-recognition
+- speech
+- audio
+- Transducer
+- FastConformer
+- Conformer
+- pytorch
+- NeMo
+license: cc-by-4.0
+widget:
+- Title: Common Voice Urdu Sample
+  src: https://cdn-media.huggingface.co/speech_samples/sample_urdu.flac
+model-index:
+- name: parakeet-rnnt-0.6b-urdu
+  results:
+  - task:
+      name: Automatic Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: Mozilla Common Voice 12.0 (Urdu)
+      type: mozilla-foundation/common_voice_12_0
+      split: test
+      args:
+        language: ur
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 25.513
+metrics:
+- wer
+pipeline_tag: automatic-speech-recognition
+---
+# Fine-Tuned Parakeet RNNT 0.6B (Urdu)
+This repository contains a fine-tuned version of the **Parakeet RNNT 0.6B** model for **Urdu** Automatic Speech Recognition (ASR). The base model, developed by **NVIDIA NeMo** and **Suno.ai**, was fine-tuned on the Urdu subset of Mozilla's Common Voice 12.0 so that it can transcribe Urdu speech to text.
+---
+## Model Overview
+The **Parakeet RNNT** is an XL version of the FastConformer Transducer with **600 million parameters**, optimized for ASR tasks. The fine-tuned model supports Urdu transcription, enabling applications such as subtitling, speech analytics, and voice-assisted interfaces.
+Base model details can be found on 🤗 [Hugging Face](https://huggingface.co/nvidia/parakeet-rnnt-0.6b).
+---
+## Training Details
+### Dataset
+Fine-tuning used the **Urdu subset** of Mozilla's [Common Voice 12.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0), a crowd-sourced collection of read speech recorded by volunteer speakers.
+### Hardware
+- **Google Colab Pro**
+- **NVIDIA A100 GPU**
+- Fine-tuning duration: **5 hours**
+- GPU utilization: ~25%
+---
+## Results
+The model achieved a **Word Error Rate (WER)** of **25.513%** on the test split of the Common Voice Urdu dataset. Although this is a relatively high error rate, many transcriptions are exact or differ from the reference only in word spacing and spelling, as in these two examples:
+- **Reference**: کچھ بھی ہو سکتا ہے۔
+  **Predicted**: کچھ بھی ہو سکتا ہے۔
+- **Reference**: اورکوئی جمہوریت کو کوس رہا ہے۔
+  **Predicted**: اور کوئ جمہوریت کو  کو س رہا ہے۔
+This WER is slightly higher than that of OpenAI's **Whisper model**, which achieved **23%** without fine-tuning ([reference](https://arxiv.org/html/2409.11252v1)), but it demonstrates the potential of the Parakeet RNNT with further fine-tuning.
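For readers unfamiliar with the metric, the sketch below shows one common way to compute a corpus-level WER: the total word-level edit distance (substitutions, insertions, deletions) divided by the total number of reference words. It is an illustration only and is not part of the committed README. The function names `word_edit_distance` and `corpus_wer` are hypothetical, words are split on whitespace with no text normalization, and the reported 25.513% may have been computed with different tooling or preprocessing.

```python
from typing import Iterable, List, Tuple


def word_edit_distance(ref_words: List[str], hyp_words: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Total word errors divided by total reference words.

    Assumes whitespace tokenization and no normalization (an assumption
    made for this sketch, not a description of the author's evaluation).
    """
    errors = 0
    ref_word_count = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        errors += word_edit_distance(ref_words, hypothesis.split())
        ref_word_count += len(ref_words)
    if ref_word_count == 0:
        raise ValueError("references must contain at least one word")
    return errors / ref_word_count


# The two reference/prediction pairs quoted in the Results section.
examples = [
    ("کچھ بھی ہو سکتا ہے۔", "کچھ بھی ہو سکتا ہے۔"),
    ("اورکوئی جمہوریت کو کوس رہا ہے۔", "اور کوئ جمہوریت کو  کو س رہا ہے۔"),
]
print(f"WER over the two examples: {corpus_wer(examples):.3f}")
```

Because the metric counts whole words, a prediction that merely splits one reference word in two is charged for both a substitution and an insertion, which is part of why spacing differences like those in the second example weigh heavily on WER.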
+---
+## How to Use this Model
+### Loading the Model
+You can load the fine-tuned model using NVIDIA NeMo:
+```python
+import nemo.collections.asr as nemo_asr
+asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="hash2004/parakeet-fine-tuned-urdu")