---
language:
- ur
library_name: nemo
datasets:
- mozilla-foundation/common_voice_12_0
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- FastConformer
- Conformer
- pytorch
- NeMo
license: cc-by-4.0
widget:
- Title: Common Voice Urdu Sample
  src: https://cdn-media.huggingface.co/speech_samples/sample_urdu.flac
model-index:
- name: parakeet-rnnt-0.6b-urdu
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Mozilla Common Voice 12.0 (Urdu)
      type: mozilla-foundation/common_voice_12_0
      split: test
      args:
        language: ur
    metrics:
    - name: Test WER
      type: wer
      value: 25.513
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---
# Fine-Tuned Parakeet RNNT 0.6B (Urdu)
This repository contains a fine-tuned version of the **Parakeet RNNT 0.6B** model for **Urdu** Automatic Speech Recognition (ASR). The base model, developed by **NVIDIA NeMo** and **Suno.ai**, was fine-tuned on the Urdu subset of Mozilla's Common Voice 12.0 so that it can transcribe Urdu speech to text.
---
## Model Overview
**Parakeet RNNT** is the XL version of the FastConformer Transducer, with about **600 million parameters**, built for ASR. This fine-tuned model transcribes Urdu and can be used for applications such as subtitling, speech analytics, and voice-assisted interfaces.
Base model details can be found on 🤗 [Hugging Face](https://huggingface.co/nvidia/parakeet-rnnt-0.6b).
---
## Training Details
### Dataset
Fine-tuning used the **Urdu subset** of Mozilla's [Common Voice 12.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0), which provides diverse Urdu speech samples.
### Hardware
- **Google Colab Pro**
- **NVIDIA A100 GPU**
---
## Results
The model achieved a **Word Error Rate (WER)** of **25.513%** on the test split of the Common Voice Urdu dataset. Some transcriptions match the reference exactly, and in others the errors are small spelling or word-boundary differences, as the two examples below show:

- **Reference**: کچھ بھی ہو سکتا ہے۔
  **Predicted**: کچھ بھی ہو سکتا ہے۔
- **Reference**: اورکوئی جمہوریت کو کوس رہا ہے۔
  **Predicted**: اور کوئ جمہوریت کو کو س رہا ہے۔

This WER is slightly higher than that of OpenAI's **Whisper model**, which is reported to achieve **23%** without fine-tuning ([reference](https://arxiv.org/html/2409.11252v1)). The small gap suggests that Parakeet RNNT could improve with further fine-tuning.
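
For readers unfamiliar with the metric, the following is a minimal sketch of how a corpus-level WER can be computed: the word-level edit distance (substitutions, insertions, deletions) summed over all utterances, divided by the total number of reference words. The function names `edit_distance` and `corpus_wer` are illustrative, and the sketch assumes plain whitespace tokenization with no text normalization. The evaluation script behind the reported 25.513% is not described here, so this code may not reproduce that figure exactly.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra predicted word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def corpus_wer(references, predictions):
    """Corpus-level WER in percent: total word errors / total reference words."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, predictions):
        # Assumption: whitespace tokenization, no punctuation stripping.
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += edit_distance(ref_words, hyp_words)
        total += len(ref_words)
    if total == 0:
        raise ValueError("references contain no words")
    return 100.0 * errors / total


# The two example pairs quoted in this section.
references = ["کچھ بھی ہو سکتا ہے۔", "اورکوئی جمہوریت کو کوس رہا ہے۔"]
predictions = ["کچھ بھی ہو سکتا ہے۔", "اور کوئ جمہوریت کو کو س رہا ہے۔"]
print(f"WER on the two examples: {corpus_wer(references, predictions):.2f}%")
```

Because a split word counts as both a substitution and an insertion, word-boundary differences like those in the second example are penalized heavily even when most characters are correct.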
---
## How to Use this Model
### Loading the Model
You can load the fine-tuned model using NVIDIA NeMo:
```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="hash2004/parakeet-fine-tuned-urdu")
```
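
Once the model is loaded, audio files can be transcribed with the model's `transcribe` method. The sketch below wraps loading and transcription in one function. It rests on a few assumptions: the input is 16 kHz mono WAV audio (as for the base model), and the return value of `transcribe` varies between NeMo releases (plain strings, hypothesis objects with a `.text` attribute, or a tuple whose first element holds the best hypotheses). The helper names `hypotheses_to_text` and `transcribe_urdu` are hypothetical and not part of NeMo.

```python
def hypotheses_to_text(result):
    """Normalize the output of `transcribe` to a list of strings."""
    # Assumption: some NeMo releases return a (best, all) tuple for RNNT models.
    if isinstance(result, tuple):
        result = result[0]
    # Assumption: each entry is a plain string or an object with a `.text` attribute.
    return [item if isinstance(item, str) else item.text for item in result]


def transcribe_urdu(audio_paths, model_name="hash2004/parakeet-fine-tuned-urdu"):
    """Load the fine-tuned model and transcribe a list of audio file paths."""
    # Imported lazily so the helper above can be used without NeMo installed.
    import nemo.collections.asr as nemo_asr

    asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name=model_name)
    return hypotheses_to_text(asr_model.transcribe(audio_paths))
```

For example, `transcribe_urdu(["sample_urdu.wav"])` would return a one-element list with the transcription, where `sample_urdu.wav` stands in for a local file of your own.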
## How to Fine-Tune this Model
All resources for fine-tuning the Parakeet RNNT (0.6B) model are available in [this GitHub Repository](https://github.com/hash2004/conformer-fine-tuned-urdu).