cobrayyxx/whisper-small-indo-eng

@@ -14,23 +14,24 @@ should probably proofread and complete it, then remove this comment. -->
 # whisper-small-indo-eng
-This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on an unknown dataset.
 ## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
-### Training hyperparameters
 The following hyperparameters were used during training:
 - learning_rate: 1e-05
@@ -40,16 +41,93 @@ The following hyperparameters were used during training:
 - optimizer: Use adamw_torch with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
 - lr_scheduler_type: linear
 - lr_scheduler_warmup_steps: 500
-- training_steps: 100
 - mixed_precision_training: Native AMP
-### Training results
-### Framework versions
 - Transformers 4.46.3
 - Pytorch 2.5.1+cu121
 - Datasets 3.2.0
 - Tokenizers 0.20.3

 # whisper-small-indo-eng
 ## Model description
+This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the [cobrayyxx/FLEURS_INDO-ENG_Speech_Translation](https://huggingface.co/datasets/cobrayyxx/FLEURS_INDO-ENG_Speech_Translation) dataset.
+## Dataset: FLEURS_INDO-ENG_Speech_Translation
+This model was fine-tuned using the `cobrayyxx/FLEURS_INDO-ENG_Speech_Translation` dataset, a speech translation dataset for the **Indonesian ↔ English** language pair. The dataset is part of the FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) collection and is specifically designed for speech-to-text translation tasks.
+### Key Features:
+- **audio**: Audio clip in Bahasa/Indonesian.
+- **text_indo**: Transcription of the audio in Bahasa/Indonesian.
+- **text_en**: English translation of the audio.
+### Dataset Usage
+- **Training Data**: Used to fine-tune the Whisper model for Indonesian → English speech-to-text translation.
+- **Validation Data**: Used to evaluate the performance of the model during training.
+## Training hyperparameters
 The following hyperparameters were used during training:
 - learning_rate: 1e-05
 - optimizer: Use adamw_torch with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
 - lr_scheduler_type: linear
 - lr_scheduler_warmup_steps: 500
+- training_steps (epoch): 100
 - mixed_precision_training: Native AMP
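The `linear` scheduler with 500 warmup steps listed above is commonly implemented as a linear ramp from 0 to the peak learning rate, followed by a linear decay back to 0. The sketch below illustrates that shape in plain Python. It is not the training code for this model, and `total_steps=4000` is a hypothetical value chosen only for illustration.

```python
def linear_schedule_lr(step, base_lr=1e-5, warmup_steps=500, total_steps=4000):
    """Learning rate at `step` for linear warmup followed by linear decay.

    base_lr and warmup_steps match the hyperparameters listed above;
    total_steps is a hypothetical value, not taken from the model card.
    """
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to base_lr
        return base_lr * step / warmup_steps
    # Decay phase: fall linearly from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    for step in (0, 250, 500, 2250, 4000):
        print(step, linear_schedule_lr(step))
```

Note that if a run stops before `warmup_steps` is reached, the learning rate under this schedule never reaches its peak value.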
+## Model Evaluation
+The performance of the baseline and fine-tuned models was evaluated using the BLEU and CHRF metrics on the validation dataset.
+The fine-tuned model improves on the baseline by 1.79 BLEU (33.03 → 34.82) and 8.74 CHRF (52.71 → 61.45).
+| Model           | BLEU Score | CHRF Score |
+|------------------|------------|------------|
+| Baseline Model   | **33.03**  | **52.71**  |
+| Fine-Tuned Model | **34.82**  | **61.45**  |
+### Evaluation Details
+- **BLEU**: Measures the overlap between predicted and reference text based on n-grams.
+- **CHRF**: Uses character n-grams for evaluation, making it particularly suitable for morphologically rich languages.
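To make the character n-gram idea concrete, here is a minimal sentence-level sketch in plain Python. It is a simplified illustration, not the sacrebleu implementation used for the scores above, so its numbers will not match sacrebleu exactly. The function names, the maximum n-gram order of 6, and `beta=2` are assumptions made for this example.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of length n, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified chrF-style score on a 0-100 scale (illustrative only)."""
    f_scores = []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the texts is too short for this n-gram order
        # Clipped overlap: each n-gram counts at most as often as in the reference
        overlap = sum((hyp & ref).values())
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            f_scores.append(0.0)
            continue
        # F-beta score: beta > 1 weights recall more heavily than precision
        f_scores.append(
            (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
        )
    return 100 * sum(f_scores) / len(f_scores) if f_scores else 0.0


if __name__ == "__main__":
    print(simple_chrf("the cat sat on the mat", "the cat sits on the mat"))
```

Because the score is built from characters rather than whole words, a prediction that gets a word stem right but the ending wrong still earns partial credit, whereas word-level n-gram matching would count the whole word as a miss.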
+### Reproduce Steps
+After [training](https://huggingface.co/blog/fine-tune-whisper) the model and pushing it to the Hugging Face Hub, a few more steps are needed before it can be evaluated:
+1. Push the tokenizer manually by creating it from `WhisperTokenizerFast`.
+    ```
+    from transformers import WhisperTokenizerFast
+    # Create the tokenizer from the base checkpoint, configured for translation into English
+    tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-small", language="en", task="translate")
+    # Save the tokenizer locally
+    tokenizer.save_pretrained("whisper-small-indo-eng", legacy_format=False)
+    # Push the tokenizer to the Hugging Face Hub
+    tokenizer.push_to_hub("cobrayyxx/whisper-small-indo-eng")
+    ```
+2. Convert the model from the Transformers format to the CTranslate2 format (source: https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file#model-conversion).
+    ```
+    !ct2-transformers-converter --model cobrayyxx/whisper-small-indo-eng --output_dir cobrayyxx/whisper-small-indo-eng-ct2 --copy_files tokenizer.json preprocessor_config.json --quantization float16
+    ```
+3. Load the converted model with faster-whisper's `WhisperModel`. Here the converted model is `cobrayyxx/whisper-small-indo-eng-ct2`.
+4. Run the evaluation, using faster-whisper to load the model and sacrebleu to compute the metrics.
+    ```
+    import numpy as np
+    from faster_whisper import WhisperModel
+    from sacrebleu.metrics import BLEU, CHRF
+    from tqdm import tqdm
+
+    # Converted CTranslate2 model from step 2; load it once and reuse it for every clip
+    model_name = "cobrayyxx/whisper-small-indo-eng-ct2"
+    model = WhisperModel(model_name, device="cuda", compute_type="float16")
+
+    def predict(audio_array):
+        # language="en" matches the language set on the tokenizer in step 1
+        segments, info = model.transcribe(audio_array,
+                                          beam_size=5,
+                                          language="en",
+                                          vad_filter=True
+                                          )
+        return segments, info
+
+    def metric_calculation(dataset):
+        val_data = dataset["validation"]
+        bleu = BLEU()
+        chrf = CHRF()
+        lst_pred = []
+        lst_gold = []
+        for data in tqdm(val_data):
+            gold_standard = data["text_en"]
+            gold_standard = gold_standard.lower().strip()
+            audio_array = data["audio"]["array"]
+            # Ensure it's 1D
+            audio_array = np.ravel(audio_array)
+            # Convert to float32 if necessary
+            audio_array = audio_array.astype(np.float32)
+            pred_segments, pred_info = predict(audio_array)
+            prediction_text = " ".join(segment.text for segment in pred_segments).lower().strip()
+            lst_pred.append(prediction_text)
+            lst_gold.append(gold_standard)
+        # sacrebleu expects a list of reference streams, each aligned with the predictions
+        bleu_score = bleu.corpus_score(lst_pred, [lst_gold]).score
+        chrf_score = chrf.corpus_score(lst_pred, [lst_gold]).score
+        return bleu_score, chrf_score
+    ```
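+    Now run the evaluation.
+    ```
+    from datasets import load_dataset
+    # Dataset used for fine-tuning (see the Dataset section)
+    fleurs_dataset = load_dataset("cobrayyxx/FLEURS_INDO-ENG_Speech_Translation")
+    pretrain_bleu_score, pretrain_chrf_score = metric_calculation(fleurs_dataset)
+    pretrain_bleu_score, pretrain_chrf_score
+    ```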
+## Framework versions
 - Transformers 4.46.3
 - Pytorch 2.5.1+cu121
 - Datasets 3.2.0
 - Tokenizers 0.20.3
+## Reference
+- https://huggingface.co/blog/fine-tune-whisper
+## Credits
+Huge thanks to [Yasmin Moslem](https://huggingface.co/ymoslem) for mentoring me.