---
library_name: transformers
license: apache-2.0
base_model: openai/whisper-small
tags:
- generated_from_trainer
model-index:
- name: whisper-small-indo-eng
  results: []
---

# whisper-small-indo-eng

## Model description

This model is a fine-tuned version of openai/whisper-small on the cobrayyxx/FLEURS_INDO-ENG_Speech_Translation dataset.
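For a quick check of the model before the CTranslate2 conversion described below, it can be loaded with the standard `transformers` Whisper classes. This is a minimal sketch, assuming `audio_array` is a 16 kHz mono waveform such as one clip from the dataset's `audio` column; if the processor files were not pushed with the fine-tuned model, loading the processor from `openai/whisper-small` works as well:

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load the fine-tuned checkpoint and its processor from the Hub
processor = WhisperProcessor.from_pretrained("cobrayyxx/whisper-small-indo-eng")
model = WhisperForConditionalGeneration.from_pretrained("cobrayyxx/whisper-small-indo-eng")

# `audio_array` is assumed to be a 16 kHz mono numpy waveform
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")

# Generate the English translation of the Indonesian speech
predicted_ids = model.generate(inputs.input_features, language="en", task="translate")
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```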

## Dataset: FLEURS_INDO-ENG_Speech_Translation

This model was fine-tuned using the cobrayyxx/FLEURS_INDO-ENG_Speech_Translation dataset, a speech translation dataset for the Indonesian ↔ English language pair. The dataset is part of the FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) collection and is specifically designed for speech-to-text translation tasks.

**Key Features:**

- `audio`: Audio clip in Indonesian (Bahasa Indonesia).
- `text_indo`: Transcription of the audio in Indonesian.
- `text_en`: English translation of the audio.

### Dataset Usage

- **Training data:** used to fine-tune the Whisper model for Indonesian → English speech-to-text translation.
- **Validation data:** used to evaluate the performance of the model during training.
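A sketch of loading and inspecting the dataset with the `datasets` library (the `validation` split name comes from the evaluation code below; the column names are the ones listed above):

```python
from datasets import load_dataset

# Load the speech translation dataset from the Hub
fleurs_dataset = load_dataset("cobrayyxx/FLEURS_INDO-ENG_Speech_Translation")

# Inspect one validation example
sample = fleurs_dataset["validation"][0]
print(sample["audio"]["sampling_rate"])  # FLEURS audio is 16 kHz
print(sample["text_indo"])               # Indonesian transcription
print(sample["text_en"])                 # English translation
```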

## Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 100
- mixed_precision_training: Native AMP
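For reference, a hedged sketch of how these settings map onto `Seq2SeqTrainingArguments`; the `output_dir` is illustrative, not necessarily the one used for the actual run:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-indo-eng",  # illustrative placeholder
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",          # AdamW with default betas=(0.9, 0.999), eps=1e-8
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=100,
    fp16=True,                    # "Native AMP" mixed precision
)
```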

## Model Evaluation

The baseline and fine-tuned models were evaluated with the BLEU and chrF metrics on the validation split. The fine-tuned model improves over the baseline on both metrics, most noticeably on chrF.

| Model            | BLEU  | chrF  |
|------------------|-------|-------|
| Baseline Model   | 33.03 | 52.71 |
| Fine-Tuned Model | 34.82 | 61.45 |

### Evaluation Details

- **BLEU**: measures the overlap between predicted and reference text based on n-grams.
- **chrF**: uses character n-grams for evaluation, making it particularly suitable for morphologically rich languages.
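Both metrics come from `sacrebleu`. A toy example of the corpus-level API used in the evaluation code below (note that `sacrebleu` takes a list of reference *sets*, each set holding one reference per hypothesis):

```python
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["the weather is nice today", "he arrived late"]
references = [["the weather is good today", "he arrived too late"]]  # one reference set

print(BLEU().corpus_score(hypotheses, references).score)
print(CHRF().corpus_score(hypotheses, references).score)
```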

## Reproduction Steps

After training and pushing the fine-tuned model to the Hugging Face Hub, a few more steps are needed before it can be evaluated:

1. Push the tokenizer manually by creating it from `WhisperTokenizerFast`:

   ```python
   from transformers import WhisperTokenizerFast

   # Build the tokenizer with the language/task settings used in fine-tuning
   tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-small", language="en", task="translate")

   # Save the tokenizer locally
   tokenizer.save_pretrained("whisper-small-indo-eng", legacy_format=False)

   # Push the tokenizer to the Hugging Face Hub
   tokenizer.push_to_hub("cobrayyxx/whisper-small-indo-eng")
   ```
    
2. Convert the Transformers-compatible model to a CTranslate2-compatible model (see https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file#model-conversion):

   ```bash
   ct2-transformers-converter --model cobrayyxx/whisper-small-indo-eng \
       --output_dir cobrayyxx/whisper-small-indo-eng-ct2 \
       --copy_files tokenizer.json preprocessor_config.json \
       --quantization float16
   ```
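   `--quantization float16` matches the `compute_type="float16"` used at inference below and assumes a CUDA GPU; CTranslate2 also supports `int8` for CPU-only setups, in which case `compute_type` should be changed to match (an assumption about your hardware, not part of the original recipe).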
    
3. Load the converted model with faster-whisper's `WhisperModel`, in this case `cobrayyxx/whisper-small-indo-eng-ct2`.
4. Run the evaluation, using `faster-whisper` to load the model and `sacrebleu` for the metrics:

   ```python
   import numpy as np
   from faster_whisper import WhisperModel
   from sacrebleu.metrics import BLEU, CHRF
   from tqdm import tqdm

   # Load the converted fine-tuned model once, outside the prediction loop
   model = WhisperModel("cobrayyxx/whisper-small-indo-eng-ct2", device="cuda", compute_type="float16")

   def predict(audio_array):
       # Translate one Indonesian audio clip into English text segments
       segments, info = model.transcribe(audio_array,
                                         beam_size=5,
                                         language="en",
                                         vad_filter=True)
       return segments, info

   def metric_calculation(dataset):
       val_data = dataset["validation"]
       bleu = BLEU()
       chrf = CHRF()
       lst_pred = []
       lst_gold = []
       for data in tqdm(val_data):
           gold_standard = data["text_en"].lower().strip()
           # Ensure the waveform is a 1D float32 array
           audio_array = np.ravel(data["audio"]["array"]).astype(np.float32)
           pred_segments, pred_info = predict(audio_array)
           prediction_text = " ".join(segment.text for segment in pred_segments).lower().strip()
           lst_pred.append(prediction_text)
           lst_gold.append(gold_standard)
       # sacrebleu expects a list of reference sets, hence the extra nesting
       bleu_score = bleu.corpus_score(lst_pred, [lst_gold]).score
       chrf_score = chrf.corpus_score(lst_pred, [lst_gold]).score
       return bleu_score, chrf_score
   ```

   Now run the evaluation:

   ```python
   finetuned_bleu_score, finetuned_chrf_score = metric_calculation(fleurs_dataset)
   print(finetuned_bleu_score, finetuned_chrf_score)
   ```
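To reproduce the baseline row of the evaluation table, the same `metric_calculation` can be run with the stock Whisper small weights; `faster-whisper` accepts size names directly and downloads a pre-converted model. This sketch assumes the baseline in the table above was vanilla Whisper small:

```python
# Baseline: the original multilingual Whisper small, auto-downloaded by faster-whisper.
# predict() uses the module-level `model`, so reassigning it switches the model evaluated.
model = WhisperModel("small", device="cuda", compute_type="float16")
baseline_bleu_score, baseline_chrf_score = metric_calculation(fleurs_dataset)
```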

## Framework versions

- Transformers 4.46.3
- PyTorch 2.5.1+cu121
- Datasets 3.2.0
- Tokenizers 0.20.3
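To recreate this environment together with the evaluation dependencies used above, something like the following should work (`faster-whisper` and `sacrebleu` versions are not pinned in the original card, so they are left unpinned here; the CUDA-specific PyTorch build may need the appropriate `--index-url`):

```bash
pip install transformers==4.46.3 datasets==3.2.0 tokenizers==0.20.3 torch==2.5.1
pip install faster-whisper sacrebleu
```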


## Credits

Huge thanks to Yasmin Moslem for mentoring me.