whisper-small-indo-eng

Model description

This model is a fine-tuned version of openai/whisper-small on the cobrayyxx/FLEURS_INDO-ENG_Speech_Translation dataset.
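
For quick use, here is a minimal inference sketch with the Transformers pipeline API. The audio path is a placeholder, and passing task="translate" through generate_kwargs is an assumption about how the checkpoint is meant to be called; the reproduction steps below use faster-whisper instead.

    from transformers import pipeline

    # Load the fine-tuned checkpoint as a speech-recognition pipeline.
    asr = pipeline(
        "automatic-speech-recognition",
        model="cobrayyxx/whisper-small-indo-eng",
    )

    # "indonesian_clip.wav" is a placeholder path to an Indonesian speech clip.
    # task="translate" asks Whisper to emit English text.
    result = asr("indonesian_clip.wav", generate_kwargs={"task": "translate"})
    print(result["text"])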

Dataset: FLEURS_INDO-ENG_Speech_Translation

This model was fine-tuned using the cobrayyxx/FLEURS_INDO-ENG_Speech_Translation dataset, a speech translation dataset for the Indonesian ↔ English language pair. The dataset is part of the FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) collection and is specifically designed for speech-to-text translation tasks.

Key Features:

  • audio: Audio clip in Indonesian (Bahasa Indonesia).
  • text_indo: Transcription of the audio in Indonesian.
  • text_en: English translation of the transcription.
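
As a quick orientation, the dataset can be loaded with the datasets library. A minimal sketch; the "validation" split matches the evaluation code below, and the field names are those listed above:

    from datasets import load_dataset

    # Load the Indonesian-English speech translation dataset from the Hub.
    fleurs_dataset = load_dataset("cobrayyxx/FLEURS_INDO-ENG_Speech_Translation")

    # Inspect one validation example: an audio array plus paired transcripts.
    sample = fleurs_dataset["validation"][0]
    print(sample["text_indo"])             # Indonesian transcription
    print(sample["text_en"])               # English translation
    print(sample["audio"]["array"].shape)  # raw waveform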

Dataset Usage

  • Training Data: Used to fine-tune the Whisper model for Indonesian → English speech-to-text translation.
  • Validation Data: Used to evaluate the performance of the model during training.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 16
  • eval_batch_size: 8
  • seed: 42
  • optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • training_steps: 100
  • mixed_precision_training: Native AMP
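
For concreteness, here is a hedged sketch of how the listed hyperparameters map onto Seq2SeqTrainingArguments; the output directory and any field not listed above are illustrative assumptions, not the exact training script:

    from transformers import Seq2SeqTrainingArguments

    training_args = Seq2SeqTrainingArguments(
        output_dir="whisper-small-indo-eng",  # assumed output path
        learning_rate=1e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=8,
        seed=42,
        optim="adamw_torch",        # AdamW with betas=(0.9, 0.999), eps=1e-08
        lr_scheduler_type="linear",
        warmup_steps=500,
        max_steps=100,
        fp16=True,                  # native AMP mixed-precision training
    )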

Model Evaluation

The performance of the baseline and fine-tuned models was evaluated using the BLEU and CHRF metrics on the validation dataset. The fine-tuned model improves over the baseline on both metrics (+1.79 BLEU, +8.74 CHRF).

Model              BLEU Score   CHRF Score
Baseline Model     33.03        52.71
Fine-Tuned Model   34.82        61.45

Evaluation Details

  • BLEU: Measures the overlap between predicted and reference text based on n-grams.
  • CHRF: Uses character n-grams for evaluation, making it particularly suitable for morphologically rich languages.
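
Both scores are computed with sacrebleu, the same library used in the evaluation code below. A toy example of the API shape (the sentences are made up):

    from sacrebleu.metrics import BLEU, CHRF

    bleu = BLEU()
    chrf = CHRF()

    # Two hypotheses and one reference set covering the whole toy corpus.
    hypotheses = ["the dog bit the man", "it was not surprising"]
    references = [["the dog bit the man", "it was not unexpected"]]

    print(bleu.corpus_score(hypotheses, references).score)
    print(chrf.corpus_score(hypotheses, references).score)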

Reproduce Steps

After training and pushing the fine-tuned model to the Hugging Face Hub, a few more steps are needed before it can be evaluated:

  1. Push the tokenizer manually by creating it with WhisperTokenizerFast.
     from transformers import WhisperTokenizerFast

     # Recreate the tokenizer from the base model, configured for translation into English
     tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-small", language="en", task="translate")

     # Save the tokenizer locally
     tokenizer.save_pretrained("whisper-small-indo-eng", legacy_format=False)

     # Push the tokenizer to the Hugging Face Hub
     tokenizer.push_to_hub("cobrayyxx/whisper-small-indo-eng")
    
  2. Convert the model from its Transformers format to a CTranslate2-compatible model (see https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file#model-conversion).
    !ct2-transformers-converter --model cobrayyxx/whisper-small-indo-eng --output_dir cobrayyxx/whisper-small-indo-eng-ct2 --copy_files tokenizer.json preprocessor_config.json --quantization float16
    
  3. Load the converted model with faster-whisper's WhisperModel; in this case the converted model is cobrayyxx/whisper-small-indo-eng-ct2, as sketched below.
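    A minimal sketch of this step, assuming faster-whisper is installed and a CUDA device is available (WhisperModel also accepts a local directory path):
     from faster_whisper import WhisperModel

     # Load the CTranslate2-converted checkpoint produced in step 2.
     model = WhisperModel("cobrayyxx/whisper-small-indo-eng-ct2",
                          device="cuda",
                          compute_type="float16")
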
  4. Now we can run the evaluation, using faster-whisper to load the model and sacrebleu for the metrics.
      from faster_whisper import WhisperModel
      from sacrebleu.metrics import BLEU, CHRF
      from tqdm import tqdm
      import numpy as np

      def predict(audio_array):
          model_name = "cobrayyxx/whisper-small-indo-eng-ct2"  # the CTranslate2 model from step 2
          model = WhisperModel(model_name, device="cuda", compute_type="float16")

          # Translate the Indonesian audio into English segments
          segments, info = model.transcribe(audio_array,
                                            beam_size=5,
                                            language="en",
                                            vad_filter=True)
          return segments, info
    
      def metric_calculation(dataset):
          val_data = dataset["validation"]  # use the argument, not a global
          bleu = BLEU()
          chrf = CHRF()
          lst_pred = []
          lst_gold = []
          for data in tqdm(val_data):
              gold_standard = data["text_en"].lower().strip()
              audio_array = data["audio"]["array"]
              # Ensure the waveform is 1-D
              audio_array = np.ravel(audio_array)

              # Convert to float32 if necessary
              audio_array = audio_array.astype(np.float32)
              pred_segments, pred_info = predict(audio_array)
              prediction_text = " ".join(segment.text for segment in pred_segments).lower().strip()
              lst_pred.append(prediction_text)
              lst_gold.append(gold_standard)
          # sacrebleu expects a list of reference sets, each spanning the whole
          # corpus, so wrap the single reference list once.
          bleu_score = bleu.corpus_score(lst_pred, [lst_gold]).score
          chrf_score = chrf.corpus_score(lst_pred, [lst_gold]).score

          return bleu_score, chrf_score
    
     Now run the evaluation.
     bleu_score, chrf_score = metric_calculation(fleurs_dataset)
     print(bleu_score, chrf_score)
    

Framework versions

  • Transformers 4.46.3
  • Pytorch 2.5.1+cu121
  • Datasets 3.2.0
  • Tokenizers 0.20.3
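
To reproduce this environment, the pinned versions above can be installed with pip. A sketch: the exact CUDA build of PyTorch may differ, and faster-whisper and sacrebleu (used in the evaluation steps) are unpinned here.

    pip install transformers==4.46.3 torch==2.5.1 datasets==3.2.0 tokenizers==0.20.3
    pip install faster-whisper sacrebleu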

Credits

Huge thanks to Yasmin Moslem for mentoring me.
