---
library_name: transformers
license: apache-2.0
base_model: openai/whisper-small
tags:
- generated_from_trainer
model-index:
- name: whisper-small-indo-eng
  results: []
---

# whisper-small-indo-eng

## Model description

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the [cobrayyxx/FLEURS_INDO-ENG_Speech_Translation](https://huggingface.co/datasets/cobrayyxx/FLEURS_INDO-ENG_Speech_Translation) dataset.

## Dataset: FLEURS_INDO-ENG_Speech_Translation

This model was fine-tuned using the `cobrayyxx/FLEURS_INDO-ENG_Speech_Translation` dataset, a speech translation dataset for the **Indonesian ↔ English** language pair. The dataset is part of the FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) collection and is specifically designed for speech-to-text translation tasks.

### Key Features:

- **audio**: Audio clip in Bahasa/Indonesian.
- **text_indo**: Transcription of the audio in Bahasa/Indonesian.
- **text_en**: English translation of the audio content.

### Dataset Usage

- **Training Data**: Used to fine-tune the Whisper model for Indonesian → English speech-to-text translation.
- **Validation Data**: Used to evaluate the performance of the model during training.

## Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- optimizer: Use adamw_torch with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps (epoch): 100
- mixed_precision_training: Native AMP

## Model Evaluation

The baseline and fine-tuned models were evaluated with the BLEU and CHRF metrics on the validation split of the dataset. The fine-tuned model improves on the baseline by 1.79 BLEU points and 8.74 CHRF points.


| Model            | BLEU Score | CHRF Score |
|------------------|------------|------------|
| Baseline Model   | **33.03**  | **52.71**  |
| Fine-Tuned Model | **34.82**  | **61.45**  |

### Evaluation Details

- **BLEU**: Measures the overlap between predicted and reference text based on word n-grams.
- **CHRF**: Uses character n-grams for evaluation, making it particularly suitable for morphologically rich languages.

### Reproduce Steps

After [training](https://huggingface.co/blog/fine-tune-whisper) the model and pushing it to the Hugging Face Hub, a few more steps are needed before it can be evaluated:

1. Push the tokenizer manually by creating it from `WhisperTokenizerFast`.

   ```
   from transformers import WhisperTokenizerFast

   # Create the tokenizer from the base checkpoint, configured for translation into English
   tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-small", language="en", task="translate")

   # Save the tokenizer locally
   tokenizer.save_pretrained("whisper-small-indo-eng", legacy_format=False)

   # Push the tokenizer to the Hugging Face Hub
   tokenizer.push_to_hub("cobrayyxx/whisper-small-indo-eng")
   ```

2. Convert the model from the Transformers format to the CTranslate2 format (source: https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file#model-conversion). The leading `!` runs the command from a notebook cell.

   ```
   !ct2-transformers-converter --model cobrayyxx/whisper-small-indo-eng --output_dir cobrayyxx/whisper-small-indo-eng-ct2 --copy_files tokenizer.json preprocessor_config.json --quantization float16
   ```

3. Load the converted model with faster-whisper's `WhisperModel`. In this case the CTranslate2 model is `cobrayyxx/whisper-small-indo-eng-ct2`.
4. Run the evaluation, using faster-whisper to load the model and sacrebleu to compute the metrics.


```
import numpy as np
from faster_whisper import WhisperModel
from sacrebleu.metrics import BLEU, CHRF
from tqdm import tqdm

# CTranslate2 model produced in step 2; load it once and reuse it for every clip
model_name = "cobrayyxx/whisper-small-indo-eng-ct2"
model = WhisperModel(model_name, device="cuda", compute_type="float16")


def predict(audio_array):
    segments, info = model.transcribe(
        audio_array,
        beam_size=5,
        language="en",
        vad_filter=True,
    )
    return segments, info


def metric_calculation(dataset):
    val_data = dataset["validation"]
    bleu = BLEU()
    chrf = CHRF()
    lst_pred = []
    lst_gold = []
    for data in tqdm(val_data):
        gold_standard = data["text_en"].lower().strip()
        # Ensure the waveform is 1-D and float32
        audio_array = np.ravel(data["audio"]["array"]).astype(np.float32)
        pred_segments, pred_info = predict(audio_array)
        prediction_text = " ".join(segment.text for segment in pred_segments).lower().strip()
        lst_pred.append(prediction_text)
        lst_gold.append(gold_standard)
    # sacrebleu expects a list of reference streams, each aligned with the predictions
    bleu_score = bleu.corpus_score(lst_pred, [lst_gold]).score
    chrf_score = chrf.corpus_score(lst_pred, [lst_gold]).score
    return bleu_score, chrf_score
```

Now run the evaluation.

```
from datasets import load_dataset

fleurs_dataset = load_dataset("cobrayyxx/FLEURS_INDO-ENG_Speech_Translation")
pretrain_bleu_score, pretrain_chrf_score = metric_calculation(fleurs_dataset)
print(pretrain_bleu_score, pretrain_chrf_score)
```

## Framework versions

- Transformers 4.46.3
- Pytorch 2.5.1+cu121
- Datasets 3.2.0
- Tokenizers 0.20.3

## Reference

- https://huggingface.co/blog/fine-tune-whisper

## Credits

Huge thanks to [Yasmin Moslem](https://huggingface.co/ymoslem) for mentoring me.
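
## Appendix: character n-gram F-score sketch

The Evaluation Details section describes CHRF as a metric built on character n-grams. As a rough illustration of that idea, here is a minimal pure-Python sketch. It is not the sacrebleu implementation used for the scores reported in this card, and its numbers will not match sacrebleu's exactly. The function names `char_ngrams` and `simple_chrf` are hypothetical, and the choices made here (ignoring whitespace, n-gram orders 1 to 6, recall weighted with `beta=2`, averaging precision and recall over orders) are assumptions for illustration.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of length n, ignoring whitespace (an assumption)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level character F-score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the texts is shorter than n characters
        # Counter intersection keeps the minimum count of each shared n-gram
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


if __name__ == "__main__":
    print(simple_chrf("the cat sat on the mat", "the cat sits on the mat"))
```

Because the score works on characters rather than whole words, a prediction such as "sat" still earns partial credit against the reference "sits", whereas a word-level n-gram metric like BLEU would count it as a plain mismatch.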