cobrayyxx committed 2f2761c (verified) · Parent: c0c3955

Update README.md

Files changed (1): README.md (+94 −16)

# whisper-small-indo-eng

## Model description

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the [cobrayyxx/FLEURS_INDO-ENG_Speech_Translation](https://huggingface.co/datasets/cobrayyxx/FLEURS_INDO-ENG_Speech_Translation) dataset for Indonesian → English speech-to-text translation.
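For quick inference with Transformers, a minimal sketch along these lines should work. The audio path is a placeholder, and the generation argument mirrors the `language="en"` setting used in the evaluation code further down rather than anything stated in this card:

```python
from transformers import pipeline

# Load the fine-tuned checkpoint as a speech-recognition pipeline.
pipe = pipeline(
    "automatic-speech-recognition",
    model="cobrayyxx/whisper-small-indo-eng",
)

# "audio.wav" is a placeholder path to a 16 kHz Indonesian speech clip.
result = pipe("audio.wav", generate_kwargs={"language": "en"})
print(result["text"])
```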
## Dataset: FLEURS_INDO-ENG_Speech_Translation

This model was fine-tuned on the `cobrayyxx/FLEURS_INDO-ENG_Speech_Translation` dataset, a speech translation dataset for the **Indonesian ↔ English** language pair. The dataset is part of the FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) collection and is specifically designed for speech-to-text translation tasks.

### Key Features
- **audio**: Audio clip in Indonesian (Bahasa Indonesia).
- **text_indo**: Transcription of the audio in Indonesian.
- **text_en**: English translation of the transcription.

### Dataset Usage
- **Training data**: Used to fine-tune the Whisper model for Indonesian → English speech-to-text translation.
- **Validation data**: Used to evaluate the performance of the model during training.
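The dataset can be loaded with the `datasets` library. This is a minimal sketch: the `validation` split name follows the evaluation code further down, and the 16 kHz resampling is standard Whisper preprocessing rather than something recorded in this card:

```python
from datasets import Audio, load_dataset

# Load the Indonesian -> English speech translation dataset.
fleurs_dataset = load_dataset("cobrayyxx/FLEURS_INDO-ENG_Speech_Translation")

# Whisper models expect 16 kHz audio; resample if the stored rate differs.
fleurs_dataset = fleurs_dataset.cast_column("audio", Audio(sampling_rate=16_000))

print(fleurs_dataset)                              # available splits and columns
print(fleurs_dataset["validation"][0]["text_en"])  # one English reference sentence
```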
## Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- optimizer: adamw_torch with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 100
- mixed_precision_training: Native AMP
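For reference, these values map onto `Seq2SeqTrainingArguments` roughly as sketched below (following the [fine-tune-whisper](https://huggingface.co/blog/fine-tune-whisper) recipe; `output_dir` and the batch size are illustrative placeholders, not values recorded in this card):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-indo-eng",  # placeholder
    per_device_train_batch_size=16,       # placeholder, not listed above
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=100,
    optim="adamw_torch",
    fp16=True,                            # "Native AMP" mixed precision
)
```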
## Model Evaluation

The performance of the baseline and fine-tuned models was evaluated with the BLEU and CHRF metrics on the validation split. The fine-tuned model improves on the baseline under both metrics, with a modest BLEU gain and a larger CHRF gain.

| Model            | BLEU Score | CHRF Score |
|------------------|------------|------------|
| Baseline Model   | 33.03      | 52.71      |
| Fine-Tuned Model | **34.82**  | **61.45**  |

### Evaluation Details
- **BLEU**: Measures the overlap between predicted and reference text based on n-grams.
- **CHRF**: Uses character n-grams for evaluation, making it particularly suitable for morphologically rich languages.
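Both metrics are computed with [sacrebleu](https://github.com/mjpost/sacrebleu). The toy example below (made-up sentences) shows the API shape used in the reproduction code: hypotheses are a flat list of strings, and references are passed as a list of reference *sets*, each aligned with the hypotheses:

```python
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["the cat sat on the mat", "hello world"]
references = [["the cat is on the mat", "hello world"]]  # one reference set, aligned with hypotheses

print(BLEU().corpus_score(hypotheses, references).score)
print(CHRF().corpus_score(hypotheses, references).score)
```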
### Reproduce Steps
After [training](https://huggingface.co/blog/fine-tune-whisper) the model and pushing it to the Hugging Face Hub, a few more steps are needed before it can be evaluated:
1. Push the tokenizer manually by creating it from `WhisperTokenizerFast` (the fast tokenizer writes `tokenizer.json`, which the conversion step below copies into the converted model).
```python
from transformers import WhisperTokenizerFast

# Whisper's tokenizer is not changed by fine-tuning, so it can be created from the base model
tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-small", language="en", task="translate")

# Save the tokenizer locally in the non-legacy format so that tokenizer.json is written
tokenizer.save_pretrained("whisper-small-indo-eng", legacy_format=False)

# Push the tokenizer to the fine-tuned model's repository on the Hugging Face Hub
tokenizer.push_to_hub("cobrayyxx/whisper-small-indo-eng")
```
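As an optional sanity check, the pushed tokenizer should now load directly from the fine-tuned model's repository:

```python
from transformers import WhisperTokenizerFast

# If the push succeeded, this resolves tokenizer.json from the fine-tuned repo.
tokenizer = WhisperTokenizerFast.from_pretrained("cobrayyxx/whisper-small-indo-eng")
```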
2. Convert the model from the Transformers format to the CTranslate2 format used by faster-whisper (see the [faster-whisper model-conversion instructions](https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file#model-conversion)).
```bash
ct2-transformers-converter --model cobrayyxx/whisper-small-indo-eng --output_dir cobrayyxx/whisper-small-indo-eng-ct2 --copy_files tokenizer.json preprocessor_config.json --quantization float16
```
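The same conversion can also be done from Python with CTranslate2's converter API. The sketch below is an assumed equivalent of the CLI call above, so verify the argument names against the CTranslate2 documentation:

```python
import ctranslate2

# Assumed equivalent of the ct2-transformers-converter command above.
converter = ctranslate2.converters.TransformersConverter(
    "cobrayyxx/whisper-small-indo-eng",
    copy_files=["tokenizer.json", "preprocessor_config.json"],
)
converter.convert("cobrayyxx/whisper-small-indo-eng-ct2", quantization="float16")
```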
3. Load the converted model (here `cobrayyxx/whisper-small-indo-eng-ct2`) with faster-whisper's `WhisperModel`.
4. Run the evaluation, using faster-whisper for inference and sacrebleu for the metrics.
```python
import numpy as np
from faster_whisper import WhisperModel
from sacrebleu.metrics import BLEU, CHRF
from tqdm import tqdm

# Load the converted fine-tuned model once (step 3)
model = WhisperModel("cobrayyxx/whisper-small-indo-eng-ct2", device="cuda", compute_type="float16")

def predict(audio_array):
    # Translate one Indonesian audio clip into English text
    segments, info = model.transcribe(
        audio_array,
        beam_size=5,
        language="en",
        vad_filter=True,
    )
    return segments, info

def metric_calculation(dataset):
    val_data = dataset["validation"]
    bleu = BLEU()
    chrf = CHRF()
    lst_pred = []
    lst_gold = []
    for data in tqdm(val_data):
        gold_standard = data["text_en"].lower().strip()
        audio_array = data["audio"]["array"]
        # Ensure the audio is a 1-D float32 array, as faster-whisper expects
        audio_array = np.ravel(audio_array).astype(np.float32)
        pred_segments, pred_info = predict(audio_array)
        prediction_text = " ".join(segment.text for segment in pred_segments).lower().strip()
        lst_pred.append(prediction_text)
        lst_gold.append(gold_standard)
    # sacrebleu expects the references as a list of reference sets, each aligned with the hypotheses
    bleu_score = bleu.corpus_score(lst_pred, [lst_gold]).score
    chrf_score = chrf.corpus_score(lst_pred, [lst_gold]).score

    return bleu_score, chrf_score
```
Now run the evaluation:
```python
# fleurs_dataset is the dataset dictionary loaded from cobrayyxx/FLEURS_INDO-ENG_Speech_Translation
# (it must contain a "validation" split with "audio" and "text_en" columns).
finetuned_bleu_score, finetuned_chrf_score = metric_calculation(fleurs_dataset)
print(finetuned_bleu_score, finetuned_chrf_score)
```
## Framework versions

- Transformers 4.46.3
- PyTorch 2.5.1+cu121
- Datasets 3.2.0
- Tokenizers 0.20.3

## Reference
- https://huggingface.co/blog/fine-tune-whisper

## Credits
Huge thanks to [Yasmin Moslem](https://huggingface.co/ymoslem) for mentoring me.