Update README.md
README.md
CHANGED
@@ -14,23 +14,24 @@ should probably proofread and complete it, then remove this comment. -->
|
|
14 |
|
15 |
# whisper-small-indo-eng
|
16 |
|
17 |
-
This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on an unknown dataset.
|
18 |
|
19 |
## Model description
|
20 |
|
21 |
-
|
22 |
|
23 |
-
##
|
24 |
|
25 |
-
26 |
|
27 |
-
28 |
|
29 |
-
|
30 |
-
|
31 |
-
## Training procedure
|
32 |
-
|
33 |
-
### Training hyperparameters
|
34 |
|
35 |
The following hyperparameters were used during training:
|
36 |
- learning_rate: 1e-05
|
@@ -40,16 +41,93 @@ The following hyperparameters were used during training:
|
|
40 |
- optimizer: Use adamw_torch with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
|
41 |
- lr_scheduler_type: linear
|
42 |
- lr_scheduler_warmup_steps: 500
|
43 |
-
- training_steps: 100
|
44 |
- mixed_precision_training: Native AMP
|
45 |
|
46 |
-
|
47 |
-
|
48 |
-
|
49 |
-
|
50 |
-
51 |
|
52 |
- Transformers 4.46.3
|
53 |
- Pytorch 2.5.1+cu121
|
54 |
- Datasets 3.2.0
|
55 |
- Tokenizers 0.20.3
# whisper-small-indo-eng

## Model description

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the [cobrayyxx/FLEURS_INDO-ENG_Speech_Translation](https://huggingface.co/datasets/cobrayyxx/FLEURS_INDO-ENG_Speech_Translation) dataset.

## Dataset: FLEURS_INDO-ENG_Speech_Translation

This model was fine-tuned on the `cobrayyxx/FLEURS_INDO-ENG_Speech_Translation` dataset, a speech translation dataset for the **Indonesian ↔ English** language pair. The dataset is part of the FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) collection and is designed for speech-to-text translation tasks.

### Key Features

- **audio**: Audio clip in Indonesian (Bahasa Indonesia).
- **text_indo**: Transcription of the audio in Indonesian.
- **text_en**: English translation of the audio.

### Dataset Usage

- **Training Data**: Used to fine-tune the Whisper model for Indonesian → English speech-to-text translation.
- **Validation Data**: Used to evaluate the performance of the model during training.
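
The column layout above can be illustrated with a minimal sketch of how one row turns into a translation training pair. It assumes the `audio` column decodes to a dictionary with `array` and `sampling_rate` keys, as the Hugging Face `datasets` Audio feature usually does; `to_translation_pair` and the sample row are hypothetical and only stand in for real data.

```python
def to_translation_pair(row):
    """Return (waveform, sampling_rate, english_target) for one dataset row."""
    audio = row["audio"]
    return audio["array"], audio["sampling_rate"], row["text_en"].strip()


# Hypothetical stand-in row with the same three columns as the dataset
sample_row = {
    "audio": {"array": [0.0, 0.1, -0.1], "sampling_rate": 16000},
    "text_indo": "selamat pagi",
    "text_en": "good morning ",
}

waveform, rate, target = to_translation_pair(sample_row)
print(len(waveform), rate, repr(target))
```

The Indonesian transcription in `text_indo` is not needed for the translation target, so the sketch leaves it unused.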

## Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- optimizer: Use adamw_torch with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps (epoch): 100
- mixed_precision_training: Native AMP
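
To make the scheduler settings concrete, here is a minimal sketch of a linear schedule with warmup, written to mirror the usual behaviour of the Transformers linear scheduler rather than copied from it. `linear_schedule_lr` is a hypothetical helper. The defaults assume that the `100` above counts optimizer steps, which the entry leaves ambiguous; under that assumption the 500-step warmup never completes and the rate peaks at one fifth of `learning_rate`.

```python
def linear_schedule_lr(step, base_lr=1e-5, warmup_steps=500, total_steps=100):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Warmup phase: scale the base rate by the fraction of warmup completed
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale by the fraction of post-warmup steps still remaining
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 50, 100):
    print(step, linear_schedule_lr(step))
```

If the value is a number of epochs instead, the total step count is larger and the warmup may finish, so the numbers printed here would not apply.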

## Model Evaluation

The baseline and fine-tuned models were evaluated with the BLEU and CHRF metrics on the validation set.
The fine-tuned model improves on the baseline by 1.79 BLEU points and 8.74 CHRF points.

| Model            | BLEU Score | CHRF Score |
|------------------|------------|------------|
| Baseline Model   | **33.03**  | **52.71**  |
| Fine-Tuned Model | **34.82**  | **61.45**  |
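
The size of the improvement can be checked with a few lines of arithmetic. This sketch only re-uses the four scores from the table; `gains` is a hypothetical helper that returns the absolute gain in points and the relative gain in percent.

```python
# (baseline, fine-tuned) scores copied from the table above
scores = {"BLEU": (33.03, 34.82), "CHRF": (52.71, 61.45)}


def gains(baseline, fine_tuned):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = fine_tuned - baseline
    return absolute, 100 * absolute / baseline


for metric, (baseline, fine_tuned) in scores.items():
    absolute, relative = gains(baseline, fine_tuned)
    print(f"{metric}: +{absolute:.2f} points ({relative:.1f}% relative)")
```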

### Evaluation Details

- **BLEU**: Measures the overlap between predicted and reference text based on n-grams.
- **CHRF**: Uses character n-grams for evaluation, making it particularly suitable for morphologically rich languages.
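
As an illustration of the character n-gram idea, the sketch below computes a simplified CHRF-style score for a single sentence pair. It is not sacrebleu's implementation and will not reproduce the scores reported here; `char_ngrams` and `simple_chrf` are hypothetical helpers, and the choices of n-gram orders 1 to 6 and `beta=2` are assumptions taken from common CHRF settings.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of length n, ignoring spaces."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified character n-gram F-score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


print(simple_chrf("the cats sat", "the cat sat"))
```

Because matching happens at the character level, a near miss such as "cats" versus "cat" still earns partial credit, which word-level n-gram matching would not give.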

### Reproduction Steps

After [training](https://huggingface.co/blog/fine-tune-whisper) the model and pushing it to the Hugging Face Hub, a few more steps are needed before it can be evaluated:

1. Push the tokenizer manually by creating it from `WhisperTokenizerFast`.
```
from transformers import WhisperTokenizerFast

# Create the tokenizer from the base checkpoint, configured for translation into English
tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-small", language="en", task="translate")

# Save the tokenizer locally
tokenizer.save_pretrained("whisper-small-indo-eng", legacy_format=False)

# Push the tokenizer to the Hugging Face Hub
tokenizer.push_to_hub("cobrayyxx/whisper-small-indo-eng")
```
2. Convert the model from the Transformers format to the CTranslate2 format (source: https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file#model-conversion). The leading `!` runs the command from a notebook cell.
```
!ct2-transformers-converter --model cobrayyxx/whisper-small-indo-eng --output_dir cobrayyxx/whisper-small-indo-eng-ct2 --copy_files tokenizer.json preprocessor_config.json --quantization float16
```
3. Load the converted model with faster-whisper's `WhisperModel`; in this case it is `cobrayyxx/whisper-small-indo-eng-ct2`.
4. Run the evaluation, using faster-whisper to load the model and sacrebleu to compute the metrics.
```
import numpy as np
from faster_whisper import WhisperModel
from sacrebleu.metrics import BLEU, CHRF
from tqdm import tqdm

# Load the converted CTranslate2 model once instead of on every call
model_name = "cobrayyxx/whisper-small-indo-eng-ct2"
model = WhisperModel(model_name, device="cuda", compute_type="float16")

def predict(audio_array):
    # language="en" matches the tokenizer from step 1; `task` keeps its default
    segments, info = model.transcribe(
        audio_array,
        beam_size=5,
        language="en",
        vad_filter=True,
    )
    return segments, info

def metric_calculation(dataset):
    val_data = dataset["validation"]
    bleu = BLEU()
    chrf = CHRF()
    lst_pred = []
    lst_gold = []
    for data in tqdm(val_data):
        gold_standard = data["text_en"].lower().strip()
        # Ensure the audio is a 1-D float32 array
        audio_array = np.ravel(data["audio"]["array"]).astype(np.float32)
        pred_segments, pred_info = predict(audio_array)
        prediction_text = " ".join(segment.text for segment in pred_segments).lower().strip()
        lst_pred.append(prediction_text)
        lst_gold.append(gold_standard)
    # sacrebleu expects a list of reference streams, each aligned with lst_pred
    bleu_score = bleu.corpus_score(lst_pred, [lst_gold]).score
    chrf_score = chrf.corpus_score(lst_pred, [lst_gold]).score

    return bleu_score, chrf_score
```
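
One detail of the `corpus_score` calls above is easy to get wrong: sacrebleu takes the references as a list of reference streams, where each stream holds one reference per prediction, rather than as one list per sentence. A minimal sketch of converting between the two layouts follows; `to_reference_streams` is a hypothetical helper, and the sentences are made-up examples.

```python
def to_reference_streams(per_sentence_refs):
    """Transpose per-sentence reference lists into one stream per reference index."""
    return [list(stream) for stream in zip(*per_sentence_refs)]


# One reference per sentence, stored sentence by sentence
per_sentence = [["good morning"], ["how are you"]]

# A single stream that lines up with a two-item prediction list
streams = to_reference_streams(per_sentence)
print(streams)
```

With a single reference per sentence, the result is one stream of the same length as the prediction list, which is the shape `corpus_score` expects as its second argument.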
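Now run the evaluation.
```
from datasets import load_dataset

# Assumed to be how `fleurs_dataset` was created: load the dataset named above
fleurs_dataset = load_dataset("cobrayyxx/FLEURS_INDO-ENG_Speech_Translation")
pretrain_bleu_score, pretrain_chrf_score = metric_calculation(fleurs_dataset)
print(pretrain_bleu_score, pretrain_chrf_score)
```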

## Framework versions

- Transformers 4.46.3
- PyTorch 2.5.1+cu121
- Datasets 3.2.0
- Tokenizers 0.20.3

## Reference

- https://huggingface.co/blog/fine-tune-whisper

## Credits

Huge thanks to [Yasmin Moslem](https://huggingface.co/ymoslem) for mentoring me.