Training for translation
May I ask how to fine-tune Whisper for translation from English to another language? I mainly want to know what the dataset should look like. There is a tutorial for Whisper fine-tuning whose example dataset pairs audio with text in its original language. To train for translation, do I pair the audio with the translated text? Many thanks for the help.
Also, do you recommend using Whisper for audio translation, or other models? Appreciate it.
Hey @tirtohadi - you can use the same fine-tuning tutorial as provided. Simply train on pairs of (English audio, translated text). In the tokenizer and processor, you should set the task to "translate", and the language to your target language. Whisper should work quite well for this task, especially the new large-v3 version.
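A minimal sketch of the setup described above, assuming the `openai/whisper-small` checkpoint, French as the target language, and a `sentence` text column (all three are placeholders, not taken from the thread). The helper names `processor_kwargs`, `load_processor`, and `make_example` are hypothetical. Note that the pretrained `translate` task produces English output, so English-to-other-language behaviour has to be learned during fine-tuning.

```python
def processor_kwargs(target_language: str) -> dict:
    """Keyword arguments that select the language and task prefix tokens."""
    return {"language": target_language, "task": "translate"}


def load_processor(checkpoint: str = "openai/whisper-small",
                   target_language: str = "french"):
    """Load a WhisperProcessor configured for translation (downloads files)."""
    # Imported lazily so the rest of the sketch runs without transformers installed.
    from transformers import WhisperProcessor
    return WhisperProcessor.from_pretrained(checkpoint, **processor_kwargs(target_language))


def make_example(audio_path: str, translated_text: str) -> dict:
    """One training row: English audio paired with its translation."""
    return {"audio": audio_path, "sentence": translated_text}


# Invented rows, only to show the shape of the dataset.
rows = [
    make_example("clips/0001.wav", "Bonjour tout le monde."),
    make_example("clips/0002.wav", "Merci beaucoup."),
]
```

The rest of the tutorial's preprocessing can stay as it is: the audio goes through the feature extractor and the translated text is tokenized as the labels.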
While training, how do I save the model as a pickle file, “pytorch_model.bin”?
When I push the model to the repo using “trainer.push_to_hub(**kwargs)” [e.g. CKSINGH/whisper-small-hi-firefox], I don't see the pickle file pushed alongside.
How can I save the pickle file? I need it to integrate the model with an LM.
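One possible approach, assuming a `transformers` release in which weights are saved in the safetensors format (`model.safetensors`) by default: `save_pretrained` accepts `safe_serialization=False`, which writes the pickle-based `pytorch_model.bin` instead. The helper name `save_as_bin` and the output path are hypothetical.

```python
def save_as_bin(model, output_dir: str) -> str:
    """Save `model` in the pickle-based format (pytorch_model.bin).

    `model` is expected to be a transformers PreTrainedModel, e.g. `trainer.model`.
    """
    # safe_serialization=False requests torch.save-style weights
    # instead of a safetensors file.
    model.save_pretrained(output_dir, safe_serialization=False)
    return output_dir
```

After training, this would be called as `save_as_bin(trainer.model, "./whisper-small-hi")`. In versions that expose the flag, setting `save_safetensors=False` in `Seq2SeqTrainingArguments` is the Trainer-level counterpart.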
Thank you @sanchit-gandhi, let me try it out.
Hi @sanchit-gandhi
I tried to follow the tutorial without modification in Google Colab but got this issue: ImportError: Using the Trainer with PyTorch requires accelerate>=0.20.1: Please run pip install transformers[torch] or pip install accelerate -U
This is when trying to run the following code:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-hi",  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)
Could you give a little guidance? Thanks once again for the effort in putting the tutorial together.
By the way, I have run !pip install transformers[torch] and !pip install accelerate -U in the Colab, but it made no difference.
@tirtohadi after running !pip install accelerate -U, restart the Colab session and it will work.
I faced the same issue, but I need a fix for the root cause.
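A likely explanation, offered as an assumption rather than a confirmed diagnosis: a running Python process keeps using whatever it imported before `pip install -U` ran, so the upgraded package only becomes visible after the runtime restarts. The sketch below checks the installed version against the `0.20.1` minimum from the error message; `parse_version`, `meets_minimum`, and `accelerate_status` are hypothetical helper names.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple:
    """Turn '0.20.1' into (0, 20, 1); non-numeric suffixes such as 'dev0' are dropped."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "0.20.1") -> bool:
    """Compare two version strings numerically, component by component."""
    return parse_version(installed) >= parse_version(minimum)


def accelerate_status() -> str:
    """Describe the accelerate version recorded on disk (not the one already imported)."""
    try:
        installed = version("accelerate")
    except PackageNotFoundError:
        return "accelerate is not installed"
    if meets_minimum(installed):
        return f"accelerate {installed} is new enough; restart the runtime if the error persists"
    return f"accelerate {installed} is too old; upgrade, then restart the runtime"


print(accelerate_status())
```

Running the install cell first and restarting before importing `transformers` avoids the stale state.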
@sanchit-gandhi
How should the WER metric be computed for the translation task? I was wondering whether it is smarter to fine-tune on a dialect of a language with a translate (to English) task instead of transcription. My data is gathered from .srt (film) subtitle files, and sometimes the text has the same meaning as the audio but is not literally the same, so I thought that maybe the translation task focuses more on the meaning. Am I wrong?
But at the same time, I think that calculating the WER can be misleading in this situation.
Thanks for the nice tutorial on HF, by the way.
PS: The audio is in an Italian dialect, and the transcription field is in Italian.
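To make the concern concrete, here is a minimal pure-Python sketch of word error rate (word-level edit distance divided by the number of reference words). It is an illustrative implementation, not the metric code used in the tutorial, and the example sentences are invented.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


literal = wer("he left the house early", "he left the house early")
paraphrase = wer("he left the house early", "he went out of the home early")
print(literal, paraphrase)
```

A faithful paraphrase still scores a high WER even though the meaning is preserved, which is why subtitle-style targets may be better judged with a translation-oriented metric such as BLEU.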
@tirtohadi
Hey, how did the fine-tuning on translation go? How big a dataset did you end up using?
To be precise, could you share the exact fine-tuning translation script you used?
I think that 10 hours of audio in the new language will be enough.
@sanchit-gandhi
For fine-tuning Whisper translation, is the loss function the same as it is for transcription? And would it be more beneficial to use BLEU rather than WER?
When I train for translation to English, my accuracy for transcription of the same language drops (the model pretty much forgets it). Is there a way I can train for translation and transcription in one go so that it strikes a balance? If so, how should I structure my training code?
Any help is appreciated.
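One way this could be structured, sketched as an assumption rather than a tested recipe, is to put both tasks in the same training set and set the task prefix per example. `WhisperTokenizer.set_prefix_tokens` switches the language and task tokens; the helper names, field names, and sample row below are hypothetical.

```python
import random


def build_mixed_rows(rows, seed: int = 0):
    """Use every clip twice: once as a transcription target, once as a translation target."""
    mixed = []
    for row in rows:
        mixed.append({"audio": row["audio"], "task": "transcribe", "text": row["transcript"]})
        mixed.append({"audio": row["audio"], "task": "translate", "text": row["translation"]})
    # Shuffle so both tasks are interleaved instead of trained one after the other.
    random.Random(seed).shuffle(mixed)
    return mixed


def encode_labels(tokenizer, row, language: str):
    """Encode the target text with the prefix tokens that match this row's task."""
    tokenizer.set_prefix_tokens(language=language, task=row["task"])
    return tokenizer(row["text"]).input_ids


# Invented row, only to show the expected fields.
rows = [{"audio": "clips/0001.wav", "transcript": "नमस्ते", "translation": "Hello"}]
mixed = build_mixed_rows(rows)
```

In the tutorial's `prepare_dataset` step, `encode_labels(tokenizer, row, "hindi")` would replace the fixed-task tokenizer call, so each label sequence carries its own task token.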