Whisper-Large-V3 does not work with explicit use of dtype which is given in config.json
Hi,
Thanks for sharing this model and related work.
I downloaded the Whisper-Large-V3 model with the HF Hub snapshot download, ignoring the patterns for msgpack, h5, fp32*, and safetensors. As a result, the only model file downloaded is pytorch_model.bin (dtype float16, ~3GB), along with the rest of the repo files.
In config.json, the torch_dtype is given as float16. If I use the default settings (no torch_dtype at all, or torch_dtype = torch.float32 set explicitly) for pipeline or AutoConfig... + AutoModel... + AutoProcessor..., it works fine. However, if I use torch_dtype = torch.float16, I get the following error:
Input type (torch.FloatTensor) and weight type (torch.HalfTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor...
My setup:
OS: MacOS Monterey 12.7.1
device = "cpu"
Python: 3.11.6
Transformers: 4.35.2
2 questions:
1. Why is this model not working with the dtype given in config.json? If I explicitly define the dtype for the Large-V3 model, what other changes should I make for it to work?
2. I also tried all the other Whisper models (tiny, base, small, and medium, all of which have float32 as the dtype in config.json), and hit the same problem when using torch_dtype=torch.float16. Is it possible at all to use a Whisper PyTorch model with a dtype different from the entry in config.json, or am I making a basic mistake? If it is possible, what do I need to adapt?
Thank you in advance.
The Whisper model was trained in float16, hence the weights are in float16 on the Hub. When we call from_pretrained, we automatically upcast to float32, unless you specify torch_dtype=torch.float16 as you have done: https://huggingface.co/docs/transformers/main_classes/model#model-instantiation-dtype
To fix your issue:
- You also need to convert your inputs to the same dtype as the model, i.e. input_features = input_features.to(torch.float16). If you can share a code snippet of how you're running the model, I can show you where to add this line.
- Yes, following the first point, you can run any Whisper model in your desired dtype, provided the input_features are the same dtype as the model.
You can see an example of float16 evaluation for distil-whisper here: https://huggingface.co/distil-whisper/distil-large-v2#evaluation
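As a rough illustration of the dtype-matching step, here is a minimal sketch in plain PyTorch. It does not load Whisper: the Conv1d layer is only a toy stand-in for the model's first convolution, the tensor shapes are arbitrary, and match_model_dtype is a hypothetical helper name, not part of Transformers. The forward pass is deliberately left out, since float16 operator support on CPU depends on the PyTorch version.

```python
import torch
from torch import nn


def match_model_dtype(model: nn.Module, inputs: torch.Tensor) -> torch.Tensor:
    """Cast inputs to the dtype of the model's parameters (hypothetical helper)."""
    model_dtype = next(model.parameters()).dtype
    return inputs.to(model_dtype)


# Toy stand-in for a float16 model; NOT the real Whisper architecture.
toy_model = nn.Conv1d(in_channels=80, out_channels=8, kernel_size=3, padding=1)
toy_model = toy_model.to(torch.float16)

# Feature extractors typically return float32 tensors; the shape here is arbitrary.
input_features = torch.randn(1, 80, 100)

# Same idea as input_features = input_features.to(torch.float16),
# but the target dtype is read from the model instead of being hard-coded.
input_features = match_model_dtype(toy_model, input_features)

print(toy_model.weight.dtype, input_features.dtype)
```

Reading the dtype from the model means the same line keeps working if the model is later loaded in float32 instead.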
Thank you @sanchit-gandhi for your response.
I share my little Colab notebook.
https://colab.research.google.com/drive/1uNCpZd6_g2MeuRn20AOW8cXs7Dbinoff
I'm using a simple pipeline call. The data type and device are defined in the 5th cell. For a given media file (an audio file; I've tested mp4/webm/mp3 formats), this code runs perfectly fine with u_torch_dtype = torch.float32. However, changing it to float16 raises the issue I mentioned.
Please let me know what changes are required for input_features, and where, as you suggested above.