Introduce a custom Sentence Transformer module for smooth multi-modality
Hello @infgrad !
Preface
Congratulations on this release! And in your spare time, too, which is quite impressive. Looking forward to seeing people experiment with this.
I'm also curious how you used the excellent https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu to train a strong embedding model - perhaps it can be reproduced with the brand new https://huggingface.co/datasets/HuggingFaceFW/fineweb-2 for multilinguality as well?
Pull Request overview
- Introduce a custom Sentence Transformer module, based on the common Transformer module.
- Add a `Normalize` module as well, to always normalize the embeddings (this can be useful).
- Add `"padding_side": "right"` to `sentence_bert_config.json`. FYI: this file contains the "defaults" for the new `MultiModalTransformer` class, so we can just add our desired defaults there.
- Update the README example.
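To illustrate the defaults mechanism from the third bullet: as far as I understand it, the key/value pairs in `sentence_bert_config.json` are passed as keyword arguments when the module is loaded. Below is a dependency-free sketch of that pattern. `ToyModule`, its `load` method, and the config values are hypothetical stand-ins, not the actual `MultiModalTransformer` code.

```python
import json
import tempfile
from pathlib import Path


class ToyModule:
    """Hypothetical stand-in for a custom module (not the real MultiModalTransformer)."""

    def __init__(self, max_seq_length: int = 512, padding_side: str = "left") -> None:
        self.max_seq_length = max_seq_length
        self.padding_side = padding_side

    @classmethod
    def load(cls, model_dir: str) -> "ToyModule":
        # Read the per-module defaults and forward them as keyword arguments.
        config_path = Path(model_dir) / "sentence_bert_config.json"
        config = json.loads(config_path.read_text(encoding="utf-8"))
        return cls(**config)


# Write a config with the desired defaults (illustrative values), then load from it.
with tempfile.TemporaryDirectory() as model_dir:
    config = {"max_seq_length": 1024, "padding_side": "right"}
    config_file = Path(model_dir) / "sentence_bert_config.json"
    config_file.write_text(json.dumps(config), encoding="utf-8")
    module = ToyModule.load(model_dir)

print(module.padding_side, module.max_seq_length)
```

The point of the design is that changing a default only requires editing the JSON file, not the module code.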
Details
This model is a textbook example of why I recently added the possibility for custom Sentence Transformer Modules (docs). In short, we can extend any of the existing modules (or make completely new ones) and override e.g. `tokenize`, `forward`, `__init__`, etc. As you can see here, this allows you to add multimodality without introducing any work to the end user.
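As a rough picture of why this needs no extra work from the end user: the first module in the pipeline turns raw inputs into features via `tokenize`, and each module's `forward` then runs in order. The sketch below mimics that flow in plain Python. `ToyEncoder`, `ToyNormalize`, `encode`, and the dict-style input are all hypothetical and only illustrate the pattern, not the real Sentence Transformers classes.

```python
import math
from typing import Dict, List


class ToyEncoder:
    """Hypothetical module exposing the two hooks mentioned above."""

    def tokenize(self, inputs: List[object]) -> Dict[str, list]:
        # Accept plain strings or {"text": ...} dicts, so callers need no extra preprocessing.
        texts = [item["text"] if isinstance(item, dict) else str(item) for item in inputs]
        return {"tokens": [text.lower().split() for text in texts]}

    def forward(self, features: Dict[str, list]) -> Dict[str, list]:
        # Toy "embedding": (token count, total characters) per input.
        features["sentence_embedding"] = [
            [float(len(tokens)), float(sum(len(t) for t in tokens))]
            for tokens in features["tokens"]
        ]
        return features


class ToyNormalize:
    """Hypothetical module that L2-normalizes every embedding."""

    def forward(self, features: Dict[str, list]) -> Dict[str, list]:
        normalized = []
        for vector in features["sentence_embedding"]:
            norm = math.sqrt(sum(x * x for x in vector)) or 1.0
            normalized.append([x / norm for x in vector])
        features["sentence_embedding"] = normalized
        return features


def encode(inputs: List[object], modules: list) -> List[List[float]]:
    # The first module tokenizes; every module's forward then runs in order.
    features = modules[0].tokenize(inputs)
    for module in modules:
        features = module.forward(features)
    return features["sentence_embedding"]


embeddings = encode(["Hello world", {"text": "a custom module"}], [ToyEncoder(), ToyNormalize()])
```

The caller only ever touches `encode`; supporting a new input type is a change inside `tokenize`.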
My module is simply your custom `tokenize` and `forward` put into a module, with two small changes: the `forward` call is now in charge of ensuring that the `pixel_values` is of the correct dtype, and we use `max_seq_length` as the tokenizer max length instead of being hardcoded to 1024.
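The second change can be pictured with a small, dependency-free sketch: the truncation limit comes from a `max_seq_length` attribute rather than a literal 1024, and padding is applied on the right. `ToyTokenizerModule` is hypothetical, and a whitespace split stands in for the real tokenizer. The dtype handling for `pixel_values` is not shown here.

```python
from typing import Dict, List


class ToyTokenizerModule:
    """Hypothetical module; a whitespace split stands in for a real tokenizer."""

    def __init__(self, max_seq_length: int = 1024, pad_token: str = "<pad>") -> None:
        self.max_seq_length = max_seq_length
        self.pad_token = pad_token

    def tokenize(self, texts: List[str]) -> Dict[str, list]:
        # Truncate with the configurable attribute rather than a literal 1024.
        truncated = [text.split()[: self.max_seq_length] for text in texts]
        longest = max((len(tokens) for tokens in truncated), default=0)
        # Right padding: real tokens first, pad tokens at the end.
        input_tokens = [
            tokens + [self.pad_token] * (longest - len(tokens)) for tokens in truncated
        ]
        attention_mask = [
            [1] * len(tokens) + [0] * (longest - len(tokens)) for tokens in truncated
        ]
        return {"input_tokens": input_tokens, "attention_mask": attention_mask}


module = ToyTokenizerModule(max_seq_length=4)
batch = module.tokenize(["one two three four five six", "short text"])
```

With the limit exposed as an attribute, a user who wants shorter or longer inputs can set it without editing the module.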
All that remains is explaining to the user what kind of inputs your `tokenize` method expects, i.e. what the user can input for correct outputs.
- Tom Aarsen
Hi, thank you. This cannot be merged yet; could you please make some changes to README.md?
Definitely, I resolved the merge conflict now.
Hi, @tomaarsen thank you for your PR.
Ha ha ha, I remember your PR for the Stella 1.5 model, which was concise and useful. Thank you very much!
As you know, my model is distilled from other models, so what I need is high-quality unsupervised text from a rich variety of sources. If fineweb-2 has better quality and is sampled from different fields, I think the results would be better.
I have tried four different distillation losses and other settings for the Jasper model; this will be written up in my report.
> Hi, @tomaarsen thank you for your PR.
> Ha ha ha, I remember your PR for the Stella 1.5 model, which was concise and useful. Thank you very much!
Gladly, I always try and help improve the user experience for promising models!
Makes sense to use fineweb-edu as a high-quality source of unsupervised data. Nice work! I'm looking forward to your report to learn about the distillation losses that you've tried - I've only ever used one or two.
- Tom Aarsen