Messages about new models and report

#27
by infgrad - opened
StellaEncoder org

Hi everyone, thanks for using the stella models.
After six months of work, I have trained the jasper model on top of stella. jasper is a multimodal model, and it ranks 2nd on MTEB (results submitted on 2024-12-11, which may need official review: https://github.com/embeddings-benchmark/results/pull/68).

Model link: https://huggingface.co/infgrad/jasper_en_vision_language_v1

I'll now focus on the technical report, training data, and related code; hopefully the tricks I've used will be of some help to you!

This work was done in my free time as a personal hobby. One person's time and energy are limited, so you are welcome to make any contributions!

infgrad pinned discussion

Could you explain in your paper how you obtained the dunzhang/stella_en_400M_v5 model? Is it pure distillation, or did you retrain using a contrastive loss like InfoNCE? I would like to reproduce this model while scaling down the number of layers to make a smaller model.
Thanks

StellaEncoder org

Hi @claeyzre, thank you for your interest.
dunzhang/stella_en_400M_v5 was first distilled on about 100M unsupervised texts, then retrained using a contrastive loss.

I think distillation can be a good pretraining method; you can then fine-tune the model on your specific data.
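
To illustrate that first stage, here is a minimal distillation sketch in the style of the sentence-transformers knowledge-distillation example. This is not the exact stella recipe: the teacher/student choices, the data, and the plain MSE objective are placeholder assumptions.

```python
# Minimal embedding-distillation sketch (placeholder models and data, NOT the
# exact stella recipe): regress a small student onto a frozen teacher's
# vectors with MSE, using only unlabeled text.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

teacher = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True)

# Build a small student; a Dense layer projects its pooled output to the
# teacher's embedding dimension so the MSE objective is well-defined.
word = models.Transformer("distilroberta-base", max_seq_length=512)
pool = models.Pooling(word.get_word_embedding_dimension())
proj = models.Dense(pool.get_sentence_embedding_dimension(),
                    teacher.get_sentence_embedding_dimension())
student = SentenceTransformer(modules=[word, pool, proj])

texts = ["example sentence one", "example sentence two"]  # ~100M texts in practice

# Pre-compute teacher vectors and use them as regression targets.
targets = teacher.encode(texts)
examples = [InputExample(texts=[t], label=v) for t, v in zip(texts, targets)]
loader = DataLoader(examples, shuffle=True, batch_size=2)

student.fit(
    train_objectives=[(loader, losses.MSELoss(model=student))],
    epochs=1,
)
```

The nice property of this stage is that it needs no labels at all, only raw text, which is what makes it usable as a pretraining step before contrastive fine-tuning.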

There are too many distillation tricks to write them all in this report, and I've even forgotten some of them! 😂 Anyway, if you have any questions about your reproduction, please do not hesitate to contact me; I will do my best to help you.

Hello,

I am trying to fine-tune my own pre-trained model with 300M parameters (https://huggingface.co/keeeeenw/MicroLlama) for text embedding.

Based on your replies above,

  1. Is https://www.sbert.net/examples/training/distillation/README.html#knowledge-distillation a good starting point for distillation? What would be a good teacher model?

  2. Is there quick starter code for contrastive loss? It looks like you are also using instructions with the embedding model, so is https://www.sbert.net/examples/training/prompts/README.html a good starting point? (I sketch below what I mean by instruction-based encoding.)
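
For reference, this is roughly what I have in mind; the prompt name is my guess based on the dunzhang/stella_en_400M_v5 model card and it assumes a recent sentence-transformers version, so please correct me if this is not how your models are meant to be used:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True)

# Queries are encoded with a named instruction prompt defined in the model
# config; documents are encoded without any prompt.
query_emb = model.encode(["What is the jasper model?"], prompt_name="s2p_query")
doc_emb = model.encode(["jasper is a multimodal embedding model built on stella."])
print(model.similarity(query_emb, doc_emb))
```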

If you don't have the time to answer these questions, I am looking forward to learning more about these details in your technical report / training code.

Thanks!

StellaEncoder org

Hi, @keeeeenw

  1. https://www.sbert.net/examples/training/distillation/README.html#knowledge-distillation is a good starting point for distillation. Their method is different from mine, but it will give you a better understanding of distillation.

  2. If your teacher models are A and B, then A or B will be the best choice. If A or B is too large for you, you can use a smaller vector model trained on the same data as A or B. If there is no such model, just select any vector model.

  3. As for starter code for contrastive loss, https://github.com/NLPJCL/RAG-Retrieval may be a good choice; the sketch below shows the basic idea.
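
For intuition, here is a minimal InfoNCE-style loss with in-batch negatives, written in plain PyTorch. This is a generic sketch, not the exact stella or RAG-Retrieval code:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch-negatives InfoNCE: doc i is the positive for query i,
    and every other doc in the batch acts as a negative."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                     # (batch, batch) cosine sims
    labels = torch.arange(q.size(0), device=q.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)

# Random tensors stand in for encoder outputs here:
q = torch.randn(8, 768, requires_grad=True)
d = torch.randn(8, 768, requires_grad=True)
info_nce_loss(q, d).backward()
```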

Understood! Thanks for the detailed explanations and for sharing the links. I will study https://github.com/NLPJCL/RAG-Retrieval/tree/master/rag_retrieval/train/embedding more carefully.

Hi @infgrad ,

> dunzhang/stella_en_400M_v5 was first distilled on about 100M unsupervised texts, then retrained using a contrastive loss.
> I think distillation can be a good pretraining method; you can then fine-tune the model on your specific data.

-> This is interesting because usually it's the other way around: first unsupervised contrastive training, then distillation.

> There are too many distillation tricks to write them all in this report, and I've even forgotten some of them! 😂 Anyway, if you have any questions about your reproduction, please do not hesitate to contact me; I will do my best to help you.

Too bad :(. Do you have the distillation code somewhere? I don't mind if it's dirty; I'd be glad to help you make it clearer in your repository.

Thank you very much!

StellaEncoder org

@claeyzre
What is your email? Which code do you want? stella? jasper?

@infgrad my email is ***
I am particularly interested in Stella. Could you send me the code for both Stella and Jasper?
Thanks!

@infgrad , is it possible to send the training code for Stella to me as well? I am also interested in learning more about your distillation process and your data processing pipeline. Your code in https://github.com/NLPJCL/RAG-Retrieval/tree/master/rag_retrieval/train/embedding is very easy to follow, but I cannot do much with the T2 sample training data, or T2 data in general, for the MTEB English rankings because it is in Chinese. I assume you are using a set of different datasets for distillation and for the contrastive loss. My email is [email protected]

StellaEncoder org

Next week I will try to find some useful scripts and upload them to https://huggingface.co/infgrad/jasper_en_vision_language_v1.
