Messages about new models and report

#27
by infgrad - opened
StellaEncoder org

Hi everyone, thanks for using the stella models.
After six months of work, I have trained the jasper model on top of stella. jasper is a multimodal model, and it ranks 2nd on MTEB (results submitted on 2024-12-11, which may need official review: https://github.com/embeddings-benchmark/results/pull/68).

Model link: https://huggingface.co/infgrad/jasper_en_vision_language_v1

I'll now focus on the technical report, training data, and related code; hopefully the tricks I've used will be of some help to you!

This work was done in my free time as a personal hobby. One person's time and energy are limited, so you are welcome to make any contributions!

infgrad pinned discussion

Could you explain in your paper how you obtained the dunzhang/stella_en_400M_v5 model? Is it pure distillation, or did you retrain using a contrastive loss like InfoNCE? I would like to reproduce this model while scaling down the number of layers to make a smaller model.
Thanks

StellaEncoder org

Hi @claeyzre, thank you for your interest.
dunzhang/stella_en_400M_v5 was first distilled on about 100M unsupervised texts, then retrained using a contrastive loss.

I think distillation can be a good pretraining method; you can then fine-tune the model on your specific data.
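
To illustrate that first stage, here is a minimal distillation sketch in the style of the sentence-transformers knowledge-distillation example. This is not the exact stella recipe: the teacher/student choices, the data, and the plain MSE objective are placeholder assumptions.

```python
# Minimal embedding-distillation sketch (placeholder models and data, NOT the
# exact stella recipe): regress a small student onto a frozen teacher's
# vectors with MSE, using only unlabeled text.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

teacher = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True)

# Build a small student; a Dense layer projects its pooled output to the
# teacher's embedding dimension so the MSE objective is well-defined.
word = models.Transformer("distilroberta-base", max_seq_length=512)
pool = models.Pooling(word.get_word_embedding_dimension())
proj = models.Dense(pool.get_sentence_embedding_dimension(),
                    teacher.get_sentence_embedding_dimension())
student = SentenceTransformer(modules=[word, pool, proj])

texts = ["example sentence one", "example sentence two"]  # ~100M texts in practice

# Pre-compute teacher vectors and use them as regression targets.
targets = teacher.encode(texts)
examples = [InputExample(texts=[t], label=v) for t, v in zip(texts, targets)]
loader = DataLoader(examples, shuffle=True, batch_size=2)

student.fit(
    train_objectives=[(loader, losses.MSELoss(model=student))],
    epochs=1,
)
```

The nice property of this stage is that it needs no labels at all, only raw text, which is what makes it usable as a pretraining step before contrastive fine-tuning.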

There are too many distillation tricks to write them all in this report, and I've even forgotten some of them! 😂 Anyway, if you have any questions about your reproduction, please do not hesitate to contact me; I will do my best to help you.

Hello,

I am trying to fine-tune my own pre-trained model with 300M parameters (https://huggingface.co/keeeeenw/MicroLlama) for text embedding.

Based on your replies above,

  1. Is https://www.sbert.net/examples/training/distillation/README.html#knowledge-distillation a good starting point for distillation? What would be a good teacher model?

  2. Is there quick starter code for contrastive loss? It looks like you are also using instructions with the embedding model, so is https://www.sbert.net/examples/training/prompts/README.html a good starting point? (I sketch below what I mean by instruction-based encoding.)
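
For reference, this is roughly what I have in mind; the prompt name is my guess based on the dunzhang/stella_en_400M_v5 model card and it assumes a recent sentence-transformers version, so please correct me if this is not how your models are meant to be used:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True)

# Queries are encoded with a named instruction prompt defined in the model
# config; documents are encoded without any prompt.
query_emb = model.encode(["What is the jasper model?"], prompt_name="s2p_query")
doc_emb = model.encode(["jasper is a multimodal embedding model built on stella."])
print(model.similarity(query_emb, doc_emb))
```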

If you don't have the time to answer these questions, I am looking forward to learning more about these details in your technical report / training code.

Thanks!

StellaEncoder org

Hi, @keeeeenw

  1. https://www.sbert.net/examples/training/distillation/README.html#knowledge-distillation is a good starting point for distillation. Their method is different from mine, but it will give you a better understanding of distillation.

  2. If your teacher models are A and B, then A or B will be the best choice. If A or B is too large for you, you can use a smaller vector model trained on the same data as A or B. If there is no such model, just select any vector model.

  3. As for starter code for contrastive loss, https://github.com/NLPJCL/RAG-Retrieval may be a good choice; the sketch below shows the basic idea.
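
For intuition, here is a minimal InfoNCE-style loss with in-batch negatives, written in plain PyTorch. This is a generic sketch, not the exact stella or RAG-Retrieval code:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch-negatives InfoNCE: doc i is the positive for query i,
    and every other doc in the batch acts as a negative."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                     # (batch, batch) cosine sims
    labels = torch.arange(q.size(0), device=q.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)

# Random tensors stand in for encoder outputs here:
q = torch.randn(8, 768, requires_grad=True)
d = torch.randn(8, 768, requires_grad=True)
info_nce_loss(q, d).backward()
```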

Understood! Thanks for the detailed explanations and for sharing the links. I will study https://github.com/NLPJCL/RAG-Retrieval/tree/master/rag_retrieval/train/embedding more carefully.

Hi @infgrad ,

> dunzhang/stella_en_400M_v5 was first distilled on about 100M unsupervised texts, then retrained using a contrastive loss.
> I think distillation can be a good pretraining method; you can then fine-tune the model on your specific data.

-> This is interesting because usually it's the other way around: first unsupervised contrastive training, then distillation.

> There are too many distillation tricks to write them all in this report, and I've even forgotten some of them! 😂 Anyway, if you have any questions about your reproduction, please do not hesitate to contact me; I will do my best to help you.

Too bad :(. Do you have the distillation code somewhere? I don't mind if it's dirty; I'd be glad to help you make it clearer in your repository.

Thank you very much!

StellaEncoder org

@claeyzre
What is your email? Which code do you want? stella? jasper?

@infgrad my email is ***
I am particularly interested in Stella. Could you send me the code for both Stella and Jasper?
Thanks!

@infgrad , is it possible to send the training code for Stella to me as well? I am also interested in learning more about your distillation process and your data processing pipeline. Your code in https://github.com/NLPJCL/RAG-Retrieval/tree/master/rag_retrieval/train/embedding is very easy to follow, but I cannot do much with the T2 sample training data, or T2 data in general, for the MTEB English rankings because it is in Chinese. I assume you are using a set of different datasets for distillation and for the contrastive loss. My email is [email protected]

StellaEncoder org

Next week I will try to find some useful scripts and upload them to https://huggingface.co/infgrad/jasper_en_vision_language_v1.
