Code for training / finetuning the sparse encoders

#2
by oneryalcin - opened

First of all, many thanks for both the v1 and v2 models. We are using v1 and are happy with the retrieval quality in general, and I'll be evaluating the v2 models as soon as I can. My question is whether you have any plans to release documentation on pretraining or fine-tuning the sparse encoder models. I'd like to adapt the model to our domains (to help with out-of-vocabulary words) and increase recall.

I'd appreciate it if you could share a GitHub repo or a blog post that explains how to fine-tune or pretrain these models. Many thanks again.

opensearch-project org

Thanks for your interest in our project! We're consolidating our training techniques and plan to release a paper covering these details. After that, we'll release the code on GitHub, but we haven't yet decided which repo will host it.

Many thanks, I'm looking forward to reading the paper and testing the code :)

This is great work! Looking forward to the paper! Any ideas when you might release it?

opensearch-project org

@freethenation We have finished the draft and are now improving its structure and writing. After we finalize the paper, it still needs to go through internal review before the paper and code can be publicly released. I expect we'll need a few more months to finish all of this.

Any updates on this?

opensearch-project org

The paper and code are still under internal review before they can be made public.

opensearch-project org

Hi @oneryalcin @freethenation @macavaney, the paper is now public at https://arxiv.org/abs/2411.04403. The code and data are still under a dedicated review process.

Super excited! Thanks for pinging this thread. Going to read your paper now! Update this thread when data & code are available too?

opensearch-project org

Yes, we'll post an update here :)

Awesome, thanks!!

Many thanks @zhichao-geng. I'll dive right into the paper now.

Edit: sorry, I couldn't help adding a podcast on this paper. I'm listening to it on my way and just wanted to share:
https://notebooklm.google.com/notebook/4a37f025-66c4-40dd-b340-239f6f3ea59a/audio

opensearch-project org

We have published the code for fine-tuning and evaluating the model (repo link). It can also be used to train a sparse model from scratch. You can reproduce the results by following the training-data generation process described in the paper.

We also aim to release the training data we generated, but we're not yet sure whether doing so complies with the licenses of all the datasets used, so it's still under review.
