Fine-tuning Alibaba-NLP/gte-Qwen2-7B-instruct for Domain-Specific Retrieval with Query, Positive, and Hard Negatives

#41
by wilfoderek - opened

Hi,

I am exploring the possibility of fine-tuning the Alibaba-NLP/gte-Qwen2-7B-instruct model for a domain-specific retrieval task in Spanish, using a dataset formatted as follows:

Query: A single text input representing the search query.
Positive examples: A list of documents relevant to the query.
Hard negatives: A list of documents contextually similar to the query but explicitly non-relevant.
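
To make the structure above concrete, here is a minimal sketch of how one such record could be represented and flattened into one training row per positive, serialized as JSONL. The class name `RetrievalExample`, the helper names, and the output keys `query`, `pos`, and `neg` are assumptions chosen for illustration; a given fine-tuning script may expect a different schema, so check its documentation before using this layout. The Spanish sentences are made-up sample data.

```python
import json
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass
class RetrievalExample:
    """One dataset record: a query, its relevant documents, and hard negatives."""
    query: str
    positives: List[str]
    hard_negatives: List[str]


def to_training_rows(example: RetrievalExample) -> Iterator[Dict[str, object]]:
    """Yield one row per positive; every row shares the same hard negatives."""
    for positive in example.positives:
        yield {
            "query": example.query,
            "pos": positive,
            "neg": list(example.hard_negatives),
        }


def to_jsonl(examples: List[RetrievalExample]) -> str:
    """Serialize all rows as JSONL, keeping accented characters readable."""
    lines = []
    for example in examples:
        for row in to_training_rows(example):
            # ensure_ascii=False writes "¿" and "ó" as-is instead of \u escapes.
            lines.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(lines)


# Made-up sample record (hypothetical legal-domain data).
example = RetrievalExample(
    query="¿Cuál es el plazo para presentar un recurso de apelación?",
    positives=[
        "El recurso de apelación debe presentarse dentro del plazo legal establecido.",
        "El plazo para apelar comienza a contar desde la notificación de la sentencia.",
    ],
    hard_negatives=[
        "El recurso de casación se interpone ante el tribunal superior.",
    ],
)

rows = list(to_training_rows(example))
jsonl = to_jsonl([example])
```

Emitting one row per positive keeps each training instance a simple (query, positive, negatives) triple, which is a common input shape for contrastive training; whether the negatives should be capped or sampled per row depends on the training script.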

Could you provide some examples or recommendations for configuring the model to handle this structure effectively? Additionally:

Are there specific pre-processing steps required to handle Spanish text or domain-specific terminology?
Does the model have any inherent support for Spanish, or are there additional considerations when working with non-English datasets?
Are there examples or guidelines available for fine-tuning the model on a retrieval task with this format?
I would greatly appreciate any insights, examples, or resources that could help in this process.

Hello,
Is there a fine-tuning script for this model? It would be interesting to tune it for downstream tasks.
Thanks!

Hello, you can use our open-source project to fine-tune Alibaba-NLP/gte-Qwen2-7B-instruct: https://github.com/NLPJCL/RAG-Retrieval/tree/master/rag_retrieval/train/embedding

Great, I will try this script. Thanks for your advice!
