MTEB reproduction

#21
by Samoed - opened

Hi! I wanted to add your model to MTEB so that everyone can easily run it using the platform. I used the following prompts for your model (full code here):

STELLA_S2S_PROMPT = "Instruct: Retrieve semantically similar text.\nQuery: "
STELLA_S2P_PROMPT = "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: "

STELLA_PROMPTS = {
    "query": STELLA_S2P_PROMPT,
    "passage": "",
    "STS": STELLA_S2S_PROMPT,
    "PairClassification": STELLA_S2S_PROMPT,
    "BitextMining": STELLA_S2S_PROMPT,
}
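For readers following along, here is a minimal sketch of how a mapping like this could be applied to input text. The lookup order (task type first, then "query"/"passage", then no prompt) and the helper names `select_prompt` and `build_input` are assumptions for illustration, not necessarily what MTEB does internally.

```python
STELLA_S2S_PROMPT = "Instruct: Retrieve semantically similar text.\nQuery: "
STELLA_S2P_PROMPT = (
    "Instruct: Given a web search query, retrieve relevant passages "
    "that answer the query.\nQuery: "
)

STELLA_PROMPTS = {
    "query": STELLA_S2P_PROMPT,
    "passage": "",
    "STS": STELLA_S2S_PROMPT,
    "PairClassification": STELLA_S2S_PROMPT,
    "BitextMining": STELLA_S2S_PROMPT,
}


def select_prompt(task_type, prompt_type=None):
    """Pick a prompt (hypothetical lookup order, see the note above)."""
    # 1. A task-type entry such as "STS" takes priority.
    if task_type in STELLA_PROMPTS:
        return STELLA_PROMPTS[task_type]
    # 2. Otherwise fall back to "query" / "passage" if one was given.
    if prompt_type in STELLA_PROMPTS:
        return STELLA_PROMPTS[prompt_type]
    # 3. No matching entry: encode the raw text without a prompt.
    return ""


def build_input(text, task_type, prompt_type=None):
    """Prepend the selected prompt to the text that will be encoded."""
    return select_prompt(task_type, prompt_type) + text
```

Under this assumed lookup, Classification and Clustering inputs get no prompt at all, because the mapping has no entry for those task types.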

I've obtained similar results for Pair Classification and STS tasks, but the overall scores don't fully match yours. Could you share more details on how your implementation was set up for MTEB?
Here are the full results of my run.

If the difference is fairly small, it's probably because the reported MTEB results for this model use the full 8192 dimensions, though that difference should be small:

Generally speaking, 1024d is good enough. The MTEB score of 1024d is only 0.001 lower than 8192d.

  • Tom Aarsen

For some tasks the difference is big. For example, on the leaderboard AmazonCounterfactualClassification (en) has 92.36, but I got 72.5. Is it possible to load the 8192d version using SentenceTransformer? I can't find this in the README or in the Sentence Transformers docs.
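One possible route, not confirmed anywhere in this thread: as far as I recall, the model card describes the repository as shipping several `2_Dense_<dim>` folders, with `modules.json` pointing at `2_Dense_1024` by default. Assuming that layout holds, a local clone could be switched to another dimension with a sketch like the one below. The helper name `switch_dense_dim` is hypothetical, and the folder naming should be checked against the actual repository before relying on it.

```python
import json
from pathlib import Path


def switch_dense_dim(model_dir, dim):
    """Point the Dense module in modules.json at 2_Dense_<dim>.

    Assumes a local clone whose modules.json has an entry with a
    path like "2_Dense_1024" (an assumption about the repo layout).
    """
    modules_path = Path(model_dir) / "modules.json"
    modules = json.loads(modules_path.read_text(encoding="utf-8"))

    changed = False
    for module in modules:
        if module.get("path", "").startswith("2_Dense_"):
            module["path"] = f"2_Dense_{dim}"
            changed = True
    if not changed:
        raise ValueError("no 2_Dense_* entry found in modules.json")

    modules_path.write_text(json.dumps(modules, indent=2), encoding="utf-8")
    return modules
```

After editing, the local directory (rather than the hub name) would be passed to `SentenceTransformer(...)`.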

I opened a PR in the mteb repo with my implementation and results. Here is a full comparison of the results.

Classification

| model name | AmazonCounterfactualClassification (en) | EmotionClassification | ToxicConversationsClassification |
| --- | --- | --- | --- |
| stella_en_400M_v5 (leaderboard) | 92.36 | 78.77 | 89.94 |
| stella_en_400M_v5 | 72.59 | 56.48 | 66.11 |

Clustering

| model name | ArxivClusteringS2S | RedditClustering |
| --- | --- | --- |
| stella_en_400M_v5 (leaderboard) | 49.82 | 71.19 |
| stella_en_400M_v5 | 45.54 | 60.75 |

PairClassification

| model name | SprintDuplicateQuestions | TwitterSemEval2015 |
| --- | --- | --- |
| stella_en_400M_v5 (leaderboard) | 95.59 | 80.18 |
| stella_en_400M_v5 | 94.44 | 80.26 |

Reranking

| model name | SciDocsRR | AskUbuntuDupQuestions |
| --- | --- | --- |
| stella_en_400M_v5 (leaderboard) | 88.44 | 66.15 |
| stella_en_400M_v5 | 86.40 | 62.90 |

Retrieval

| model name | SCIDOCS | SciFact |
| --- | --- | --- |
| stella_en_400M_v5 (leaderboard) | 25.04 | 78.23 |
| stella_en_400M_v5 | 23.96 | 77.96 |

STS

| model name | STS16 | STSBenchmark |
| --- | --- | --- |
| stella_en_400M_v5 (leaderboard) | 87.14 | 87.74 |
| stella_en_400M_v5 | 87.00 | 87.56 |

Summarization

| model name | SummEval |
| --- | --- |
| stella_en_400M_v5 (leaderboard) | 31.66 |
| stella_en_400M_v5 | 30.59 |
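To see at a glance where the reproduction diverges, the numbers in the comparison above can be turned into per-task gaps. This is a small sketch using only the scores listed in this post; the 5-point threshold and the helper names `score_gaps` and `large_gaps` are arbitrary choices for illustration.

```python
# (leaderboard score, reproduced score) per task, copied from the tables above.
SCORES = {
    "AmazonCounterfactualClassification (en)": (92.36, 72.59),
    "EmotionClassification": (78.77, 56.48),
    "ToxicConversationsClassification": (89.94, 66.11),
    "ArxivClusteringS2S": (49.82, 45.54),
    "RedditClustering": (71.19, 60.75),
    "SprintDuplicateQuestions": (95.59, 94.44),
    "TwitterSemEval2015": (80.18, 80.26),
    "SciDocsRR": (88.44, 86.40),
    "AskUbuntuDupQuestions": (66.15, 62.90),
    "SCIDOCS": (25.04, 23.96),
    "SciFact": (78.23, 77.96),
    "STS16": (87.14, 87.00),
    "STSBenchmark": (87.74, 87.56),
    "SummEval": (31.66, 30.59),
}


def score_gaps(scores):
    """Return leaderboard minus reproduced score for every task."""
    return {name: round(lb - rep, 2) for name, (lb, rep) in scores.items()}


def large_gaps(scores, threshold=5.0):
    """List tasks whose gap exceeds the threshold, largest gap first."""
    gaps = score_gaps(scores)
    flagged = [name for name, gap in gaps.items() if gap > threshold]
    return sorted(flagged, key=lambda name: gaps[name], reverse=True)


if __name__ == "__main__":
    for task in large_gaps(SCORES):
        print(task, score_gaps(SCORES)[task])
```

With these numbers, the three classification tasks and RedditClustering are the only ones above the threshold; the pair classification, retrieval, and STS gaps are all small.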

Full results

@infgrad Can you provide details on how you ran the MTEB evaluation?

I tried to run your model with gte_loader:

stella_en_400M = ModelMeta(
    loader=partial(
        gte_loader,
        model_name_or_path="dunzhang/stella_en_400M_v5",
        attn="cccc",
        pooling_method="lasttoken",
        mode="embedding",
        torch_dtype="auto",
        # The ST script does not normalize while the HF one does so unclear what to do
        # https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct#sentence-transformers
        normalized=True,
    ),
    name="dunzhang/stella_en_400M_v5",
    languages=["eng_Latn"],
    open_source=True,
    revision="1bb50bc7bb726810eac2140e62155b88b0df198f",
    release_date="2024-07-12",
)

This gives better results, but they are still lower than the ones you reported.

StellaEncoder org

Hi, @Samoed
Try these settings:

  1. max_len = 400
  2. do not normalize vectors for Classification tasks
  3. use the e5-mistral prompts; the stella model's evaluation is the same as for e5-mistral or gte-qwen2
  4. run inference in bf16, e.g. load_dtype = torch.bfloat16
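As an illustration of settings 1 and 2, the sketch below normalizes embeddings for every task type except Classification. It is pure Python so it runs without any model; `MAX_LEN`, `l2_normalize`, and `postprocess` are hypothetical names, and a real evaluation would apply the same rule to the encoder output (and pass the length limit to the tokenizer) rather than to plain lists.

```python
import math

# Suggested maximum sequence length from the list above (not used in this
# toy example; it would be passed to the tokenizer in a real run).
MAX_LEN = 400


def l2_normalize(vec):
    """Scale a vector to unit L2 norm; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def postprocess(embeddings, task_type):
    """Normalize embeddings unless the task type is Classification."""
    if task_type == "Classification":
        # Setting 2: keep the raw vectors for classification tasks.
        return [list(vec) for vec in embeddings]
    return [l2_normalize(vec) for vec in embeddings]
```

For example, `postprocess([[3.0, 4.0]], "STS")` yields a unit-length vector, while the same call with `"Classification"` returns the input values untouched.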

It's been so long I can't remember the details.

However, I have recently been working on a multimodal encoder. As part of that work, I will need to reproduce stella's results, and I will upload the evaluation scripts.

Finally, if you still cannot reproduce the results, you can wait a while for those scripts.

Thank you!

Samoed changed discussion status to closed
