💬 Discussion thread: Model scores and model performances 💬

#265
by clefourrier HF staff - opened
Open LLM Leaderboard org

Hi!
From time to time, people open discussions to talk about their favorite models' scores, the evolution of different model families through time, etc.
We love seeing these threads, and very cool insights often emerge from them, but they are not actionable for us (as they are discussions, not issues).
As such, it would be simpler for the OpenLLMLeaderboard team to have all these discussions centralized in one place.
This is what this discussion thread will be for. It will be kept pinned and open to allow users to debate and discuss.

TLDR:

  • If you want to discuss models and scores with the people who frequent the leaderboard, do so here.
  • If it's about something we can fix, open a new issue.

Thank you so much for reading :)

clefourrier pinned discussion

Ok, I will put it here:
On the leaderboard, the difference between the average scores of OPT-66b and a good fine-tuned 3B model (acrastt/Marx-3B-V2) is quite small.
That seems odd to me. OPT is older and has seen less, and possibly lower-quality, text data than this openllama-3b fine-tune.
But it is still weird. The model is 22 times larger. What a waste. Imagine you'd get llama2-70b performance out of a model with 1.5 trillion parameters.
That's not good. It's disproportionate.

It's not just lower-quality data, it's an old design. That will obviously happen, and newer models will perform better.

I'd guess the data is the main reason, or something else is not correct. There are, for example, NVIDIA-made GPT-2 versions which perform surprisingly well, just by being trained on more tokens.
That's not a change to the architecture. You could use the OPT architecture and train a competitive, high-performance model with the right data and enough compute.
While the 3B model has seen more tokens than opt-66b, the difference is not so large that one ought to expect a 66B model to look bad next to it.
This is weird.

If it is indeed due to OPT's lower pretraining data quality, they must have put a lot of low-quality data in it.
That can happen. Meta also made a megatron gpt2 with 11b parameters. Total trash. Sometimes they don't get it right.
But it is really not good. OPT-175b was supposed to have roughly gpt-3 (old version) performance. I still have my doubts it is just a worse dataset.

Though check out the opt-66b MMLU score. It's really bad. That must have been some dataset.

Also, OPT was only trained on 180B tokens, while openllama 3b was trained on 1 trillion tokens. The number of pretraining tokens should also make a massive difference.
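To put rough numbers on that, here is a minimal sketch that computes training tokens per parameter for both models. The parameter and token counts are simply the approximate figures quoted in this thread, not independently verified values, and the dictionary keys are just labels.

```python
# Approximate figures as quoted in this thread (assumptions, not verified).
models = {
    "opt-66b": {"params": 66e9, "tokens": 180e9},
    "open_llama_3b": {"params": 3e9, "tokens": 1e12},
}


def tokens_per_param(params: float, tokens: float) -> float:
    """Return the number of training tokens seen per model parameter."""
    return tokens / params


ratios = {name: tokens_per_param(**m) for name, m in models.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} tokens per parameter")
```

With these figures, the 3B model has seen on the order of a hundred times more tokens per parameter than opt-66b, even though the raw token counts differ by less than a factor of six.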

Would have thought they at least replicated the size of GPT-3 pretraining data. Apparently not the case. A scientific artefact. Good it brought us to llama.

Hey everyone! I think claiming that the MetaMath data is contaminated is an unfounded allegation. All the MetaMathQA data is sourced from the GSM8K and MATH train sets. Our data augmentation process does not involve any data from the test set.
We have transparently disclosed the source of each data point, and you can verify it here: https://huggingface.co/datasets/meta-math/MetaMathQA
Additionally, we have disclosed the code for obtaining MetaMathQA data, which can be checked at: https://github.com/meta-math/MetaMath/tree/main/code_for_generating_data

The MetaMath project is entirely transparent and open-source, encompassing code (for both data augmentation and model training), models (a range of models), and data (comprising all our data along with its sources). Anyone interested in contributing is welcome to join us on Hugging Face. Again, allegations of contamination in the MetaMathQA data are detrimental to us (I personally feel quite disheartened). We have never utilized any test data, and all our data and models are transparently and openly available: https://huggingface.co/meta-math

Hi, my friends, @Mihaiii @clefourrier @Weyaxi @euclaise @killawhale2 @Q-bert @ChuckMcSneed
The MetaMath project is fully transparent and open-source, including all the data, models, and code.
MetaMath is always eager to make more contributions to open-source LLMs. If you have any questions, we would be more than happy to help!

One thing I am unsure about is the accuracy of this leak detection code. For example, when training solely on the GSM8K train set, the detection scores for a model trained for 1 epoch and a model trained for 20 epochs may differ, even though the data itself is unchanged.

https://github.com/swj0419/detect-pretrain-code-contamination is an excellent repository. However, its detection of data contamination might not be that precise, which could explain why MetaMath, which was only trained on the train set, was mistakenly flagged as contaminated.

Open LLM Leaderboard org

Hi @Longhui98 ,
Super cool to get this feedback from you, the transparency is great! :)

Side question, if you have the time: did you account for self-contamination in MATH when building your dataset? It's not a trivial thing to anticipate, so I was wondering if you had taken it into account
(like lmsys reported in their contamination blog post).

Hey all, just wanted to make sure I understood the main takeaways from this thread in regard to the MetaMathQA dataset:

  1. There have been concerns that MetaMathQA may be contaminated with GSM8K
  2. Tests using a public contamination detection tool indicate that this may be the case
  3. This tool is not 100% accurate, and it's possible that both the reference model and the threshold may be playing significant roles here
  4. LMSys reported that rephrasing the test set of a dataset is still a form of contamination
  5. The MetaMathQA developer has made it clear that the dataset was constructed using the train split of GSM8K, and not the test set

Given that only the train set was used and not the test set, the LMSys report isn't necessarily relevant to this situation (unless GSM8K itself has train/test contamination I am unaware of). Taking that together with the hesitancy toward the results of the contamination detection tool and the transparency from the MetaMathQA developer, is the current consensus that MetaMathQA is not contaminated and that we are safe to train models using this dataset?

I may have missed something so please let me know if I have misread or misinterpreted any of the information here! Thanks :)

Open LLM Leaderboard org

Hi @OxxoCodes !
It's an accurate summary, thank you :)

Just to give you an example of how rephrasing the test set can be a big contamination factor (example with MATH, from the LMSYS report):
[image: example of a rephrased MATH question from the LMSYS report]

I think the main problem is that it's unclear how much of GSM8K is cross-contaminated between its train and test sets, and someone would need to go look at each sample (I have not had the time so far; I think I'll have the bandwidth at the end of Feb). There are examples of rephrases between train and test of GSM8K in the LMSYS paper, but they are not as bad as the above example (which would probably be the only kind of rephrase I would consider contamination).

So to answer your question, I think you can fine-tune with MetaMathQA. Once we have the time to go back to GSM8K, if we find that some test examples are contaminated by the train set, we'll remove them a posteriori from the score computations and recompute the scores for every model, which shouldn't be too costly as we store all the predictions.
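As a rough illustration of why that recomputation is cheap when predictions are stored, here is a minimal sketch. The storage format (a dict from example id to whether the stored prediction was correct), the example ids, and the function name are all hypothetical and not the leaderboard's actual code.

```python
def recompute_accuracy(predictions: dict, contaminated_ids: set) -> float:
    """Accuracy over stored per-example results, skipping flagged test examples.

    predictions maps a (hypothetical) example id to True/False, i.e. whether
    the stored model prediction for that example was scored as correct.
    """
    clean = [ok for ex_id, ok in predictions.items()
             if ex_id not in contaminated_ids]
    if not clean:
        raise ValueError("no uncontaminated examples left to score")
    return sum(clean) / len(clean)


# Made-up stored results for one model on four test examples.
stored = {"q1": True, "q2": False, "q3": True, "q4": True}
flagged = {"q3"}  # test examples found to overlap with the train set

print(recompute_accuracy(stored, set()))    # original score
print(recompute_accuracy(stored, flagged))  # score without flagged examples
```

No model has to be re-run: only the filtering and the average change.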

I question whether the MATH train/test contamination is actually "rephrasing"; it seems more likely to me that the questions are just coincidentally semantically identical. It's a relatively simple question, so two people independently coming up with the same one doesn't seem implausible to me. Further, some extent of semantic similarity is unavoidable, so there needs to be a cutoff chosen with some principle.
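To make the cutoff question concrete, here is a small sketch of one possible overlap measure: Jaccard similarity over word trigrams, with an arbitrary threshold. This is not what the LMSYS report or the detection repository uses. It only measures lexical overlap, so it would miss true paraphrases, and the 0.5 cutoff is a placeholder assumption rather than a principled choice.

```python
def word_ngrams(text: str, n: int = 3) -> set:
    """Set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of word n-grams: 0.0 means disjoint, 1.0 means identical sets."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


CUTOFF = 0.5  # arbitrary placeholder; picking this value is the open question


def is_near_duplicate(a: str, b: str, cutoff: float = CUTOFF) -> bool:
    """Flag a pair as a near-duplicate when overlap reaches the cutoff."""
    return jaccard_similarity(a, b) >= cutoff
```

Wherever the cutoff is set, some pairs will land just on either side of it, which is why the choice needs a justification.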

@clefourrier Sounds great, thanks! :)

A.I. models score high on gaokao language tests, low in math
https://m.youtube.com/watch?v=dQfFsRyYwM8

clefourrier changed discussion status to closed
