[Feedback welcome] Add evaluation results to model card metadata
This is a work in progress. The goal is to list evaluation results in the model card metadata, especially the results from the Open LLM Leaderboard. This PR has not been created automatically.
Pending questions:
- Should we report all metrics for each task? (especially the _stderr ones?) Or only the one that is displayed in the LLM Leaderboard?
- Are the dataset type/name/config/split/num_few_shot accurate in the suggested changes?
- How to report the MMLU results? There are 57 different hendrycksTest datasets for a total of 228 metrics?
- How to report MT-Bench results? (asking since they are reported in the model card but not in the metadata)
- How to report AlpacaEval results? (asking since they are reported in the model card but not in the metadata)
Related thread: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/370#65663f60589e212284db2ffc.
Related PR in the Hub docs: https://github.com/huggingface/hub-docs/pull/1144.
Thanks to @clefourrier who guided me with the Open LLM Leaderboard results.
Should we report all metrics for each task? (especially the _stderr ones?) Or only the one that is displayed in the LLM Leaderboard?
- In my opinion, the one displayed on the LLM Leaderboard would be a better choice because that is the result people generally want to see. Also, reporting every metric could make things a little confusing. On the other hand, the other metrics would give a more detailed view of the results.
How to report the MMLU results? There are 57 different hendrycksTest datasets for a total of 228 metrics?
- Hmm, I think something like 'Overall MMLU' could work, but I'm not sure about that.
- For the leaderboard, only one metric (the reported one as @Weyaxi suggested) should be enough, especially if you provide the hyperlink to the details
- From a first look:
- ARC: OK
- HellaSwag: dataset = hellaswag, split = validation
- DROP: the harness actually applies a post-processing step to the drop dataset, but I think saying drop should be fine anyway; split = validation
- TruthfulQA: OK
- GSM8K: config = main
- MMLU: dataset = cais/mmlu, config = all of them (if you want to provide the list it's in the about of the leaderboard), split = test
- Winogrande: dataset = winogrande, config = winogrande_xl, split = validation
- For MMLU, we report the average of all acc scores, so "Aggregated MMLU", with "avg(acc)" as the metric, for example.
- People wanting the details should go read them themselves, as it's just going to be overwhelming otherwise.
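As a rough illustration of the "avg(acc)" suggestion above, here is a minimal Python sketch that collapses per-subtask MMLU results into one aggregated number. The shape of the input dict (task name mapped to a dict of metrics such as acc and acc_stderr) and the function name `aggregate_mmlu` are assumptions made for this example, not the harness's or the leaderboard's actual code.

```python
from statistics import mean


def aggregate_mmlu(results, marker="hendrycksTest"):
    """Average the `acc` metric over every task whose name contains `marker`.

    `results` is assumed (hypothetically) to map a task name to a dict of
    metrics, e.g. {"hendrycksTest-anatomy": {"acc": 0.5, "acc_stderr": 0.04}}.
    Other metrics (acc_stderr, acc_norm, ...) are ignored on purpose.
    """
    accs = [metrics["acc"] for task, metrics in results.items() if marker in task]
    if not accs:
        raise ValueError(f"no task name contains {marker!r}")
    return {
        "name": "Aggregated MMLU",
        "metric": "avg(acc)",
        "value": mean(accs),
        "num_subtasks": len(accs),
    }


if __name__ == "__main__":
    # Made-up numbers, only to show the call; not real evaluation results.
    example = {
        "hendrycksTest-anatomy": {"acc": 0.5, "acc_stderr": 0.04},
        "hendrycksTest-astronomy": {"acc": 0.75, "acc_stderr": 0.03},
        "arc_challenge": {"acc": 0.9},
    }
    print(aggregate_mmlu(example))
```

With the full set of 57 hendrycksTest subtasks, this would reduce the 228 per-subtask metrics to the single value that goes into the metadata.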
Thanks both for the feedback!
I pushed changes in 5ae48397:
- only 1 metric per benchmark (keeping the one on the leaderboard as suggested)
- add MMLU results => keep only 1 global result
- add Winogrande (thanks @clefourrier for noticing it was missing :D)
- corrected the few dataset/config/split that were not accurate.
Looks like we have a good final version now :)
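For readers who want to see what "1 metric per benchmark" looks like in practice, here is a minimal Python sketch that builds one result entry in the general shape of the Hub's model-index metadata (task, dataset, metrics). The helper name `make_result`, the task type, the metric value, and the few-shot count are illustrative assumptions; only the Winogrande dataset/config/split come from the feedback in this thread.

```python
def make_result(dataset_name, dataset_type, split, metric_type, value,
                config=None, num_few_shot=None):
    """Build one model-index style result with exactly one metric.

    Hypothetical helper for illustration; optional fields are only
    emitted when they are provided.
    """
    dataset = {"name": dataset_name, "type": dataset_type, "split": split}
    if config is not None:
        dataset["config"] = config
    if num_few_shot is not None:
        dataset["args"] = {"num_few_shot": num_few_shot}
    return {
        "task": {"type": "text-generation"},  # assumed task type
        "dataset": dataset,
        "metrics": [{"type": metric_type, "value": value}],
    }


if __name__ == "__main__":
    winogrande = make_result(
        dataset_name="Winogrande",
        dataset_type="winogrande",   # from the review feedback
        config="winogrande_xl",      # from the review feedback
        split="validation",          # from the review feedback
        metric_type="acc",           # assumed to be the leaderboard metric
        value=0.0,                   # placeholder, not a real score
        num_few_shot=5,              # assumed few-shot setting
    )
    print(winogrande)
```

The resulting dicts could then be serialized to YAML under the results key of a model-index entry; the exact field list should be checked against the Hub docs PR linked above.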
Thanks a lot for adding this clean evaluation index! I think for AlpacaEval we can point the type to https://huggingface.co/datasets/tatsu-lab/alpaca_eval
Apart from that, this LGTM
Thanks everyone for the feedback! Let's merge this :)
great PR all!