HEAD_TEXT = """
This is the official leaderboard for the 🏅StructEval benchmark. Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a **structured assessment across multiple cognitive levels and critical concepts**, and therefore offers a comprehensive, robust and consistent evaluation for LLMs.

Please refer to the 🐱[StructEval repository](https://github.com/c-box/StructEval) for model evaluation and 📖[our paper](https://arxiv.org/abs/2408.03281) for experimental analysis.

🚀 **_Latest News_**
* [2024.8.6] We released the first version of the StructEval leaderboard, which includes 22 open-source language models. More datasets and models are coming soon🔥🔥🔥.

* [2024.7.31] We regenerated the StructEval Benchmark based on the latest [Wikipedia](https://www.wikipedia.org/) pages (20240601) using the [GPT-4o-mini](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) model, to minimize the impact of data contamination🔥🔥🔥.
"""

ABOUT_TEXT = """# What is StructEval?
Evaluation is the baton for the development of large language models.
Current evaluations typically employ *a single-item assessment paradigm* for each atomic test objective, which struggles to discern whether a model genuinely possesses the required capabilities or merely memorizes or guesses the answers to specific questions.
To this end, we propose a novel evaluation framework referred to as ***StructEval***.
Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a **structured assessment across multiple cognitive levels and critical concepts**, and therefore offers a comprehensive, robust and consistent evaluation for LLMs.
Experiments demonstrate that **StructEval serves as a reliable tool for resisting the risk of data contamination and reducing the interference of potential biases**, thereby providing more reliable and consistent conclusions regarding model capabilities.
Our framework also sheds light on the design of future principled and trustworthy LLM evaluation protocols.

# How to evaluate?
Our 🐱[repo](https://github.com/c-box/StructEval) provides easy-to-use scripts for both evaluating LLMs on existing StructEval benchmarks and generating new benchmarks based on the StructEval framework.

# Contact
If you have any questions, feel free to reach out to us at [boxi2020@iscas.ac.cn](mailto:boxi2020@iscas.ac.cn).
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"

CITATION_BUTTON_TEXT = r"""
coming soon
"""

ACKNOWLEDGEMENT_TEXT = """
Inspired by the [🤗 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
"""

NOTES_TEXT = """
* Base benchmark refers to the original dataset, while struct benchmarks refer to the benchmarks constructed using StructEval with these base benchmarks as seed data.
* For most models on base MMLU, we collected the results from their official technical reports. For models whose results have not been reported, we use [opencompass](https://opencompass.org.cn/home) for evaluation.
* For the other 2 base benchmarks and all 3 structured benchmarks, we evaluate chat models under the 0-shot setting and completion models under the 0-shot setting with ppl. We keep the prompt format consistent across all benchmarks.
"""