djstrong committed on
Commit
39d6a74
·
1 Parent(s): 96fbe7c

update description

Browse files
Files changed (1) hide show
  1. src/about.py +7 -7
src/about.py CHANGED
@@ -33,10 +33,10 @@ class Tasks(Enum):
33
  task17 = Task("polish_cbd_regex", "f1,score-first", "cbd_g", "generate_until", 0.149)
34
  task18 = Task("polish_klej_ner_multiple_choice", "acc,none", "klej_ner_mc", "multiple_choice", 0.343)
35
  task19 = Task("polish_klej_ner_regex", "exact_match,score-first", "klej_ner_g", "generate_until", 0.343)
36
- task20 = Task("polish_poleval2018_task3_test_10k", "word_perplexity,none", "poleval2018_task3_test_10k", "other")
37
  task21 = Task("polish_polqa_reranking_multiple_choice", "acc,none", "polqa_reranking_mc", "multiple_choice", 0.5335588952710677) # multiple_choice
38
  task22 = Task("polish_polqa_open_book", "levenshtein,none", "polqa_open_book_g", "generate_until", 0.0) # generate_until
39
  task23 = Task("polish_polqa_closed_book", "levenshtein,none", "polqa_closed_book_g", "generate_until", 0.0) # generate_until
 
40
 
41
  NUM_FEWSHOT = 0 # Change with your few shot
42
  # ---------------------------------------------------
@@ -58,9 +58,11 @@ TITLE = """
58
  INTRODUCTION_TEXT = """
59
  The leaderboard evaluates language models on a set of Polish tasks. The tasks are designed to test the models' ability to understand and generate Polish text. The leaderboard is designed to be a benchmark for the Polish language model community, and to help researchers and practitioners understand the capabilities of different models.
60
 
61
- Almost every task has two versions: regex and multiple choice. The regex version is scored based on exact match, while the multiple choice version is scored based on accuracy.
62
  * _g suffix means that a model needs to generate an answer (only suitable for instructions-based models)
63
  * _mc suffix means that a model is scored against every possible class (suitable also for base models)
 
 
64
  """
65
 
66
  # Which evaluations are you running? how can people reproduce what you have?
@@ -75,11 +77,9 @@ or join our [Discord SpeakLeash](https://discord.gg/3G9DVM39)
75
 
76
  * fix long model names
77
  * add inference time
78
- * add metadata for models (e.g. #Params)
79
  * add more tasks
80
  * use model templates
81
  * fix scrolling on Firefox
82
- * polish_poleval2018_task3_test_10k - IN PROGRESS
83
 
84
  ## Tasks
85
 
@@ -103,10 +103,10 @@ or join our [Discord SpeakLeash](https://discord.gg/3G9DVM39)
103
  | cbd_g | ptaszynski/PolishCyberbullyingDataset | macro F1 | generate_until |
104
  | klej_ner_mc | allegro/klej-nkjp-ner | accuracy | multiple_choice |
105
  | klej_ner_g | allegro/klej-nkjp-ner | accuracy | generate_until |
 
 
 
106
  | poleval2018_task3_test_10k | enelpol/poleval2018_task3_test_10k | word perplexity | other |
107
- | polqa_reranking_mc | ipipan/polqa | accuracy | other |
108
- | polqa_open_book_g | ipipan/polqa | levenshtein | other |
109
- | polqa_closed_book_g | ipipan/polqa | levenshtein | other |
110
 
111
  ## Reproducibility
112
  To reproduce our results, you need to clone the repository:
 
33
  task17 = Task("polish_cbd_regex", "f1,score-first", "cbd_g", "generate_until", 0.149)
34
  task18 = Task("polish_klej_ner_multiple_choice", "acc,none", "klej_ner_mc", "multiple_choice", 0.343)
35
  task19 = Task("polish_klej_ner_regex", "exact_match,score-first", "klej_ner_g", "generate_until", 0.343)
 
36
  task21 = Task("polish_polqa_reranking_multiple_choice", "acc,none", "polqa_reranking_mc", "multiple_choice", 0.5335588952710677) # multiple_choice
37
  task22 = Task("polish_polqa_open_book", "levenshtein,none", "polqa_open_book_g", "generate_until", 0.0) # generate_until
38
  task23 = Task("polish_polqa_closed_book", "levenshtein,none", "polqa_closed_book_g", "generate_until", 0.0) # generate_until
39
+ task20 = Task("polish_poleval2018_task3_test_10k", "word_perplexity,none", "poleval2018_task3_test_10k", "other")
40
 
41
  NUM_FEWSHOT = 0 # Change with your few shot
42
  # ---------------------------------------------------
 
58
  INTRODUCTION_TEXT = """
59
  The leaderboard evaluates language models on a set of Polish tasks. The tasks are designed to test the models' ability to understand and generate Polish text. The leaderboard is designed to be a benchmark for the Polish language model community, and to help researchers and practitioners understand the capabilities of different models.
60
 
61
+ Almost every task has two versions: regex and multiple choice.
62
  * _g suffix means that a model needs to generate an answer (only suitable for instructions-based models)
63
  * _mc suffix means that a model is scored against every possible class (suitable also for base models)
64
+
65
+ Average columns are normalized against scores by "Baseline (majority class)".
66
  """
67
 
68
  # Which evaluations are you running? how can people reproduce what you have?
 
77
 
78
  * fix long model names
79
  * add inference time
 
80
  * add more tasks
81
  * use model templates
82
  * fix scrolling on Firefox
 
83
 
84
  ## Tasks
85
 
 
103
  | cbd_g | ptaszynski/PolishCyberbullyingDataset | macro F1 | generate_until |
104
  | klej_ner_mc | allegro/klej-nkjp-ner | accuracy | multiple_choice |
105
  | klej_ner_g | allegro/klej-nkjp-ner | accuracy | generate_until |
106
+ | polqa_reranking_mc | ipipan/polqa | accuracy | multiple_choice |
107
+ | polqa_open_book_g | ipipan/polqa | levenshtein | generate_until |
108
+ | polqa_closed_book_g | ipipan/polqa | levenshtein | generate_until |
109
  | poleval2018_task3_test_10k | enelpol/poleval2018_task3_test_10k | word perplexity | other |
 
 
 
110
 
111
  ## Reproducibility
112
  To reproduce our results, you need to clone the repository: