NeMo
okuchaiev committed · verified
Commit 70af0cf · 1 Parent(s): fd0705b

Update README.md

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -241,7 +241,7 @@ Evaluated using MT-Bench judging by GPT-4-0125-Preview as described in Appendix

 #### IFEval

- Evaluated using the Instruction Following Eval (IFEval) introduced in [Instruction-Following Evaluation for Large Language Models](https://arxiv.org/pdf/2311.07911).
+ Evaluated using the Instruction Following Eval (IFEval) introduced in Instruction-Following Evaluation for Large Language Models.

 | Prompt-Strict Acc | Instruction-Strict Acc |
 | :----------------------- | :---------------------------- |
@@ -249,7 +249,7 @@ Evaluated using the Instruction Following Eval (IFEval) introduced in [Instructi

 #### MMLU

- Evaluated using the Multi-task Language Understanding benchmarks as introduced in [Measuring Massive Multitask Language Understanding](https://arxiv.org/pdf/2009.03300)
+ Evaluated using the Multi-task Language Understanding benchmarks as introduced in Measuring Massive Multitask Language Understanding.

 |MMLU 0-shot |
 | :----------------- |
@@ -257,7 +257,7 @@ Evaluated using the Multi-task Language Understanding benchmarks as introduced i

 #### GSM8K

- Evaluated using the Grade School Math 8K (GSM8K) benchmark as introduced in [Training Verifiers to Solve Math Word Problems](https://arxiv.org/pdf/2110.14168v2).
+ Evaluated using the Grade School Math 8K (GSM8K) benchmark as introduced in Training Verifiers to Solve Math Word Problems.

 | GSM8K 0-shot |
 | :----------------- |
@@ -265,7 +265,7 @@ Evaluated using the Grade School Math 8K (GSM8K) benchmark as introduced in [Tra

 #### HumanEval

- Evaluated using the HumanEval benchmark as introduced in [Evaluating Large Language Models Trained on Code](https://arxiv.org/pdf/2107.03374).
+ Evaluated using the HumanEval benchmark as introduced in Evaluating Large Language Models Trained on Code.


 | HumanEval 0-shot |
@@ -274,7 +274,7 @@ Evaluated using the HumanEval benchmark as introduced in [Evaluating Large Langu

 #### MBPP

- Evaluated using the MBPP Dataset as introduced in the [Program Synthesis with Large Language Models](https://arxiv.org/abs/2108.07732) paper.
+ Evaluated using the MBPP Dataset as introduced in the Program Synthesis with Large Language Models.

 | MBPP 0-shot|
 | :----------------- |
@@ -283,7 +283,7 @@ Evaluated using the MBPP Dataset as introduced in the [Program Synthesis with La

 #### Arena Hard

- Evaluated using the [Arena-Hard Pipeline](https://lmsys.org/blog/2024-04-19-arena-hard/) from the LMSys Org.
+ Evaluated using the Arena-Hard Pipeline from the LMSys Org.

 | Arena Hard |
 | :----------------- |
@@ -291,7 +291,7 @@ Evaluated using the [Arena-Hard Pipeline](https://lmsys.org/blog/2024-04-19-aren

 #### AlpacaEval 2.0 LC

- Evaluated using the AlpacaEval 2.0 LC (Length Controlled) as introduced in the paper: [Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators](https://arxiv.org/abs/2404.04475)
+ Evaluated using the AlpacaEval 2.0 LC (Length Controlled) as introduced in the paper: Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

 | AlpacaEval 2.0 LC|
 | :----------------- |
@@ -300,7 +300,7 @@ Evaluated using the AlpacaEval 2.0 LC (Length Controlled) as introduced in the p

 #### TFEval

- Evaluated using the CantTalkAboutThis Dataset as introduced in the [CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues](https://arxiv.org/abs/2404.03820) paper.
+ Evaluated using the CantTalkAboutThis Dataset as introduced in the CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues.

 | Distractor F1 | On-topic F1 |
 | :----------------------- | :---------------------------- |
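
For readers scanning the table headers in the hunks above, here is a small, self-contained sketch of how the two IFEval columns (Prompt-Strict Acc and Instruction-Strict Acc) are typically aggregated from per-instruction pass/fail checks under strict verification. The data is made up for illustration, and this is not the evaluation code behind the numbers in this model card.

```python
# Illustrative aggregation of IFEval strict metrics (not NVIDIA's evaluation code).
# Each prompt carries one or more verifiable instructions; each bool records whether
# the model's response satisfied that instruction under strict checking.
results = [
    [True, True],         # prompt 1: both instructions followed
    [True, False, True],  # prompt 2: one instruction missed
    [False],              # prompt 3: single instruction missed
]

# Prompt-strict accuracy: fraction of prompts whose instructions are ALL followed.
prompt_strict = sum(all(flags) for flags in results) / len(results)

# Instruction-strict accuracy: fraction of individual instructions followed.
total_instructions = sum(len(flags) for flags in results)
instruction_strict = sum(sum(flags) for flags in results) / total_instructions

print(f"Prompt-Strict Acc: {prompt_strict:.3f}")            # 0.333
print(f"Instruction-Strict Acc: {instruction_strict:.3f}")  # 0.667
```

The per-prompt versus per-instruction distinction is why the two reported numbers can differ: a single missed instruction fails the whole prompt under the prompt-level metric but counts as only one miss under the instruction-level metric.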