Update README.md
README.md
@@ -241,7 +241,7 @@ Evaluated using MT-Bench judging by GPT-4-0125-Preview as described in Appendix

 #### IFEval

-Evaluated using the Instruction Following Eval (IFEval) introduced in
+Evaluated using the Instruction Following Eval (IFEval) introduced in Instruction-Following Evaluation for Large Language Models.

 | Prompt-Strict Acc | Instruction-Strict Acc |
 | :----------------------- | :---------------------------- |
@@ -249,7 +249,7 @@ Evaluated using the Instruction Following Eval (IFEval) introduced in [Instructi

 #### MMLU

-Evaluated using the Multi-task Language Understanding benchmarks as introduced in
+Evaluated using the Multi-task Language Understanding benchmarks as introduced in Measuring Massive Multitask Language Understanding.

 | MMLU 0-shot |
 | :----------------- |
@@ -257,7 +257,7 @@ Evaluated using the Multi-task Language Understanding benchmarks as introduced i

 #### GSM8K

-Evaluated using the Grade School Math 8K (GSM8K) benchmark as introduced in
+Evaluated using the Grade School Math 8K (GSM8K) benchmark as introduced in Training Verifiers to Solve Math Word Problems.

 | GSM8K 0-shot |
 | :----------------- |
@@ -265,7 +265,7 @@ Evaluated using the Grade School Math 8K (GSM8K) benchmark as introduced in [Tra

 #### HumanEval

-Evaluated using the HumanEval benchmark as introduced in
+Evaluated using the HumanEval benchmark as introduced in Evaluating Large Language Models Trained on Code.


 | HumanEval 0-shot |
@@ -274,7 +274,7 @@ Evaluated using the HumanEval benchmark as introduced in [Evaluating Large Langu

 #### MBPP

-Evaluated using the MBPP Dataset as introduced in the
+Evaluated using the MBPP Dataset as introduced in Program Synthesis with Large Language Models.

 | MBPP 0-shot |
 | :----------------- |
@@ -283,7 +283,7 @@ Evaluated using the MBPP Dataset as introduced in the [Program Synthesis with La

 #### Arena Hard

-Evaluated using the
+Evaluated using the Arena-Hard Pipeline from the LMSYS Org.

 | Arena Hard |
 | :----------------- |
@@ -291,7 +291,7 @@ Evaluated using the [Arena-Hard Pipeline](https://lmsys.org/blog/2024-04-19-aren

 #### AlpacaEval 2.0 LC

-Evaluated using the AlpacaEval 2.0 LC (Length Controlled) as introduced in the paper:
+Evaluated using AlpacaEval 2.0 LC (Length Controlled) as introduced in the paper Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators.

 | AlpacaEval 2.0 LC |
 | :----------------- |
@@ -300,7 +300,7 @@ Evaluated using the AlpacaEval 2.0 LC (Length Controlled) as introduced in the p

 #### TFEval

-Evaluated using the CantTalkAboutThis Dataset as introduced in the
+Evaluated using the CantTalkAboutThis Dataset as introduced in CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues.

 | Distractor F1 | On-topic F1 |
 | :----------------------- | :---------------------------- |
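For context on the 0-shot scores listed in these sections, below is a minimal reproduction sketch. It assumes EleutherAI's lm-evaluation-harness (which this README does not name) with its built-in `ifeval`, `mmlu`, and `gsm8k` tasks; the model identifier is a placeholder, not the model this card describes. Arena Hard, AlpacaEval 2.0 LC, MT-Bench, and TFEval use their own judging pipelines and are not covered by this sketch.

```python
# Hypothetical sketch: 0-shot evaluation via EleutherAI's lm-evaluation-harness.
# The model id below is a placeholder for illustration only.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face transformers backend
    model_args="pretrained=your-org/your-model",  # placeholder model id
    tasks=["ifeval", "mmlu", "gsm8k"],            # 0-shot tasks reported above
    num_fewshot=0,
    batch_size=8,
)

# Print the per-task metric dictionaries (e.g. IFEval prompt-/instruction-level
# strict accuracy, MMLU accuracy, GSM8K exact match).
for task, metrics in results["results"].items():
    print(task, metrics)
```

Exact numbers may differ from the tables above depending on prompt template, decoding settings, and harness version, so treat this only as a starting point.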