Update README.md
README.md
@@ -241,7 +241,7 @@ Evaluated using MT-Bench judging by GPT-4-0125-Preview as described in Appendix

 #### IFEval

-Evaluated using the Instruction Following Eval (IFEval) introduced in
+Evaluated using the Instruction Following Eval (IFEval) introduced in Instruction-Following Evaluation for Large Language Models.

 | Prompt-Strict Acc | Instruction-Strict Acc |
 | :----------------------- | :---------------------------- |
@@ -249,7 +249,7 @@ Evaluated using the Instruction Following Eval (IFEval) introduced in [Instructi

 #### MMLU

-Evaluated using the Multi-task Language Understanding benchmarks as introduced in
+Evaluated using the Multi-task Language Understanding benchmarks as introduced in Measuring Massive Multitask Language Understanding.

 | MMLU 0-shot |
 | :----------------- |
@@ -257,7 +257,7 @@ Evaluated using the Multi-task Language Understanding benchmarks as introduced i

 #### GSM8K

-Evaluated using the Grade School Math 8K (GSM8K) benchmark as introduced in
+Evaluated using the Grade School Math 8K (GSM8K) benchmark as introduced in Training Verifiers to Solve Math Word Problems.

 | GSM8K 0-shot |
 | :----------------- |
@@ -265,7 +265,7 @@ Evaluated using the Grade School Math 8K (GSM8K) benchmark as introduced in [Tra

 #### HumanEval

-Evaluated using the HumanEval benchmark as introduced in
+Evaluated using the HumanEval benchmark as introduced in Evaluating Large Language Models Trained on Code.


 | HumanEval 0-shot |
@@ -274,7 +274,7 @@ Evaluated using the HumanEval benchmark as introduced in [Evaluating Large Langu

 #### MBPP

-Evaluated using the MBPP Dataset as introduced in the
+Evaluated using the MBPP Dataset as introduced in Program Synthesis with Large Language Models.

 | MBPP 0-shot |
 | :----------------- |
@@ -283,7 +283,7 @@ Evaluated using the MBPP Dataset as introduced in the [Program Synthesis with La

 #### Arena Hard

-Evaluated using the
+Evaluated using the Arena-Hard Pipeline from the LMSYS Org.

 | Arena Hard |
 | :----------------- |
@@ -291,7 +291,7 @@ Evaluated using the [Arena-Hard Pipeline](https://lmsys.org/blog/2024-04-19-aren

 #### AlpacaEval 2.0 LC

-Evaluated using the AlpacaEval 2.0 LC (Length Controlled) as introduced in the paper:
+Evaluated using AlpacaEval 2.0 LC (Length Controlled) as introduced in the paper Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators.

 | AlpacaEval 2.0 LC |
 | :----------------- |
@@ -300,7 +300,7 @@ Evaluated using the AlpacaEval 2.0 LC (Length Controlled) as introduced in the p

 #### TFEval

-Evaluated using the CantTalkAboutThis Dataset as introduced in the
+Evaluated using the CantTalkAboutThis Dataset as introduced in CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues.

 | Distractor F1 | On-topic F1 |
 | :----------------------- | :---------------------------- |
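For context on the 0-shot scores listed in these sections, below is a minimal reproduction sketch. It assumes EleutherAI's lm-evaluation-harness (which this README does not name) with its built-in `ifeval`, `mmlu`, and `gsm8k` tasks; the model identifier is a placeholder, not the model this card describes. Arena Hard, AlpacaEval 2.0 LC, MT-Bench, and TFEval use their own judging pipelines and are not covered by this sketch.

```python
# Hypothetical sketch: 0-shot evaluation via EleutherAI's lm-evaluation-harness.
# The model id below is a placeholder for illustration only.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face transformers backend
    model_args="pretrained=your-org/your-model",  # placeholder model id
    tasks=["ifeval", "mmlu", "gsm8k"],            # 0-shot tasks reported above
    num_fewshot=0,
    batch_size=8,
)

# Print the per-task metric dictionaries (e.g. IFEval prompt-/instruction-level
# strict accuracy, MMLU accuracy, GSM8K exact match).
for task, metrics in results["results"].items():
    print(task, metrics)
```

Exact numbers may differ from the tables above depending on prompt template, decoding settings, and harness version, so treat this only as a starting point.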