Update README.md
README.md (changed)
@@ -22,7 +22,7 @@ This model is trained on the UltraFeedback dataset (using the per-aspect/fine-gr
 We used a 70B RM trained on the UltraFeedback dataset, and then used the UltraFeedback prompts during PPO training.
 
 For more details, read the paper:
-[Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://
+[Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://arxiv.org/abs/2406.09279).
 
 
 ## Model description
@@ -44,7 +44,7 @@ For more details, read the paper:
 
 Tulu V2.5 PPO is trained to be a generalist model, and matches or outperforms Tulu 2+DPO 13B.
 It even beats Tulu 2+DPO 70B in some cases, although it loses out in harder reasoning tasks.
-For details on training and evaluation, read [our paper](https://
+For details on training and evaluation, read [our paper](https://arxiv.org/abs/2406.09279)!
 
 
 | Model | Size | Alignment | GSM8k 8-shot CoT Acc. | AlpacaEval 2 Winrate (LC) | Average Perf. across Open-Instruct evals |
@@ -125,6 +125,7 @@ If you find Tulu 2.5 is useful in your work, please cite it with:
     title={{Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback}},
     author={{Hamish Ivison and Yizhong Wang and Jiacheng Liu and Ellen Wu and Valentina Pyatkin and Nathan Lambert and Yejin Choi and Noah A. Smith and Hannaneh Hajishirzi}},
     year={2024},
+    eprint={2406.09279},
     archivePrefix={arXiv},
     primaryClass={cs.CL}
 }
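For readers unfamiliar with the setup the diff refers to (a separate reward model trained on UltraFeedback, then PPO on UltraFeedback prompts), below is a minimal sketch of that general recipe using the `trl` library's classic `PPOTrainer` API (trl <= 0.11). Small public models (`gpt2` and `OpenAssistant/reward-model-deberta-v3-large-v2`) stand in for the Tulu SFT policy and the 70B UltraFeedback reward model so the snippet runs; this is not the code or configuration used to train Tulu 2.5, which is described in the linked paper.

```python
# Illustrative sketch only: PPO against a frozen reward model on prompt-only data,
# using trl's classic PPOTrainer API (trl <= 0.11). Small public models stand in
# for the Tulu SFT policy and the 70B UltraFeedback RM; this is NOT the authors'
# training code or configuration.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

policy_name = "gpt2"  # stand-in for the SFT policy (e.g. a Tulu 2 checkpoint)
rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # stand-in for the 70B RM

tokenizer = AutoTokenizer.from_pretrained(policy_name)
tokenizer.pad_token = tokenizer.eos_token
policy = AutoModelForCausalLMWithValueHead.from_pretrained(policy_name)
ref_policy = AutoModelForCausalLMWithValueHead.from_pretrained(policy_name)  # KL reference

rm_tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_name)

config = PPOConfig(model_name=policy_name, learning_rate=1e-6,
                   batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config, policy, ref_policy, tokenizer)

gen_kwargs = {"max_new_tokens": 64, "do_sample": True, "top_p": 1.0,
              "pad_token_id": tokenizer.eos_token_id}

def reward_for(prompt: str, response: str) -> torch.Tensor:
    """Scalar reward: the reward model's logit for this prompt/response pair."""
    inputs = rm_tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0, 0]

# Stand-in for the UltraFeedback prompt set (prompts only; no preference labels needed here).
prompts = ["Explain the difference between DPO and PPO in two sentences."]

for prompt in prompts:
    query = tokenizer(prompt, return_tensors="pt").input_ids[0]
    response = ppo_trainer.generate([query], return_prompt=False, **gen_kwargs)[0]
    reward = reward_for(prompt, tokenizer.decode(response, skip_special_tokens=True))
    stats = ppo_trainer.step([query], [response], [reward])  # one PPO update on this rollout
```

In this style of recipe the reward model stays frozen while PPO updates only the policy (and its value head), with a KL penalty against the reference model keeping generations close to the SFT distribution.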