## Introduction

We release a state-of-the-art instruct model of its class, **SFR-Iterative-DPO-LLaMA-3-8B-R**. On all three widely-used instruct model benchmarks, **Alpaca-Eval-V2**, **MT-Bench**, and **Chat-Arena-Hard**, our model outperforms all models of similar size (e.g., LLaMA-3-8B-it), most large open-source models (e.g., Mixtral-8x7B-it), and strong proprietary models (e.g., GPT-3.5-turbo-0613). The model is trained with open-source datasets, without any additional human or GPT-4 labeling.

## Model Releases

- [SFT model](https://huggingface.co/Salesforce/SFR-SFT-LLaMA-3-8B-R)
- [Prompt collection for RLHF training]()

## Training methods

We have developed a simple and efficient online RLHF recipe for LLM instruct training. Our recipe is DPO-based, and is therefore much cheaper and simpler to train and tune than PPO-based approaches. Unlike the widely-used offline DPO, the online component of our approach effectively mitigates distribution shifts during policy optimization. For a detailed exposition, please refer to our accompanying technical report.
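To make the recipe concrete, below is a minimal sketch of the per-pair DPO objective that this kind of training optimizes. It is an illustration of the standard DPO loss in plain Python, not the released training code; the function names and the example value `beta = 0.1` are ours, and in the online variant the preference pairs fed to this loss are regenerated from the current policy at each iteration rather than fixed up front.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair (chosen y_w, rejected y_l).

    The policy is pushed to widen the log-ratio margin between the
    chosen and rejected responses relative to the reference model:
    loss = -log sigmoid(beta * (r_chosen - r_rejected)),
    where r = log pi_policy(y|x) - log pi_ref(y|x).
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(sigmoid(margin))

# A wider margin in favor of the chosen response lowers the loss.
uninformed = dpo_loss(-10.0, -10.0, -10.0, -10.0)  # zero margin -> ln 2
aligned = dpo_loss(-8.0, -12.0, -10.0, -10.0)      # positive margin
assert aligned < uninformed
```

The online component is what distinguishes this recipe from offline DPO: instead of computing this loss over a static preference dataset, each iteration samples fresh responses from the current policy, ranks them, and optimizes the loss on those pairs, so the training distribution tracks the policy as it improves.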

## Chat Benchmarks