bpucla committed
Commit b715281 · verified · 1 Parent(s): fc5d28a

Update README.md

Files changed (1)
  1. README.md +4 -2
README.md CHANGED
@@ -6,7 +6,7 @@ license: cc-by-nc-nd-3.0
 ## Introduction
 We release a state-of-the-art instruct model of its class, **SFR-Iterative-DPO-LLaMA-3-8B-R**.
 On all three widely-used instruct model benchmarks: **Alpaca-Eval-V2**, **MT-Bench**, **Chat-Arena-Hard**, our model outperforms all models of similar size (e.g., LLaMA-3-8B-it), most large open-sourced models (e.g., Mixtral-8x7B-it),
-and strong proprietary models (e.g., GPT-3.5-turbo-0613). The model is trained with open-sourced datasets without any additional human- or GPT4-labeling.
+and strong proprietary models (e.g., GPT-3.5-turbo-0613). The model is trained with open-sourced datasets without any additional human-/GPT4-labeling.
 
 ## Model Releases
 - [SFT model](https://huggingface.co/Salesforce/SFR-SFT-LLaMA-3-8B-R)
@@ -18,7 +18,9 @@ and strong proprietary models (e.g., GPT-3.5-turbo-0613). The model is trained w
 - [Prompt collection for RLHF training]()
 
 ## Training methods
-The key to our training is iterative RLHF.
+We have developed a simple and efficient online RLHF recipe for LLM instruct training. Our recipe is DPO-based and thus much cheaper and simpler to train and tune compared to PPO-based approaches.
+Unlike widely-used offline DPO, the online component of our approach effectively mitigates distribution shifts during policy optimization.
+For a detailed exposition, please refer to our accompanying technical report.
 
 
 ## Chat Benchmarks
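
The "Training methods" text added in this commit names an iterative, DPO-based online recipe. For reference, the sketch below shows the standard DPO objective (Rafailov et al., 2023) that such a recipe builds on, with the online loop indicated in comments. It is a minimal illustration under stated assumptions, not the authors' released training code: `dpo_loss`, the tensor names, `generate`, `rank_pairs`, and `beta = 0.1` are hypothetical.

```python
# Minimal sketch of the DPO objective an iterative/online DPO recipe builds on.
# Illustrative only; not the released training code. The function name,
# tensor names, and beta=0.1 are assumptions for the example.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of (chosen, rejected) preference pairs.

    Each argument is the summed per-token log-probability of a response
    under the trainable policy or the frozen reference model.
    """
    # Implicit rewards: how much more the policy prefers each response
    # than the reference model does, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen response's implicit reward above the rejected one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# The "online" component, sketched at pseudocode level: each round draws
# fresh responses from the *current* policy, so the preference data stays
# on-policy instead of drifting away from the model being trained.
#
# for round in range(num_rounds):
#     responses = generate(policy, fresh_prompts)          # on-policy samples
#     pairs = rank_pairs(reward_model, responses)          # (chosen, rejected)
#     policy = train(policy, ref=snapshot(policy), loss=dpo_loss, data=pairs)
```

This is the design point the added text makes: because each round's preference pairs are sampled from the current policy rather than drawn from a fixed offline dataset, training stays on-policy, which is what mitigates the distribution shift that offline DPO encounters during policy optimization.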