hendrydong committed
Commit 4871bd2 · verified · 1 Parent(s): cdaa737

Update README.md

Files changed (1)
  1. README.md +20 -6
README.md CHANGED
@@ -10,12 +10,9 @@ and strong proprietary models (e.g., GPT-3.5-turbo-0613). The model is trained w
 
  ## Model Releases
  - [SFT model](https://huggingface.co/Salesforce/SFR-SFT-LLaMA-3-8B-R)
- - [Reward model](https://huggingface.co/Salesforce)
+ - [Reward model](https://huggingface.co/Salesforce/SFR-RM-LLaMA-3-8B-R)
  - [RLHF model](https://huggingface.co/Salesforce/SFR-Iterative-DPO-LLaMA-3-8B-R)
 
- ## Dataset Releases
- - [Preference data mix]()
- - [Prompt collection for RLHF training]()
 
  ## Training methods
  We have developed a simple and efficient online RLHF recipe for LLM instruct training. Our recipe is DPO-based and thus much cheaper and simpler to train and tune compared to PPO-based approaches.
@@ -95,6 +92,23 @@ We are committed to continuous improvement in our models to minimize such risks
 
  ## Citation
  Please cite our techical report if you find our model is useful for your research or product.
- ```
- @article{}
+
+ ```bibtex
+ @misc{dong2024rlhf,
+ title={RLHF Workflow: From Reward Modeling to Online RLHF},
+ author={Hanze Dong and Wei Xiong and Bo Pang and Haoxiang Wang and Han Zhao and Yingbo Zhou and Nan Jiang and Doyen Sahoo and Caiming Xiong and Tong Zhang},
+ year={2024},
+ eprint={2405.07863},
+ archivePrefix={arXiv},
+ primaryClass={cs.LG}
+ }
+
+ @misc{xiong2024iterative,
+ title={Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint},
+ author={Wei Xiong and Hanze Dong and Chenlu Ye and Ziqi Wang and Han Zhong and Heng Ji and Nan Jiang and Tong Zhang},
+ year={2024},
+ eprint={2312.11456},
+ archivePrefix={arXiv},
+ primaryClass={cs.LG}
+ }
  ```