kawine commited on
Commit
e53c18c
·
1 Parent(s): 83e89b3

Create README.md

  1. README.md +119 -0
README.md ADDED
@@ -0,0 +1,119 @@
+ ---
+ license: mit
+ datasets:
+ - stanfordnlp/SHP
+ language:
+ - en
+ metrics:
+ - accuracy
+ tags:
+ - human feedback
+ - rlhf
+ - preferences
+ - reddit
+ - preference model
+ - RL
+ ---
+
+ # SteamSHP
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+ SteamSHP is a preference model trained to predict human preferences, given some context and two possible responses.
+ It can be used for NLG evaluation or to train a smaller reward model for RLHF.
+
+ It is a FLAN-T5-xl model (3B parameters) finetuned on:
+ 1. The [Stanford Human Preferences Dataset (SHP)](https://huggingface.co/datasets/stanfordnlp/SHP), which contains aggregate human preferences sourced from 18 different communities on Reddit (e.g., `askculinary`, `legaladvice`).
+ 2. The helpfulness data in [Anthropic's HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset.
+
+ ## Training and Evaluation
+
+ SteamSHP was finetuned on only 125K of the 392K available training examples, since we found that:
+ 1. When the total input length exceeded the limit (512 tokens), the loss would not converge.
+ When possible, we crammed an example into 500 tokens by truncating the context as much as possible, though some examples would still not fit.
+ 2. Training on fewer preferences with a stronger signal led to better performance than training on all the preferences.
+ From the SHP dataset, we only used preferences where the more preferred comment was at least twice as preferred as the other (i.e., `score_ratio` >= 2) and used no more than 5 preferences from each context (i.e., `post_id`) to prevent overfitting.
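The SHP filtering rule described above can be sketched as follows. This is a minimal illustration, not the authors' training code: the function name `filter_preferences` is hypothetical, and the source does not say which 5 preferences were kept per post, so the sketch simply keeps the first 5 it encounters. Only the `score_ratio` and `post_id` field names come from the text.

```python
from collections import defaultdict


def filter_preferences(examples, min_ratio=2.0, max_per_post=5):
    """Keep strong-signal preferences, capped per post.

    Assumption: each example is a dict with `score_ratio` and `post_id`.
    Assumption: the first `max_per_post` qualifying examples per post are kept;
    the source does not state how the 5 were chosen.
    """
    kept = []
    per_post = defaultdict(int)  # number of examples kept so far for each post_id
    for example in examples:
        if example["score_ratio"] < min_ratio:
            continue  # weak preference signal: skip
        if per_post[example["post_id"]] >= max_per_post:
            continue  # this post already contributed its quota
        per_post[example["post_id"]] += 1
        kept.append(example)
    return kept
```

Capping the number of preferences per `post_id` keeps a single popular post from dominating the training set.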
+
+ We evaluated the model on the SHP and HH-RLHF test data using accuracy, but only on the examples that could be truncated to fit within 500 tokens (a total of 18621 examples).
+ SteamSHP gets an average 72.8% accuracy across all domains:
+
+ | Domain | Accuracy |
+ | ------ | -------- |
+ | askculinary | 0.7199 |
+ | askhr | 0.7743 |
+ | askdocs | 0.7210 |
+ | askanthropology | 0.7594 |
+ | asksciencefiction | 0.7283 |
+ | askacademia | 0.7442 |
+ | askengineers | 0.7183 |
+ | legaladvice | 0.8068 |
+ | explainlikeimfive | 0.7392 |
+ | askbaking | 0.6741 |
+ | askphysics | 0.8000 |
+ | askscience | 0.7114 |
+ | askphilosophy | 0.6907 |
+ | askvet | 0.7742 |
+ | changemyview | 0.7043 |
+ | askcarguys | 0.7568 |
+ | askhistorians | 0.7476 |
+ | asksocialscience | 0.7308 |
+ | anthropic (helpfulness) | 0.7310 |
+ | ALL | 0.7278 |
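A per-domain accuracy table like the one above could be computed along these lines. This is a sketch under assumptions: the function name `accuracy_by_domain` and the `(domain, predicted, gold)` record layout are hypothetical, and the source does not say how the `ALL` row was aggregated. The sketch pools all examples, which is consistent with `ALL` (0.7278) being lower than the unweighted mean of the 19 domain rows (about 0.7385).

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Compute accuracy per domain plus a pooled "ALL" accuracy.

    Assumption: `records` is an iterable of (domain, predicted, gold) tuples,
    where `predicted` and `gold` are labels such as "A" or "B".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for domain, predicted, gold in records:
        total[domain] += 1
        correct[domain] += int(predicted == gold)
    if not total:
        return {}  # no records: nothing to report
    scores = {domain: correct[domain] / total[domain] for domain in total}
    # Pooled accuracy weights each example equally, not each domain.
    scores["ALL"] = sum(correct.values()) / sum(total.values())
    return scores
```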
+
+ ## Usage
+
+ Here's how to load the model:
+
+ ```python
+ from transformers import T5ForConditionalGeneration, T5Tokenizer
+
+ # Load the tokenizer and the finetuned FLAN-T5-xl weights from the Hugging Face Hub.
+ tokenizer = T5Tokenizer.from_pretrained('stanfordnlp/SteamSHP-preference-model')
+ model = T5ForConditionalGeneration.from_pretrained('stanfordnlp/SteamSHP-preference-model')
+ ```
+
+ The input text should have the following format:
+
+ ```
+ POST: { the context, such as the 'history' column in SHP }
+
+ RESPONSE A: { first possible continuation }
+
+ RESPONSE B: { second possible continuation }
+
+ Which response is better? RESPONSE
+ ```
+
+ The output generated by SteamSHP will be either `A` or `B`.
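Putting the template and the loaded model together might look like the sketch below. The helper names `build_prompt` and `predict_preference` are hypothetical, and the exact whitespace between the template parts (one blank line here) is an assumption read off the format shown above. `model` and `tokenizer` are the objects loaded in the previous snippet.

```python
def build_prompt(post, response_a, response_b):
    """Fill the SteamSHP input template (blank-line separators are an assumption)."""
    return (
        f"POST: {post}\n\n"
        f"RESPONSE A: {response_a}\n\n"
        f"RESPONSE B: {response_b}\n\n"
        "Which response is better? RESPONSE"
    )


def predict_preference(model, tokenizer, post, response_a, response_b):
    """Return the model's generated label, expected to be "A" or "B"."""
    prompt = build_prompt(post, response_a, response_b)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    # A single new token is enough, since the answer is one letter.
    output_ids = model.generate(input_ids, max_new_tokens=1)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `predict_preference(model, tokenizer, post, a, b)` requires the full 3B-parameter model to be loaded, so it is slow on CPU.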
+
+ If the input exceeds the 512 token limit, you can use [pySBD](https://github.com/nipunsadvilkar/pySBD) to break the input up into sentences and include only as many sentences as fit within 512 tokens.
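One way to do this sentence-level truncation is sketched below. The function name `truncate_post` and its parameters are hypothetical. The sketch keeps leading sentences of the post until a token budget is used up; the source does not say which sentences to keep, so that choice is an assumption. The splitter and the token counter are passed in as callables so that the helper does not depend on any particular library.

```python
def truncate_post(post, split_sentences, count_tokens, budget):
    """Keep leading sentences of `post` while their token counts fit in `budget`.

    Assumption: `split_sentences(text)` returns a list of sentence strings and
    `count_tokens(sentence)` returns an integer token count.
    Summing per-sentence counts only approximates the length of the joined text.
    """
    kept = []
    used = 0
    for sentence in split_sentences(post):
        cost = count_tokens(sentence)
        if used + cost > budget:
            break  # the next sentence would overflow the budget
        kept.append(sentence.strip())
        used += cost
    return " ".join(kept)
```

With pySBD, `split_sentences` could be `pysbd.Segmenter(language="en", clean=False).segment`, and `count_tokens` could be `lambda s: len(tokenizer(s).input_ids)`. The `budget` should be 512 minus the tokens taken by the two responses and the fixed template text.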
+
+
+ ## Biases and Limitations
+
+ Biases in the datasets used to train SteamSHP may be propagated downstream to the model predictions.
+ Although SHP filtered out posts with NSFW (over 18) content and chose subreddits that were well-moderated and had policies against harassment and bigotry, some of the data may contain discriminatory or harmful language.
+ Reddit users on the subreddits covered by SHP are also not representative of the broader population. They are disproportionately from developed, Western, and English-speaking countries.
+
+ It is also worth noting that the more preferred response in SHP or HH-RLHF is not necessarily the more correct one; it only reflects a preference.
+ [Past work](https://www.anthropic.com/model-written-evals.pdf) by Anthropic has found that models optimized for human preference can be obsequious, at the expense of the truth.
+
+
+ ## Contact
+
+ Please contact [email protected] if you have any questions about the model.
+ This model was created by Kawin Ethayarajh, Heidi (Chenyu) Zhang, Yizhong Wang, and Dan Jurafsky.
+
+
+ ## Citation
+
+ We will have a paper out soon, but until then, please cite:
+
+ ```
+ @online{SHP,
+ author = {Ethayarajh, Kawin and Zhang, Heidi and Wang, Yizhong and Jurafsky, Dan},
+ title = {Stanford Human Preferences Dataset},
+ year = 2023,
+ url = {https://huggingface.co/datasets/stanfordnlp/SHP},
+ }
+ ```