kawine committed
Commit ef3cee9 · 1 parent: e4341f1

Update README.md

Files changed (1):
  1. README.md (+3 -1)
README.md CHANGED
```diff
@@ -107,7 +107,7 @@ SteamSHP-XL gets an average 72.8% accuracy across all domains:
 
 
 
-### Biases and Limitations
+## Biases and Limitations
 
 SteamSHP is trained to predict which of two responses humans will find *more helpful*, not which response is *less harmful*.
 It should not be used to detect toxicity, make ethical judgments, or for a similar purpose.
@@ -119,6 +119,8 @@ The responses that humans collectively found more helpful are also not guarantee
 The people whose preferences are captured in SHP and HH-RLHF are not representative of the broader population.
 Although specific demographic information is not available, overall, the Reddit users whose preferences are captured in SHP are disproportionately male and from developed, Western, and English-speaking countries (Pew Research).
 
+[Past work](https://www.anthropic.com/model-written-evals.pdf) by Anthropic has found that models optimized for human preference can be obsequious, at the expense of the truth.
+
 
 ## Contact
```