Update README.md
README.md (CHANGED)
```diff
@@ -107,7 +107,7 @@ SteamSHP-XL gets an average 72.8% accuracy across all domains:
 
 
 
-
+## Biases and Limitations
 
 SteamSHP is trained to predict which of two responses humans will find *more helpful*, not which response is *less harmful*.
 It should not be used to detect toxicity, make ethical judgments, or for a similar purpose.
@@ -119,6 +119,8 @@ The responses that humans collectively found more helpful are also not guarantee
 The people whose preferences are captured in SHP and HH-RLHF are not representative of the broader population.
 Although specific demographic information is not available, overall, the Reddit users whose preferences are captured in SHP are disproportionately male and from developed, Western, and English-speaking countries (Pew Research).
 
+[Past work](https://www.anthropic.com/model-written-evals.pdf) by Anthropic has found that models optimized for human preference can be obsequious, at the expense of the truth.
+
 
 ## Contact
 
```
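The section added above notes that SteamSHP predicts which of two responses humans will find *more helpful*. As a minimal sketch of how such a pairwise comparison is typically posed to the model, the snippet below builds a single-string prompt from a post and two candidate responses. The `POST:` / `RESPONSE A:` / `RESPONSE B:` template and the expectation that the model generates "A" or "B" are assumptions based on the SteamSHP model card, not something stated in this diff; the helper name is hypothetical.

```python
# Hedged sketch: formatting a pairwise helpfulness comparison for a
# SteamSHP-style preference model. The prompt template below is an
# assumption taken from the SteamSHP model card; verify against the
# card before relying on it.

def format_steamshp_input(post: str, response_a: str, response_b: str) -> str:
    """Build the single-string input; the model is expected to generate
    'A' or 'B' for the response humans would find more helpful."""
    return (
        f"POST: {post}\n\n"
        f"RESPONSE A: {response_a}\n\n"
        f"RESPONSE B: {response_b}\n\n"
        "Which response is better? RESPONSE"
    )

prompt = format_steamshp_input(
    "How do I keep bread from going stale?",
    "Store it in the freezer and toast slices as needed.",
    "Just eat it faster.",
)
print(prompt.splitlines()[0])  # → POST: How do I keep bread from going stale?
```

Note that, per the limitations above, this comparison only reflects predicted *helpfulness* preferences; it says nothing about harmfulness or factuality.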