kawine committed on
Commit ee33554 · 1 Parent(s): a3ed2b5

Update README.md

Files changed (1)
  1. README.md +39 -1
README.md CHANGED
@@ -22,7 +22,7 @@ tags:
 <!-- Provide a quick summary of what the model is/does. -->
 
 SteamSHP-XL is a preference model trained to predict -- given some context and two possible responses -- which response humans will find more helpful.
- It can be used for NLG evaluation, question-answering evaluation, or to train a smaller reward model for RLHF.
+ It can be used for NLG evaluation or as a reward model for RLHF.
 
 It is a FLAN-T5-xl model (3B parameters) finetuned on:
 1. The [Stanford Human Preferences Dataset (SHP)](https://huggingface.co/datasets/stanfordnlp/SHP), which contains collective human preferences sourced from 18 different communities on Reddit (e.g., `askculinary`, `legaladvice`, etc.).
@@ -34,6 +34,8 @@ Despite being 1/4 of the size, it is on average only 0.75 points less accurate o
 
 ## Usage
 
+ ### Normal Usage
+
 The input text should be of the format:
 
 ```
@@ -68,6 +70,40 @@ Here's how to use the model:
 If the input exceeds the 512 token limit, you can use [pySBD](https://github.com/nipunsadvilkar/pySBD) to break the input up into sentences and only include what fits into 512 tokens.
 When trying to cram an example into 512 tokens, we recommend truncating the context as much as possible and leaving the responses as untouched as possible.
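For illustration, here is a minimal sketch of that truncation strategy, assuming the SteamSHP input template shown in the Reward Model Usage hunk below and a Hugging Face `tokenizer` like the one loaded in the usage example; the `truncate_context` helper and its argument names are hypothetical, not part of the model card:

```python
import pysbd


def truncate_context(context, response_a, response_b, tokenizer, max_tokens=512):
    """Keep as many leading context sentences as fit in the token budget,
    leaving the two responses untouched (per the recommendation above)."""
    seg = pysbd.Segmenter(language="en", clean=False)
    sentences = seg.segment(context)  # split the context into sentences

    template = ("POST: {post}\n\n RESPONSE A: {a}\n\n RESPONSE B: {b}\n\n "
                "Which response is better? RESPONSE")
    kept = []
    for sentence in sentences:
        candidate = template.format(post=" ".join(kept + [sentence]),
                                    a=response_a, b=response_b)
        # stop adding context sentences once the full input would exceed the limit
        if len(tokenizer(candidate).input_ids) > max_tokens:
            break
        kept.append(sentence)

    return template.format(post=" ".join(kept), a=response_a, b=response_b)
```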
 
+ ### Reward Model Usage
+
+ If you want to use SteamSHP-XL as a reward model -- to get a score for a single response -- you need to structure the input so that RESPONSE A is the response you want to score and RESPONSE B is just an empty input:
+
+ ```
+ POST: { the context, such as the 'history' column in SHP }
+
+ RESPONSE A: { continuation }
+
+ RESPONSE B: .
+
+ Which response is better? RESPONSE
+ ```
+
+ Then calculate the probability assigned to the label A.
+ This probability (or the logit, depending on what you want) is the score for the response:
+
+ ```python
+ >> input_text = "POST: Instacart gave me 50 pounds of limes instead of 5 pounds... what the hell do I do with 50 pounds of limes? I've already donated a bunch and gave a bunch away. I'm planning on making a bunch of lime-themed cocktails, but... jeez. Ceviche? \n\n RESPONSE A: Lime juice, and zest, then freeze in small quantities.\n\n RESPONSE B: .\n\n Which response is better? RESPONSE"
+ >> x = tokenizer([input_text], return_tensors='pt').input_ids.to(device)
+ >> outputs = model.generate(x, return_dict_in_generate=True, output_scores=True, max_new_tokens=1)
+ >> (torch.exp(outputs.scores[0][:, 71]) / torch.exp(outputs.scores[0][:, :]).sum(axis=1)).item()  # index 71 corresponds to the token for 'A'
+ 0.819
+ ```
+
+ The probability will almost always be high (in the range of 0.8 to 1.0), since RESPONSE B is just a null input.
+ Therefore you may want to normalize the probability.
+
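The model card leaves the normalization method open; as one hypothetical option (an assumption, not something prescribed above), you could linearly rescale the raw probability using the 0.8 to 1.0 range quoted in the previous sentence:

```python
def normalize_score(p, lower=0.8, upper=1.0):
    """Rescale a raw single-response score into [0, 1].

    The 0.8-1.0 bounds are an assumption based on the range quoted above;
    tune them on your own data.
    """
    return max(0.0, min(1.0, (p - lower) / (upper - lower)))


print(normalize_score(0.819))  # the lime example above -> roughly 0.095
```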
+ You can also compare the two probabilities assigned independently to each response (given the same context) to infer the preference label.
+ For example, if one response has a probability of 0.95 and the other has 0.80, the former will be preferred.
+ Inferring the preference label in this way only leads to a 0.5-point drop in accuracy on the SHP + HH-RLHF test data (on average across all domains), meaning that there is only a very small penalty for using SteamSHP as a reward model instead of as a preference model.
+
+
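To make that comparison concrete, here is a minimal sketch that scores each response independently (with a null RESPONSE B, exactly as in the snippet above) and picks the higher-scoring one; `score_response` and the placeholder inputs are hypothetical, while `tokenizer`, `model`, and `device` are assumed to be the objects set up earlier in the Usage section:

```python
import torch


def score_response(post, response, tokenizer, model, device):
    # Single-response scoring as in the snippet above: RESPONSE B is a null input
    # and the score is the probability assigned to the token 'A' (index 71).
    input_text = (f"POST: {post}\n\n RESPONSE A: {response}\n\n "
                  f"RESPONSE B: .\n\n Which response is better? RESPONSE")
    x = tokenizer([input_text], return_tensors='pt').input_ids.to(device)
    outputs = model.generate(x, return_dict_in_generate=True,
                             output_scores=True, max_new_tokens=1)
    return torch.softmax(outputs.scores[0][0], dim=-1)[71].item()


# Placeholder inputs -- replace with your own context and candidate responses.
post = "{ the context, such as the 'history' column in SHP }"
response_a = "{ first candidate response }"
response_b = "{ second candidate response }"

score_a = score_response(post, response_a, tokenizer, model, device)
score_b = score_response(post, response_b, tokenizer, model, device)
preferred = 'A' if score_a > score_b else 'B'  # e.g., 0.95 vs 0.80 -> 'A'
```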
 
 ## Training and Evaluation
 
@@ -105,6 +141,8 @@ SteamSHP-XL gets an average 72.8% accuracy across all domains:
 | anthropic (helpfulness) | 0.7310 |
 | ALL (unweighted) | 0.7278 |
 
+ As mentioned previously, if you use SteamSHP as a reward model and infer the preference label from the probability assigned to each response independently, that also works.
+ But doing so leads to a 0.5-point drop in accuracy on the test data (on average across all domains), meaning that there is a small penalty.
 
 
 ## Biases and Limitations