---
library_name: transformers
tags: []
---
# Model Card for RLHFlow/RewardModel-Mistral-7B-for-DPA-v1
<!-- Provide a quick summary of what the model is/does. -->
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
- **Developed by:** Haoxiang Wang
- **Model type:** Sequence Classifier
- **Language(s) (NLP):** English
- **License:** Apache-2.0
- **Finetuned from model [optional]:** https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
### Model Sources [optional]
<!-- Provide the basic links for the model. -->
- **Repository:** https://github.com/RLHFlow/directional-preference-alignment
- **Paper [optional]:** https://arxiv.org/abs/2402.18571
## How to Get Started with the Model
Use the code below to get started with the model.
The model produces a 10-dimensional output, with one score for each of the following attributes from HelpSteer and UltraFeedback, in this order:
['helpsteer-helpfulness', 'helpsteer-correctness', 'helpsteer-coherence', 'helpsteer-complexity', 'helpsteer-verbosity', 'ultrafeedback-overall_score', 'ultrafeedback-instruction_following', 'ultrafeedback-truthfulness', 'ultrafeedback-honesty', 'ultrafeedback-helpfulness']
Here is some sample code that you can try:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
device = 'cuda'
path = "RLHFlow/RewardModel-Mistral-7B-for-DPA-v1"
rm = AutoModelForSequenceClassification.from_pretrained(path, trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(path)
input_template = "[INST] You must read the following conversation carefully and rate the assistant's response from score 0-100 in these aspects: helpfulness, correctness, coherence, honesty, complexity, verbosity\n\nUser: {prompt}\n\nAssistant: {response} [/INST]"
# Use a sample from HelpSteer validation set
prompt = 'What are some synonyms for the word "beautiful"?'
response = "Nicely, Beautifully, Handsome, Stunning, Wonderful, Gorgeous, Pretty, Stunning, Elegant"
model_inputs = tokenizer(input_template.format(prompt=prompt, response=response), return_tensors="pt").to(device)
with torch.no_grad():
    score = rm(**model_inputs).logits.squeeze().cpu().float().numpy()
print(score)
# [68.99269 69.62718 76.23071 33.48785 35.853596 63.833366 55.58917 68.7175 59.552124 46.465595]
# Convert from our scale (0-100) to HelpSteer scale (0-4)
helpsteer_rewards_pred = (score[:5]-10)/20
print(helpsteer_rewards_pred)
# [2.9496346 2.981359 3.3115356 1.1743925 1.2926798]
# The actual rewards from the HelpSteer dataset for this sample are [3,3,4,2,2]
```
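The raw output is an unlabeled vector, so it can help to pair each score with its attribute name. The following is a minimal sketch, assuming the output order matches the attribute list above (as the `score[:5]` slice in the sample suggests). The helper names `label_scores` and `to_helpsteer_scale` are hypothetical and not part of the model's API; the scores are copied from the sample output above so the snippet runs without loading the model.

```python
# Attribute order as listed in this model card (assumed to match the output order).
ATTRIBUTES = [
    "helpsteer-helpfulness", "helpsteer-correctness", "helpsteer-coherence",
    "helpsteer-complexity", "helpsteer-verbosity",
    "ultrafeedback-overall_score", "ultrafeedback-instruction_following",
    "ultrafeedback-truthfulness", "ultrafeedback-honesty",
    "ultrafeedback-helpfulness",
]


def label_scores(score):
    """Hypothetical helper: map each of the 10 outputs to its attribute name."""
    values = [float(s) for s in score]
    if len(values) != len(ATTRIBUTES):
        raise ValueError(f"expected {len(ATTRIBUTES)} scores, got {len(values)}")
    return dict(zip(ATTRIBUTES, values))


def to_helpsteer_scale(score):
    """Hypothetical helper: convert the five HelpSteer outputs from the
    0-100 scale to the HelpSteer 0-4 scale using (s - 10) / 20."""
    return [(float(s) - 10) / 20 for s in list(score)[:5]]


# Scores copied from the sample output printed in the snippet above.
sample_score = [68.99269, 69.62718, 76.23071, 33.48785, 35.853596,
                63.833366, 55.58917, 68.7175, 59.552124, 46.465595]

labeled = label_scores(sample_score)
helpsteer = to_helpsteer_scale(sample_score)

for name, value in labeled.items():
    print(f"{name}: {value:.2f}")
print([round(v, 2) for v in helpsteer])
```

Both helpers also accept the NumPy array returned by the sample code, since they only iterate over the values.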
## Training
![image/png](https://github.com/RLHFlow/directional-preference-alignment/raw/main/assets/preference-conflict.jpg)
![image/png](https://github.com/RLHFlow/directional-preference-alignment/raw/main/assets/algo-illustration.jpg)
## Citation
**BibTeX:**
If you find this work useful for your research, please consider citing our paper:
```
@article{wang2024arithmetic,
title={Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards},
author={Haoxiang Wang and Yong Lin and Wei Xiong and Rui Yang and Shizhe Diao and Shuang Qiu and Han Zhao and Tong Zhang},
year={2024},
eprint={2402.18571},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
## Model Card Authors
Haoxiang Wang
## Model Card Contact
[email protected]