Jingjing Li
committed on
Commit
•
516a72f
1
Parent(s):
16b3f46
chat template and model card
- README.md +74 -0
- tokenizer.json +0 -0
- tokenizer_config.json +1 -1
README.md
CHANGED
@@ -1,3 +1,77 @@
|
|
1 |
---
|
2 |
license: apache-2.0
3 |
---
1 |
---
|
2 |
license: apache-2.0
|
3 |
+
datasets:
|
4 |
+
- jl3676/SafetyAnalystData
|
5 |
+
language:
|
6 |
+
- en
|
7 |
+
tags:
|
8 |
+
- safety
|
9 |
+
- moderation
|
10 |
+
- llm
|
11 |
+
- lm
|
12 |
+
- benefits
|
13 |
---
|
14 |
+
# Model Card for BenefitReporter
|
15 |
+
|
16 |
+
|
17 |
+
BenefitReporter is an open language model that generates a structured "benefit tree" for a given prompt. The benefit tree consists of the following features:
|
18 |
+
1) stakeholders (individuals, groups, communities, and entities) that may be impacted by the prompt scenario,
|
19 |
+
2) categories of beneficial *actions* that may impact each stakeholder,
|
20 |
+
3) categories of beneficial *effects* that each beneficial action may have on the stakeholder, and
|
21 |
+
4) the *likelihood*, *severity*, and *immediacy* of each beneficial effect.
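
As a rough illustration of the four features above, the sketch below walks a small hand-written benefit tree and flattens it into one row per effect. The field names (`stakeholder`, `benefits`, `action`, `effects`, `effect`, `immediacy`, `extent`, `likelihood`) follow the JSON format given in the model's chat template; the example values and the `flatten_benefit_tree` helper are hypothetical and are not model output.

```python
import json

# Allowed labels, as listed in the chat template's definitions.
EXTENTS = {"minor", "significant", "substantial", "major"}
LIKELIHOODS = {"low", "medium", "high"}

# Hypothetical, hand-written example; real trees are far longer.
EXAMPLE_TREE = json.loads("""
[
  {
    "stakeholder": "User",
    "benefits": [
      {
        "action": "Learning how to budget",
        "effects": [
          {"effect": "7. Financial property gains", "immediacy": false,
           "extent": "Minor", "likelihood": "Medium"},
          {"effect": "10. Gain of accurate information access", "immediacy": true,
           "extent": "Significant", "likelihood": "High"}
        ]
      }
    ]
  }
]
""")


def flatten_benefit_tree(tree):
    """Return one (stakeholder, action, effect, immediacy, extent, likelihood) row per effect."""
    rows = []
    for node in tree:
        for benefit in node["benefits"]:
            for eff in benefit["effects"]:
                extent = eff["extent"].lower()
                likelihood = eff["likelihood"].lower()
                # Reject labels outside the sets defined in the chat template.
                if extent not in EXTENTS or likelihood not in LIKELIHOODS:
                    raise ValueError(f"unexpected label in effect: {eff}")
                rows.append((node["stakeholder"], benefit["action"], eff["effect"],
                             eff["immediacy"], extent, likelihood))
    return rows


rows = flatten_benefit_tree(EXAMPLE_TREE)
for row in rows:
    print(row)
```

Flattening to rows like this makes it easier to filter or count effects per stakeholder. Whether the model emits `immediacy` as a JSON boolean or as a string is not confirmed here, so the helper passes the value through unchanged.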
|
22 |
+
|
23 |
+
|
24 |
+
BenefitReporter is an 8B model trained on [SafetyAnalystData](https://huggingface.co/datasets/jl3676/SafetyAnalystData). Its output can be combined with the output of [HarmReporter](https://huggingface.co/jl3676/HarmReporter) to generate a comprehensive harm-benefit tree for a given prompt. Collectively, BenefitReporter and HarmReporter are named SafetyReporter.
|
25 |
+
|
26 |
+
Evaluated on five public prompt harmfulness classification benchmarks, SafetyReporter outperforms the strongest existing open-source baselines (including WildGuard, Llama-Guard-3, and ShieldGemma) on average F1 score across benchmarks, surpassing WildGuard by 3.7%, Llama-Guard-3 by 10.8%, and ShieldGemma by over 30%.
|
27 |
+
|
28 |
+
For more details, read the paper: [SafetyAnalyst: Interpretable, transparent, and steerable LLM safety moderation](https://arxiv.org/abs/2410.16665v1).
|
29 |
+
|
30 |
+
## Model Description
|
31 |
+
|
32 |
+
- **Model type:** Benefit feature generation model trained on synthetic datasets from a mixture of SOTA LLMs.
|
33 |
+
- **Language(s):** English
|
34 |
+
- **License:** Apache 2.0
|
35 |
+
- **Finetuned from model:** [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
|
36 |
+
|
37 |
+
|
38 |
+
## How to Use BenefitReporter
|
39 |
+
|
40 |
+
Outputs from BenefitReporter can be generated using the following code snippet:
|
41 |
+
|
42 |
+
```python
|
43 |
+
from transformers import AutoTokenizer, AutoModelForCausalLM
|
44 |
+
|
45 |
+
model_id = "jl3676/BenefitReporter"
|
46 |
+
tokenizer = AutoTokenizer.from_pretrained(model_id)
|
47 |
+
model = AutoModelForCausalLM.from_pretrained(model_id)
|
48 |
+
|
49 |
+
text_input = ["How can I rob the bank?"]
|
50 |
+
input_tokenized = tokenizer.apply_chat_template(text_input, return_tensors="pt")  # wrap the prompt in the model's chat template and tokenize it
|
51 |
+
output = model.generate(input_tokenized, max_new_tokens=19000)  # large budget: benefit trees can run to thousands of tokens
|
52 |
+
```
|
53 |
+
|
54 |
+
However, because the benefit trees generated by BenefitReporter are very long, we recommend using the [vllm](https://github.com/vllm-project/vllm) library to generate the outputs.
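
A minimal sketch of what vLLM-based generation might look like is shown below. It assumes vLLM is installed with a suitable GPU, and that `rendered_prompts` are strings that have already been passed through the tokenizer's chat template (for example with `tokenize=False`). The helper names `extract_texts` and `generate_benefit_trees` are hypothetical, and greedy decoding is an assumption rather than a documented setting.

```python
from types import SimpleNamespace


def extract_texts(request_outputs):
    """Take the first completion's text from each vLLM RequestOutput-like object."""
    return [out.outputs[0].text for out in request_outputs]


def generate_benefit_trees(rendered_prompts, model_id="jl3676/BenefitReporter",
                           max_new_tokens=19000):
    """Generate one benefit tree string per already-templated prompt.

    Assumes vLLM is installed and a suitable GPU is available.
    """
    # Imported lazily so the helpers above can be loaded without vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_id)
    # Greedy decoding is an assumption here, not a documented setting.
    params = SamplingParams(temperature=0.0, max_tokens=max_new_tokens)
    return extract_texts(llm.generate(rendered_prompts, params))


# Dry run of the extraction helper on stand-in objects (no model needed).
fake = [SimpleNamespace(outputs=[SimpleNamespace(text='[{"stakeholder": "User"}]')])]
texts = extract_texts(fake)
print(texts)
```

vLLM batches the prompts internally, so passing many prompts in one `generate` call is usually preferable to looping over them. If GPU memory is tight, the `max_model_len` argument of `LLM` can be lowered, as long as it still leaves room for the prompt plus the generation budget.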
|
55 |
+
|
56 |
+
## Intended Uses of BenefitReporter
|
57 |
+
|
58 |
+
- Beneficialness analysis: BenefitReporter can be used to analyze how beneficial it would be for an AI language model to provide a helpful response to a given user prompt. The structured benefit tree it generates identifies potential stakeholders, along with the beneficial actions and effects that apply to each.
|
59 |
+
- Moderation tool: BenefitReporter's output (benefit tree) can be combined with the output of [HarmReporter](https://huggingface.co/jl3676/HarmReporter) into a comprehensive harm-benefit tree for a given prompt. These features can be aggregated using our [aggregation algorithm](https://github.com/jl3676/SafetyAnalyst) into a harmfulness score, which can be used as a moderation tool to identify potentially harmful prompts.
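
To give a feel for what "aggregating features into a score" can mean, here is a toy scoring sketch over a benefit tree. It is not the authors' aggregation algorithm (see the linked repository for that): the weights, the time discount, and the `toy_benefit_score` helper are all placeholders invented for illustration. The likelihood weights are simply the midpoints of the probability ranges given in the chat template.

```python
# Placeholder weights: hypothetical values chosen for illustration only.
EXTENT_WEIGHT = {"minor": 1.0, "significant": 2.0, "substantial": 3.0, "major": 4.0}
# Midpoints of the ranges in the chat template (<30%, 30-70%, >70%).
LIKELIHOOD_WEIGHT = {"low": 0.15, "medium": 0.5, "high": 0.85}
IMMEDIATE_FACTOR = 1.0  # multiplier for short-term effects
DELAYED_FACTOR = 0.5    # multiplier for long-term effects (assumed discount)


def toy_benefit_score(tree):
    """Sum extent * likelihood * time factor over every effect in the tree."""
    total = 0.0
    for node in tree:
        for benefit in node["benefits"]:
            for eff in benefit["effects"]:
                time_factor = IMMEDIATE_FACTOR if eff["immediacy"] else DELAYED_FACTOR
                total += (EXTENT_WEIGHT[eff["extent"].lower()]
                          * LIKELIHOOD_WEIGHT[eff["likelihood"].lower()]
                          * time_factor)
    return total


# Hypothetical, hand-written tree with two effects.
example_tree = [
    {
        "stakeholder": "User",
        "benefits": [
            {
                "action": "Learning how to budget",
                "effects": [
                    {"effect": "10. Gain of accurate information access",
                     "immediacy": True, "extent": "Significant", "likelihood": "High"},
                    {"effect": "7. Financial property gains",
                     "immediacy": False, "extent": "Minor", "likelihood": "Medium"},
                ],
            }
        ],
    }
]

score = toy_benefit_score(example_tree)
print(score)
```

A moderation score would additionally need the harm side from HarmReporter and a rule for trading harms off against benefits, neither of which is sketched here.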
|
60 |
+
|
61 |
+
## Limitations
|
62 |
+
|
63 |
+
Though it shows state-of-the-art performance on prompt safety classification, BenefitReporter will sometimes generate inaccurate features and the aggregated harmfulness score may not always lead to correct judgments. Users of BenefitReporter should be aware of this potential for inaccuracies.
|
64 |
+
|
65 |
+
## Citation
|
66 |
+
|
67 |
+
```
|
68 |
+
@misc{li2024safetyanalystinterpretabletransparentsteerable,
|
69 |
+
title={SafetyAnalyst: Interpretable, transparent, and steerable LLM safety moderation},
|
70 |
+
author={Jing-Jing Li and Valentina Pyatkin and Max Kleiman-Weiner and Liwei Jiang and Nouha Dziri and Anne G. E. Collins and Jana Schaich Borg and Maarten Sap and Yejin Choi and Sydney Levine},
|
71 |
+
year={2024},
|
72 |
+
eprint={2410.16665},
|
73 |
+
archivePrefix={arXiv},
|
74 |
+
primaryClass={cs.CL},
|
75 |
+
url={https://arxiv.org/abs/2410.16665},
|
76 |
+
}
|
77 |
+
```
|
tokenizer.json
CHANGED
The diff for this file is too large to render.
|
|
tokenizer_config.json
CHANGED
@@ -2114,7 +2114,7 @@
|
|
2114 |
}
|
2115 |
},
|
2116 |
"bos_token": "<|begin_of_text|>",
|
2117 |
-
"chat_template": "
|
2118 |
"clean_up_tokenization_spaces": true,
|
2119 |
"eos_token": "<|eot_id|>",
|
2120 |
"model_input_names": [
|
|
|
2114 |
}
|
2115 |
},
|
2116 |
"bos_token": "<|begin_of_text|>",
|
2117 |
+
"chat_template": "<|system|>\nGiven a scenario based on an interaction between an AI language model and its user, who has the capabilities of an everyday person, analyze the benefits of the scenario on all stakeholders without additional explanations in plain text. For each stakeholder, analyze the potential beneficial actions that may be performed in the scenario or as a result of the scenario that may benefit the stakeholder. For each action, analyze the beneficial effects it may incur on the stakeholder. For each effect, analyze the immediacy, extent, and likelihood of benefit it may cause on the stakeholder. Return your output in a JSON format defined below for each entry. *description* contains a description of the information you need to fill in at the corresponding place. Only return the JSON string. Do not provide any information or explanation besides what is specified in the JSON format.\n\nThe JSON format is:\n[\n{\n\"stakeholder\": *fill in with an individual, group, community, or entity in society that may be affected as a result of this scenario*,\n\"benefits\": [\n{\n\"action\": *fill in with a beneficial action that may be performed in the scenario or as a result of the scenario that may benefit this stakeholder*,\n\"effects\": [\n{\"effect\": *fill in with the number and name of a beneficial effect category defined in the given taxonomy of beneficial effects that this beneficial action may cause on this stakeholder given this scenario*,\n\"immediacy\": *true if the beneficial effect is short-term or immediate, and false if the beneficial effect is long-term or downstream*,\n\"extent\": *the extent to which the stakeholder may benefit from this beneficial effect (minor, significant, substantial, or major, as defined below)*,\n\"likelihood\": *the likelihood that the stakeholder may benefit from this beneficial effect (low, medium, or high, as defined below)*\n}, ... *List as many distinct and valid beneficial effects as you can*\n]\n}, ... 
*List as many more distinct and valid beneficial actions as you can*\n]\n}, ... *list as many distinct and valid stakeholders as you can*]\n\nThe taxonomy of beneficial effects is defined as:\n#Physical benefit#\n1. Prevention of death\n2. Physical health benefits\n\n#Psychological benefit#\n3. Transient emotional changes\n4. Persistent emotional state changes\n\n#Social benefit#\n5. Social gains\n\n#Property benefit#\n6. Physical property gains\n7. Financial property gains\n8. Intellectual property gains\n\n#Liberty benefit#\n9. Freedom and agency benefits\n10. Gain of accurate information access\n\n#Collective benefit#\n11. Institutional benefits \n12. Public resource gains\n13. Societal order maintenance\n\n#Ecological benefit#\n14. Benefits to non-human animals\n15. Ecosystem benefitsThe extents of benefit are defined as:\nMinor: beneficial consequences are isolated, temporary, noticeable but have limited impact on the overall situation or outcome; the stakeholder may experience a limited enhancement of functional ability and/or quality of life.\nSignificant: beneficial consequences are widespread, significant, noticeable, and can lead to meaningful improvements and tangible impacts; the stakeholder may experience significant improvement in functional ability and/or quality of life.\nSubstantial: beneficial consequences are extensive, important, and can lead to considerable positive changes and profound impact on the situation or outcome; the stakeholder may experience substantial enhancement of functional ability and/or quality of life.\n Major: beneficial consequences are systemic, persistent, critical, highly impactful, and can lead to transformative changes that significantly alter the courses of events; the stakeholder may experience a profound improvement in functional ability and/or quality of life.\nThe likelihoods of benefits are defined as:\nLow: unlikely to occur, but not impossible. 
Rare under normal circumstances; less than 30% chance of occurring.\nMedium: possible occurrence, might happen under certain conditions or occasionally; between 30% and 70% chance of occurring.\nHigh: likely to occur, happens frequently or under normal conditions; above 70% chance of occurring.\n\n<|user|>\nIn a conversation between an AI language model and a user, the user prompts the AI language model: \"{{ messages }}\" The AI language model complies with the request by providing a helpful, informative answer with specific details and/or step-by-step guidance.\n<|assistant|>\n",
|
2118 |
"clean_up_tokenization_spaces": true,
|
2119 |
"eos_token": "<|eot_id|>",
|
2120 |
"model_input_names": [
|
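
The `chat_template` added in `tokenizer_config.json` has a fixed shape: a `<|system|>` block with the instructions, a `<|user|>` block that embeds the prompt in a scenario sentence, and a trailing `<|assistant|>` marker. The sketch below reproduces that shape with plain string formatting so the layout is easier to see. The `render_prompt` helper is hypothetical, the system instructions are passed in as a placeholder rather than copied in full, and the real template substitutes whatever object is passed as `messages`, so the exact text may differ.

```python
# Scenario sentence copied from the <|user|> part of the chat template.
USER_TURN = (
    'In a conversation between an AI language model and a user, '
    'the user prompts the AI language model: "{prompt}" '
    'The AI language model complies with the request by providing a helpful, '
    'informative answer with specific details and/or step-by-step guidance.'
)


def render_prompt(system_instructions: str, prompt: str) -> str:
    """Assemble system block, user block, and assistant marker in template order."""
    user_turn = USER_TURN.format(prompt=prompt)
    return f"<|system|>\n{system_instructions}\n\n<|user|>\n{user_turn}\n<|assistant|>\n"


# Placeholder standing in for the full instruction text in the template.
rendered = render_prompt("(system instructions go here)", "What is a budget?")
print(rendered)
```

Because the template ends with `<|assistant|>\n`, the model's generation starts directly with the JSON benefit tree.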