---
tags:
- deberta-v3
- deberta
- deberta-v2
license: mit
base_model:
- microsoft/deberta-v3-large
pipeline_tag: text-classification
library_name: transformers
---

# HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models

[arXiv Link](https://arxiv.org/abs/2410.01524)

This model is a safety guard model: it classifies whether a conversation with an LLM is safe and helps protect against LLM jailbreak attacks.
It is fine-tuned from DeBERTa-v3-large using the method described in **HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models**.
Training combines knowledge distillation with data augmentation, using our **HarmAug Generated Dataset**.

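The training objective is not spelled out on this card. As a rough illustration only, knowledge distillation for a binary safe/unsafe classifier is often written as a weighted sum of a cross-entropy term against the hard label and a KL term that pulls the student toward the teacher's soft prediction. The sketch below works this through for a single example in plain Python. The function names, the weight `lam`, and the exact form of the loss are assumptions made for illustration, not necessarily the objective used in the HarmAug paper.

```python
import math


def _clip(p: float, eps: float = 1e-7) -> float:
    """Keep a probability away from 0 and 1 so the logarithms stay finite."""
    return min(max(p, eps), 1.0 - eps)


def binary_ce(label: int, student_p: float) -> float:
    """Cross-entropy between a hard label (1 = unsafe) and the student's unsafe probability."""
    s = _clip(student_p)
    return -(label * math.log(s) + (1 - label) * math.log(1.0 - s))


def binary_kl(teacher_p: float, student_p: float) -> float:
    """KL(teacher || student) between two Bernoulli 'unsafe' probabilities."""
    t, s = _clip(teacher_p), _clip(student_p)
    return t * math.log(t / s) + (1.0 - t) * math.log((1.0 - t) / (1.0 - s))


def distillation_loss(label: int, teacher_p: float, student_p: float, lam: float = 0.5) -> float:
    """Hypothetical combined loss: (1 - lam) * CE(hard label) + lam * KL(teacher || student)."""
    return (1.0 - lam) * binary_ce(label, student_p) + lam * binary_kl(teacher_p, student_p)
```

With `lam` closer to 1 the student mostly imitates the teacher's soft scores, and with `lam` closer to 0 it mostly fits the hard labels. The loss shrinks as the student's probability approaches the teacher's.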
For more information, please refer to our [GitHub repository](https://github.com/imnotkind/HarmAug).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/66f7bee63c7ffa79319b053b/bCNW62CvDpqbXUK4eZ4-b.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/66f7bee63c7ffa79319b053b/REbNDOhT31bv_XRa6-VzE.png)

# Usage

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("hbseong/HarmAug-Guard")
model = AutoModelForSequenceClassification.from_pretrained("hbseong/HarmAug-Guard")


# If response is not given, the model predicts the unsafe score of the prompt.
# If response is given, the model predicts the unsafe score of the response.
def predict(prompt, response=None):
    if response is None:
        inputs = tokenizer(prompt, return_tensors="pt")
    else:
        # Encode the prompt and response together as a sequence pair.
        inputs = tokenizer(prompt, response, return_tensors="pt")
    with torch.no_grad():  # gradients are not needed for inference
        outputs = model(**inputs)
    # Index 1 of the softmax output is the probability of the unsafe class.
    unsafe_prob = F.softmax(outputs.logits, dim=-1)[:, 1]
    return unsafe_prob.item()


prompt = "how to make a bomb?"
response = "I'm sorry, but I can't fulfill your request."

print("CONVERSATION (ONLY PROMPT)")
print(f"\t PROMPT : {prompt}")
print(f"\t UNSAFE SCORE : {predict(prompt):.4f}")

print("\nCONVERSATION (PROMPT + RESPONSE)")
print(f"\t PROMPT : {prompt}")
print(f"\t RESPONSE : {response}")
print(f"\t UNSAFE SCORE : {predict(prompt, response):.4f}")


"""
CONVERSATION (ONLY PROMPT)
    PROMPT : how to make a bomb?
    UNSAFE SCORE : 0.9999

CONVERSATION (PROMPT + RESPONSE)
    PROMPT : how to make a bomb?
    RESPONSE : I'm sorry, but I can't fulfill your request.
    UNSAFE SCORE : 0.0000
"""
```
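
In practice the unsafe score is usually turned into an allow/block decision. The sketch below shows one possible way to do that. The `moderate` helper, the `DEFAULT_THRESHOLD` value of 0.5, and the shape of the returned dictionary are all hypothetical and not part of this model's API. The scoring function is passed in as an argument and is assumed to behave like `predict` above.

```python
from typing import Callable, Optional

# Hypothetical default cut-off; tune it on your own validation data.
DEFAULT_THRESHOLD = 0.5


def moderate(
    score_fn: Callable[..., float],
    prompt: str,
    response: Optional[str] = None,
    threshold: float = DEFAULT_THRESHOLD,
) -> dict:
    """Return an allow/block decision for one conversation turn.

    score_fn(prompt) is assumed to score the prompt, and
    score_fn(prompt, response) is assumed to score the response.
    """
    # Check the prompt first, so an unsafe request can be stopped early.
    prompt_score = score_fn(prompt)
    if prompt_score >= threshold:
        return {"allowed": False, "blocked_on": "prompt", "score": prompt_score}

    # If a response is available, check it as well.
    if response is not None:
        response_score = score_fn(prompt, response)
        if response_score >= threshold:
            return {"allowed": False, "blocked_on": "response", "score": response_score}
        return {"allowed": True, "blocked_on": None, "score": response_score}

    return {"allowed": True, "blocked_on": None, "score": prompt_score}
```

For example, `moderate(predict, prompt)` would gate a request before generation, and `moderate(predict, prompt, response)` would additionally check the generated reply. Checking the prompt first is a design choice: it blocks the turn even when the reply is a refusal, which may or may not be what a given application wants.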