---
tags:
- deberta-v3
- deberta
- deberta-v2
license: mit
base_model:
- microsoft/deberta-v3-large
pipeline_tag: text-classification
library_name: transformers
---
# HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models
[arXiv Link](https://arxiv.org/abs/2410.01524)
Our model functions as a Guard Model, intended to classify the safety of conversations with LLMs and protect against LLM jailbreak attacks.
It is fine-tuned from DeBERTa-v3-large and trained using **HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models**.
The training process involves knowledge distillation paired with data augmentation, using our [**HarmAug Generated Dataset**].
For more information, please refer to our [GitHub repository](https://github.com/imnotkind/HarmAug).
![image/png](https://cdn-uploads.huggingface.co/production/uploads/66f7bee63c7ffa79319b053b/bCNW62CvDpqbXUK4eZ4-b.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/66f7bee63c7ffa79319b053b/REbNDOhT31bv_XRa6-VzE.png)
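To give a rough idea of what knowledge distillation means for a binary safety classifier, here is a minimal, generic sketch in plain Python. It is not taken from the HarmAug code base and may differ from the objective used in the paper: the function names `bce` and `distillation_loss`, the mixing weight `lam`, and the use of binary cross-entropy for both terms are all assumptions made for illustration. The student is pushed toward both the hard label and the teacher's soft unsafe probability.

```python
import math


def bce(target, pred, eps=1e-12):
    """Binary cross-entropy between a target in [0, 1] and a predicted probability."""
    pred = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def distillation_loss(student_prob, teacher_prob, hard_label, lam=0.5):
    """Hypothetical mixed objective: hard-label term plus teacher soft-label term.

    student_prob: unsafe probability predicted by the small student model
    teacher_prob: unsafe probability predicted by the large teacher model
    hard_label:   1 for unsafe, 0 for safe
    lam:          illustrative weight on the hard-label term (an assumption)
    """
    hard_term = bce(hard_label, student_prob)
    soft_term = bce(teacher_prob, student_prob)
    return lam * hard_term + (1.0 - lam) * soft_term


# A student that agrees with the teacher incurs a lower loss than one that disagrees.
agree = distillation_loss(student_prob=0.9, teacher_prob=0.9, hard_label=1)
disagree = distillation_loss(student_prob=0.1, teacher_prob=0.9, hard_label=1)
print(f"agree: {agree:.4f}, disagree: {disagree:.4f}")
```

In a setup like this, augmented examples would simply be additional inputs scored by the teacher and fed through the same loss; see the paper and repository for the actual training recipe.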
# Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch.nn.functional as F
tokenizer = AutoTokenizer.from_pretrained("hbseong/HarmAug-Guard")
model = AutoModelForSequenceClassification.from_pretrained("hbseong/HarmAug-Guard")
# If response is not given, the model will predict the unsafe score of the prompt.
# If response is given, the model will predict the unsafe score of the response.
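def predict(prompt, response=None):
    if response is None:
        # Prompt only: score the prompt itself.
        inputs = tokenizer(prompt, return_tensors="pt")
    else:
        # Prompt + response: encode them as a sentence pair and score the response.
        inputs = tokenizer(prompt, response, return_tensors="pt")
    outputs = model(**inputs)
    # Index 1 of the softmax output is the probability of the "unsafe" class.
    unsafe_prob = F.softmax(outputs.logits, dim=-1)[:, 1]
    return unsafe_prob.item()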
prompt = "how to make a bomb?"
response = "I'm sorry, but I can't fulfill your request."
print("CONVERSATION (ONLY PROMPT)")
print(f"\t PROMPT : {prompt}")
print(f"\t UNSAFE SCORE : {predict(prompt):.4f}")
print("\nCONVERSATION (PROMPT + RESPONSE)")
print(f"\t PROMPT : {prompt}")
print(f"\t RESPONSE : {response}")
print(f"\t UNSAFE SCORE : {predict(prompt, response):.4f}")
"""
CONVERSATION (ONLY PROMPT)
PROMPT : how to make a bomb?
UNSAFE SCORE : 0.9999
CONVERSATION (PROMPT + RESPONSE)
PROMPT : how to make a bomb?
RESPONSE : I'm sorry, but I can't fulfill your request.
UNSAFE SCORE : 0.0000
"""
```
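A score on its own still has to be turned into an allow/block decision. The sketch below, in plain Python, mirrors the softmax step in `predict` (index 1 is the unsafe class, as in the snippet above) and then applies a cutoff. The helper names `unsafe_probability` and `flag_unsafe` and the default threshold of 0.5 are illustrative assumptions, not part of the released model; a suitable threshold should be chosen by validating on your own data.

```python
import math


def unsafe_probability(logits):
    """Softmax over [safe_logit, unsafe_logit]; returns the probability at index 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


def flag_unsafe(score, threshold=0.5):
    """Return True when the unsafe score meets or exceeds the threshold.

    The 0.5 default is an assumption for illustration only.
    """
    return score >= threshold


# Made-up logits, just to exercise the helpers.
score = unsafe_probability([-4.0, 5.0])
print(f"unsafe score: {score:.4f}, flagged: {flag_unsafe(score)}")
```

With the real model, `score` would be the value returned by `predict(prompt)` or `predict(prompt, response)`. A lower threshold blocks more borderline content at the cost of more false positives.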