---
tags:
- deberta-v3
- deberta
- deberta-v2
license: mit
base_model:
- microsoft/deberta-v3-large
pipeline_tag: text-classification
library_name: transformers
---

# HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models

[arXiv Link](https://arxiv.org/abs/2410.01524)

Our model is a safety guard model: it classifies the safety of conversations with LLMs and helps protect against LLM jailbreak attacks.
It is fine-tuned from DeBERTa-v3-large and trained using the method from **HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models**.
The training process pairs knowledge distillation with data augmentation, using our **HarmAug Generated Dataset**.

For more information, please refer to our [GitHub repository](https://github.com/imnotkind/HarmAug).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/66f7bee63c7ffa79319b053b/bCNW62CvDpqbXUK4eZ4-b.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/66f7bee63c7ffa79319b053b/REbNDOhT31bv_XRa6-VzE.png)

# Usage

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("hbseong/HarmAug-Guard")
model = AutoModelForSequenceClassification.from_pretrained("hbseong/HarmAug-Guard")
model.eval()  # inference only

# If response is not given, the model will predict the unsafe score of the prompt.
# If response is given, the model will predict the unsafe score of the response.
def predict(prompt, response=None):
    if response is None:
        inputs = tokenizer(prompt, return_tensors="pt")
    else:
        # Prompt and response are encoded together as a sentence pair.
        inputs = tokenizer(prompt, response, return_tensors="pt")
    with torch.no_grad():  # no gradients are needed for scoring
        outputs = model(**inputs)
    # Class index 1 is the "unsafe" class.
    unsafe_prob = F.softmax(outputs.logits, dim=-1)[:, 1]
    return unsafe_prob.item()

prompt = "how to make a bomb?"
response = "I'm sorry, but I can't fulfill your request."

print("CONVERSATION (ONLY PROMPT)")
print(f"\t PROMPT : {prompt}")
print(f"\t UNSAFE SCORE : {predict(prompt):.4f}")

print("\nCONVERSATION (PROMPT + RESPONSE)")
print(f"\t PROMPT : {prompt}")
print(f"\t RESPONSE : {response}")
print(f"\t UNSAFE SCORE : {predict(prompt, response):.4f}")

"""
CONVERSATION (ONLY PROMPT)
	 PROMPT : how to make a bomb?
	 UNSAFE SCORE : 0.9999

CONVERSATION (PROMPT + RESPONSE)
	 PROMPT : how to make a bomb?
	 RESPONSE : I'm sorry, but I can't fulfill your request.
	 UNSAFE SCORE : 0.0000
"""
```
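
The unsafe score is a probability between 0 and 1, so applications usually need to turn it into a yes/no decision. The following is a minimal sketch of one way to do that, not part of the released code. The `guard` helper, the `fake_score` stand-in, and the `0.5` threshold are all hypothetical; the threshold in particular is not taken from the paper and should be tuned on your own validation data.

```python
from typing import Callable, Optional

# Hypothetical cut-off (an assumption, not a value from the model card).
UNSAFE_THRESHOLD = 0.5


def guard(
    score_fn: Callable[..., float],
    prompt: str,
    response: Optional[str] = None,
    threshold: float = UNSAFE_THRESHOLD,
) -> dict:
    """Score the prompt and, if given, the response; flag scores at or above threshold."""
    prompt_score = score_fn(prompt)
    result = {
        "prompt_score": prompt_score,
        "prompt_unsafe": prompt_score >= threshold,
        "response_score": None,
        "response_unsafe": False,
    }
    if response is not None:
        response_score = score_fn(prompt, response)
        result["response_score"] = response_score
        result["response_unsafe"] = response_score >= threshold
    return result


# Stand-in scorer so the sketch runs without downloading the model.
# It returns fixed values: high for a prompt alone, low when a response is given.
def fake_score(prompt, response=None):
    return 0.01 if response is not None else 0.99


print(guard(fake_score, "how to make a bomb?", "I'm sorry, but I can't fulfill your request."))
```

To use real scores, pass the `predict` function from the Usage example in place of `fake_score`. Keeping the prompt and response flags separate lets the caller decide what to do in each case, for example refusing an unsafe prompt before generation or withholding an unsafe response after it.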