|
--- |
|
model-index: |
|
- name: roberta-large-self-disclosure-sentence-classification |
|
results: [] |
|
language: |
|
- en |
|
base_model: FacebookAI/roberta-large |
|
license: cc-by-nc-2.0 |
|
tags: |
|
- roberta |
|
- privacy |
|
- self-disclosure classification |
|
- PII |
|
--- |
|
|
|
# Model Card for roberta-large-self-disclosure-sentence-classification |
|
|
|
This model classifies whether a given sentence contains self-disclosure. It is a binary sentence-level classifier: label 1 means the sentence contains self-disclosure, and label 0 means it does not.
|
|
|
For more details, please read the paper: [Reducing Privacy Risks in Online Self-Disclosures with Language Models](https://arxiv.org/abs/2311.09538).
|
|
|
#### Accessing this model implies automatic agreement to the following guidelines: |
|
1. Only use the model for research purposes. |
|
2. No redistribution without the author's agreement. |
|
3. Any derivative works created using this model must acknowledge the original author. |
|
|
|
### Model Description |
|
|
|
- **Model type:** A fine-tuned sentence-level classifier that predicts whether a given sentence contains self-disclosure.
|
- **Language(s) (NLP):** English |
|
- **License:** Creative Commons Attribution-NonCommercial 2.0 (CC BY-NC 2.0)
|
- **Finetuned from model:** [FacebookAI/roberta-large](https://huggingface.co/FacebookAI/roberta-large) |
|
|
|
|
|
### Example Code |
|
```python |
|
import torch |
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig |
|
|
|
config = AutoConfig.from_pretrained("douy/roberta-large-self-disclosure-sentence-classification") |
|
tokenizer = AutoTokenizer.from_pretrained("douy/roberta-large-self-disclosure-sentence-classification") |
|
|
|
model = AutoModelForSequenceClassification.from_pretrained("douy/roberta-large-self-disclosure-sentence-classification", |
|
config=config, device_map="cuda:0").eval() |
|
|
|
sentences = [ |
|
"I am a 23-year-old who is currently going through the last leg of undergraduate school.", |
|
"There is a joke in the design industry about that.", |
|
"My husband and I live in US.", |
|
"I was messing with advanced voice the other day and I was like, 'Oh, I can do this.'", |
|
] |
|
|
|
inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True).to(model.device) |
|
|
|
with torch.no_grad(): |
|
logits = model(**inputs).logits |
|
|
|
# The predicted class is the argmax of each row's logits
predicted_class = logits.argmax(dim=-1)
|
|
|
# 1 means the sentence contains self-disclosure |
|
# 0 means the sentence does not contain self-disclosure |
|
|
|
# predicted_class: tensor([1, 0, 1, 0], device='cuda:0') |
|
``` |
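If you also want a confidence score for each prediction, you can apply a softmax over the logits before taking the argmax. The sketch below is self-contained for illustration: the logit values are placeholders, not actual outputs of this model.

```python
import torch

# Placeholder logits for four sentences (illustrative values only,
# not real outputs of the model above); columns are [label 0, label 1]
logits = torch.tensor([
    [-2.1, 3.4],
    [2.8, -1.9],
    [-0.5, 1.2],
    [1.0, -0.3],
])

# Softmax over the class dimension turns logits into per-class probabilities
probs = torch.softmax(logits, dim=-1)

# Argmax recovers the hard prediction; probs[i, pred] is its confidence
predicted_class = probs.argmax(dim=-1)

for p, c in zip(probs, predicted_class):
    label = "self-disclosure" if c == 1 else "no self-disclosure"
    print(f"{label}: confidence {p[c]:.2f}")
```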
|
|
|
### Evaluation |
|
The model achieves 88.6% accuracy on the binary self-disclosure sentence classification task.
|
|
|
## Citation |
|
``` |
|
@article{dou2023reducing, |
|
title={Reducing Privacy Risks in Online Self-Disclosures with Language Models}, |
|
author={Dou, Yao and Krsek, Isadora and Naous, Tarek and Kabra, Anubha and Das, Sauvik and Ritter, Alan and Xu, Wei}, |
|
journal={arXiv preprint arXiv:2311.09538}, |
|
year={2023} |
|
} |
|
``` |