|
--- |
|
model-index: |
|
- name: deberta-v3-large-self-disclosure-detection |
|
results: [] |
|
language: |
|
- en |
|
base_model: microsoft/deberta-v3-large |
|
license: cc-by-nc-2.0 |
|
tags: |
|
- deberta |
|
- privacy |
|
- self-disclosure identification |
|
- PII |
|
--- |
|
|
|
# Model Card for deberta-v3-large-self-disclosure-detection |
|
|
|
The model detects self-disclosures (personal information) in a sentence, framed as binary token classification: each word is labeled either "DISCLOSURE" or "O".

For example, "I am 22 years old and ..." is labeled ["DISCLOSURE", "DISCLOSURE", "DISCLOSURE", "DISCLOSURE", "DISCLOSURE", "O", ...].
|
|
|
The model can detect disclosures in the following 17 categories: "Age", "Age_Gender", "Appearance", "Education", "Family", "Finance", "Gender", "Health", "Husband_BF",
|
"Location", "Mental_Health", "Occupation", "Pet", "Race_Nationality", "Relationship_Status", "Sexual_Orientation", "Wife_GF". |
|
|
|
For more details, please read the paper: [Reducing Privacy Risks in Online Self-Disclosures with Language Models](https://arxiv.org/abs/2311.09538).
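
As a quick start, the checkpoint should also work with the generic `transformers` token-classification pipeline; this is a minimal sketch, not verified against this checkpoint, and the word-aligned reference usage is in the Example Code section below:

```python
from transformers import pipeline

# Assumed quick start; the repo id is the same one used in Example Code below.
detector = pipeline(
    "token-classification",
    model="douy/deberta-v3-large-self-disclosure-detection-binary",
    aggregation_strategy="simple",  # merge consecutive sub-word tokens into spans
)
print(detector("My husband and I live in US."))
```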
|
|
|
#### By accessing this model, you agree to the following guidelines:
|
1. Only use the model for research purposes. |
|
2. No redistribution without the author's agreement. |
|
3. Any derivative works created using this model must acknowledge the original author. |
|
|
|
### Model Description |
|
|
|
- **Model type:** A binary token-classification model, fine-tuned to detect self-disclosures
|
- **Language(s) (NLP):** English |
|
- **License:** Creative Commons Attribution-NonCommercial 2.0 (cc-by-nc-2.0)
|
- **Finetuned from model:** [microsoft/deberta-v3-large](https://huggingface.co/microsoft/deberta-v3-large) |
|
|
|
|
|
### Example Code |
|
```python |
|
import torch |
|
from torch.utils.data import DataLoader, Dataset |
|
|
|
import datasets |
|
from datasets import ClassLabel, load_dataset |
|
|
|
from transformers import AutoModelForTokenClassification, AutoTokenizer, AutoConfig, DataCollatorForTokenClassification |
|
|
|
model_path = "douy/deberta-v3-large-self-disclosure-detection-binary" |
|
|
|
config = AutoConfig.from_pretrained(model_path)
|
label2id = config.label2id |
|
id2label = config.id2label |
|
|
|
config.num_labels = 2  # binary task: "O" vs. "DISCLOSURE"
|
|
|
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
|
|
|
model = AutoModelForTokenClassification.from_pretrained(model_path, config=config, device_map="cuda:0").eval() |
|
|
|
def tokenize_and_align_labels(words): |
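    # Dummy labels: "O" on the first sub-token of each word, -100 everywhere else,
    # so that exactly one prediction per word can be read back after inference.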
|
tokenized_inputs = tokenizer( |
|
words, |
|
padding=False, |
|
is_split_into_words=True, |
|
) |
|
|
|
    # Map each sub-token back to the index of the word it came from.
|
word_ids = tokenized_inputs.word_ids(0) |
|
previous_word_idx = None |
|
label_ids = [] |
|
for word_idx in word_ids: |
|
# Special tokens have a word id that is None. We set the label to -100 so they are automatically |
|
# ignored in the loss function. |
|
if word_idx is None: |
|
label_ids.append(-100) |
|
# We set the label for the first token of each word. |
|
elif word_idx != previous_word_idx: |
|
label_ids.append(label2id["O"]) |
|
# For the other tokens in a word, we set the label to -100 |
|
else: |
|
label_ids.append(-100) |
|
previous_word_idx = word_idx |
|
tokenized_inputs["labels"] = label_ids |
|
return tokenized_inputs |
|
|
|
class DisclosureDataset(Dataset): |
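    # Thin Dataset wrapper: tokenization happens lazily in __getitem__, and
    # DataCollatorForTokenClassification pads each batch dynamically.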
|
def __init__(self, inputs, tokenizer, tokenize_and_align_labels_function): |
|
self.inputs = inputs |
|
self.tokenizer = tokenizer |
|
self.tokenize_and_align_labels_function = tokenize_and_align_labels_function |
|
|
|
def __len__(self): |
|
return len(self.inputs) |
|
|
|
def __getitem__(self, idx): |
|
words = self.inputs[idx] |
|
tokenized_inputs = self.tokenize_and_align_labels_function(words) |
|
return tokenized_inputs |
|
|
|
|
|
sentences = [ |
|
"I am a 23-year-old who is currently going through the last leg of undergraduate school.", |
|
"We also partnered with news and data providers to add up-to-date information and new visual designs for categories like weather, stocks, sports, news, and maps.", |
|
"My husband and I live in US.", |
|
"I was messing with advanced voice the other day and I was like, 'Oh, I can do this.'", |
|
] |
|
|
|
inputs = [sentence.split() for sentence in sentences] |
|
|
|
data_collator = DataCollatorForTokenClassification(tokenizer) |
|
|
|
dataset = DisclosureDataset(inputs, tokenizer, tokenize_and_align_labels) |
|
|
|
dataloader = DataLoader(dataset, collate_fn=data_collator, batch_size=2) |
|
|
|
total_predictions = [] |
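
# Run inference batch by batch; positions labeled -100 (special tokens and
# non-first sub-tokens) are filtered out so predictions align one-to-one with words.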
|
for step, batch in enumerate(dataloader): |
|
batch = {k: v.to(model.device) for k, v in batch.items()} |
|
with torch.inference_mode(): |
|
outputs = model(**batch) |
|
predictions = outputs.logits.argmax(-1) |
|
labels = batch["labels"] |
|
|
|
predictions = predictions.cpu().tolist() |
|
labels = labels.cpu().tolist() |
|
|
|
true_predictions = [] |
|
for i, label in enumerate(labels): |
|
true_pred = [] |
|
for j, m in enumerate(label): |
|
if m != -100: |
|
true_pred.append(id2label[predictions[i][j]]) |
|
true_predictions.append(true_pred) |
|
total_predictions.extend(true_predictions) |
|
|
|
|
|
for words, preds in zip(inputs, total_predictions):

    for w, p in zip(words, preds):

        print(w, p)
|
|
|
``` |
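
To turn the word-level tags into human-readable spans, consecutive "DISCLOSURE" words can be merged in post-processing. Below is a minimal sketch (the `to_spans` helper is illustrative, not part of the original script) that reuses `inputs` and `total_predictions` from the code above:

```python
# Hypothetical helper: group consecutive DISCLOSURE-tagged words into span strings.
def to_spans(words, preds):
    spans, current = [], []
    for w, p in zip(words, preds):
        if p == "DISCLOSURE":
            current.append(w)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:  # flush a span that runs to the end of the sentence
        spans.append(" ".join(current))
    return spans

for words, preds in zip(inputs, total_predictions):
    print(to_spans(words, preds))
```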
|
|
|
## Citation |
|
``` |
|
@article{dou2023reducing, |
|
title={Reducing Privacy Risks in Online Self-Disclosures with Language Models}, |
|
author={Dou, Yao and Krsek, Isadora and Naous, Tarek and Kabra, Anubha and Das, Sauvik and Ritter, Alan and Xu, Wei}, |
|
journal={arXiv preprint arXiv:2311.09538}, |
|
year={2023} |
|
} |
|
``` |