---
language:
- en
tags:
- redaction
- PII
- NLP
- privacy
license: apache-2.0
datasets:
- harryroy/ninja-redact-2-large
metrics:
- accuracy
- f1
model-index:
- name: Ninja-Masker-2-PII-Redaction
  results:
  - task:
      name: PII Redaction
      type: text-classification
    dataset:
      name: harryroy/ninja-redact-2-large
      type: text
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.95 # Adjust based on your results
    - name: F1 Score
      type: f1
      value: 0.92 # Adjust based on your results
widget:
- text: "John Doe's phone number is 123-456-7890 and his email is [email protected]."
  example_title: "PII Redaction Example"
  output:
    text: "[REDACTED]'s phone number is [REDACTED] and his email is [REDACTED]."
---
![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/6488b81bc6b1f2b4c8d93d4e/kn93E3r_bwBeOzWcEo7gL.jpeg)
Ninja Masker 2
# Model Card: Ninja-Masker-2-PII-Redaction
## Model Overview
**Model Name:** Ninja-Masker-2-PII-Redaction
**Model Type:** Language Model for PII Redaction
**License:** Apache 2.0
**Model Creator:** [Harry Roy McLaughlin](https://www.linkedin.com/in/harryroy/)
**Model Repository:** [Hugging Face Hub - Ninja-Masker-2-PII-Redaction](https://huggingface.co/King-Harry/Ninja-Masker-2-PII-Redaction)
### Model Description
Ninja-Masker-2-PII-Redaction is a fine-tuned language model designed to identify and redact Personally Identifiable Information (PII) in text. Based on the Meta-Llama-3.1-8B architecture, it was fine-tuned on a dataset of over 30,000 input-output pairs to perform accurate PII masking with a set of predefined tags. At 8 billion parameters it is comparatively compact and cost-efficient to run, yet remains capable.
### Preprocessing
The training data was formatted using a specific Alpaca-style prompt structure. Each prompt was paired with an instruction and input context, and the model was trained to generate the appropriate redacted output. The model was trained on a variety of PII types, including but not limited to names, email addresses, phone numbers, and credit card information.
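The Alpaca-style formatting described above can be sketched as follows. The template text matches the prompt shown in the usage example later in this card; the helper function and field names are illustrative, not part of the released training code.

```python
# Illustrative sketch of rendering one training pair into the Alpaca-style
# prompt format described above (helper name and fields are assumptions).
ALPACA_TEMPLATE = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

def format_example(instruction: str, input_text: str, output_text: str) -> str:
    """Render one (instruction, input, output) triple as a training prompt."""
    return ALPACA_TEMPLATE.format(
        instruction=instruction, input=input_text, output=output_text
    )

prompt = format_example(
    "Replace all the PII in this text with the predefined tags.",
    "Contact Jane Doe at 555-0100.",
    "Contact [FULLNAME] at [NUMBER].",
)
print(prompt)
```

During training the response field holds the redacted target text; at inference time it is left blank so the model generates it.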
### Quantization and Optimization
To optimize performance and reduce memory usage, the model was fine-tuned using 4-bit quantization. Additional optimizations included memory-efficient attention via xformers and gradient checkpointing, which allowed for efficient training and inference.
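As a rough illustration of why 4-bit quantization matters here, the weight memory of an 8B-parameter model drops to about a quarter of its fp16 footprint. These are back-of-envelope figures, not measured numbers:

```python
# Back-of-envelope weight-memory estimate for an 8B-parameter model.
# Illustrative arithmetic only; real usage also includes activations,
# KV cache, and quantization overhead.
params = 8e9

fp16_gb = params * 2 / 1e9    # 2 bytes per weight -> ~16 GB
int4_gb = params * 0.5 / 1e9  # 4 bits = 0.5 bytes per weight -> ~4 GB

print(f"fp16 weights: ~{fp16_gb:.0f} GB")
print(f"4-bit weights: ~{int4_gb:.0f} GB")
```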
### Training Details
- **Dataset:** HarryRoy/Ninja-Redact-2-large (Custom PII redaction dataset)
- **Training Environment:** Google Colab, NVIDIA A100 GPU
- **Training Framework:** PyTorch with Hugging Face Transformers, Unsloth
- **Training Configuration:**
- Max sequence length: 2048 tokens
- Batch size: 8
- Gradient accumulation steps: 4
- Learning rate: 1e-5
- Epochs: 1 (500 steps)
- Optimizer: AdamW 8-bit
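As a quick sanity check, the settings above imply the following effective batch size and number of examples seen (simple arithmetic on the listed figures, not logged training metrics):

```python
# Effective batch size implied by the training configuration above.
batch_size = 8
grad_accum_steps = 4
steps = 500

effective_batch = batch_size * grad_accum_steps  # sequences per optimizer step
examples_seen = effective_batch * steps          # total sequences over 500 steps

print(effective_batch)  # 32
print(examples_seen)    # 16000
```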
### Model Performance
The model was evaluated on its ability to redact PII from text while preserving the surrounding context and meaning. After fine-tuning, it reliably identifies PII and replaces it with the appropriate tags across a range of text scenarios.
### Use Cases
- **Data Anonymization:** Useful for redacting PII in datasets before sharing or analysis.
- **Email and Document Redaction:** Can be integrated into email processing systems or document management workflows to automatically redact sensitive information.
- **Customer Support:** Enhances customer support systems by ensuring PII is automatically redacted in customer communications.
### Limitations
- **Tag Set:** The model relies on a predefined set of tags for redaction. It may not recognize PII types outside of this set.
- **Context Dependence:** While the model performs well in most scenarios, its accuracy may decrease with highly complex or ambiguous input contexts.
- **Inference Speed:** Depending on the hardware, the model's inference speed may vary, especially for long sequences.
### Ethical Considerations
The model is designed for responsible data management, ensuring that sensitive information is properly anonymized. However, users should be aware of the limitations and should not rely solely on automated redaction for highly sensitive data.
### How to Use
To use this model, you can load it from the Hugging Face Hub and integrate it into your Python or API-based applications. Below is an example of how to load and use the model:
```python
# Install unsloth from its GitHub repository, plus the packages needed
# for 4-bit loading and efficient attention.
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

from unsloth import FastLanguageModel

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
# 4-bit precision keeps memory usage low and speeds up loading.
model_name = "King-Harry/Ninja-Masker-2-PII-Redaction"
model, tokenizer = FastLanguageModel.from_pretrained(model_name, load_in_4bit=True)

# Put the model into inference mode so generation runs with unsloth's optimizations.
FastLanguageModel.for_inference(model)

# Alpaca-style prompt template used to format inputs for the model.
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Fill in the template and encode it as PyTorch tensors, then move them to the GPU.
inputs = tokenizer(
    [
        alpaca_prompt.format(
            "Replace all the PII from this text and use only the following tags: [FULLNAME], [NAME], [EMAIL], [CITY], [JOBAREA], [FIRSTNAME], [STATE], [STREETADDRESS], [URL], [USERNAME], [NUMBER], [JOBTITLE], [LASTNAME], [ACCOUNTNUMBER], [AMOUNT], [BUILDINGNUMBER], [ZIPCODE], [CURRENCY], [STREET], [PASSWORD], [IPV4], [CURRENCYNAME], [ACCOUNTNAME], [GENDER], [COUNTY], [CREDITCARDNUMBER], [DISPLAYNAME], [IPV6], [USERAGENT], [BITCOINADDRESS], [CURRENCYCODE], [JOBTYPE], [IBAN], [ETHEREUMADDRESS], [MAC], [IP], [CREDITCARDISSUER], [CREDITCARDCVV], [MASKEDNUMBER], [SEX], [JOBDESCRIPTOR]",  # instruction
            "Write an email to Kendra Harvey at [email protected] summarizing the key findings from a recent cognitive therapy conference they attended.",  # input
            "",  # output - leave this blank for generation!
        )
    ],
    return_tensors="pt",
).to("cuda")

# Generate up to 64 new tokens; use_cache=True reuses past key values for speed.
outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)

# Decode the generated tokens back into human-readable text, omitting
# special tokens such as padding or start-of-sequence markers.
redacted_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(redacted_text[0])
```
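Note that the decoded string contains the entire prompt as well as the generated redaction. A small helper (hypothetical, not part of the model's API) can isolate the text that follows the `### Response:` marker:

```python
# Hypothetical post-processing helper: strip the echoed prompt and keep
# only the model's answer after the "### Response:" marker.
def extract_response(decoded: str, marker: str = "### Response:") -> str:
    """Return only the model's answer, dropping the echoed prompt."""
    _, _, tail = decoded.partition(marker)
    return tail.strip()

# Example with a made-up decoded string in the Alpaca layout.
decoded = (
    "### Instruction:\nRedact PII.\n"
    "### Input:\nContact Kendra Harvey.\n"
    "### Response:\nContact [FULLNAME]."
)
print(extract_response(decoded))  # Contact [FULLNAME].
```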
### Citation
If you use this model, please consider citing the model repository:
```bibtex
@misc{ninja_masker_2024,
author = {McLaughlin, Harry Roy},
title = {Ninja-Masker-2-PII-Redaction},
year = {2024},
url = {https://huggingface.co/King-Harry/Ninja-Masker-2-PII-Redaction},
}
```