# Model Card: Ninja-Masker-2-PII-Redaction

## Model Overview

**Model Name:** Ninja-Masker-2-PII-Redaction  
**Model Type:** Language Model for PII Redaction  
**License:** Apache 2.0  
**Model Creator:** King Harry (Roy)  
**Model Repository:** [Hugging Face Hub - Ninja-Masker-2-PII-Redaction](https://huggingface.co/King-Harry/Ninja-Masker-2-PII-Redaction)  

### Model Description

Ninja-Masker-2-PII-Redaction is a language model fine-tuned to identify and redact Personally Identifiable Information (PII) in text. It is based on Meta-Llama-3.1-8B and was fine-tuned on a dataset of over 30,000 input-output pairs to replace PII with tags from a predefined set.

### Preprocessing

The training data was formatted with an Alpaca-style prompt structure. Each prompt combined an instruction with an input text, and the model was trained to generate the corresponding redacted output. The training data covered a variety of PII types, including but not limited to names, email addresses, phone numbers, and credit card information.
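The formatting step described above can be sketched as follows. This is a minimal illustration, not the actual training script: the helper name `format_example`, its argument names, and the `<EOS>` placeholder are assumptions, and the template is the one shown in the usage example later in this card.

```python
# Illustrative sketch only: the helper and its argument names are hypothetical.
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


def format_example(instruction: str, text: str, redacted: str, eos_token: str) -> str:
    """Fill the Alpaca template with one training pair and append the EOS token."""
    return ALPACA_PROMPT.format(instruction, text, redacted) + eos_token


sample = format_example(
    instruction="Replace all the PII from this text and use only the following tags: [FIRSTNAME], [EMAIL]",
    text="Please call Dana about the invoice.",
    redacted="Please call [FIRSTNAME] about the invoice.",
    eos_token="<EOS>",  # placeholder; in practice this would be tokenizer.eos_token
)
print(sample)
```

Appending the end-of-sequence token to each training example is a common way to teach the model where the redacted output ends, so that generation stops rather than continuing indefinitely.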

### Quantization and Optimization

To reduce memory usage, the model was fine-tuned using 4-bit quantization. Additional optimizations included Flash Attention (via Xformers) and gradient checkpointing, which lowers memory use during training at the cost of some extra computation.
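To give a rough sense of why 4-bit quantization matters at this model size, the arithmetic below compares weight storage at 16 and 4 bits per parameter. The parameter count is a round-figure assumption for an 8B model, and the estimate covers weights only; quantization metadata, activations, and the KV cache add to the real footprint.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 8e9  # assumption: round figure for an 8B-parameter model

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights:  ~{int4_gib:.1f} GiB")
```

Under these assumptions the weights shrink by about a factor of four, which is what makes fine-tuning and inference feasible on a single GPU.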

### Training Details

- **Dataset:** HarryRoy/Ninja-Redact-2-large (Custom PII redaction dataset)
- **Training Environment:** Google Colab, NVIDIA A100 GPU
- **Training Framework:** PyTorch with Hugging Face Transformers, Unsloth
- **Training Configuration:**
  - Max sequence length: 2048 tokens
  - Batch size: 8
  - Gradient accumulation steps: 4
  - Learning rate: 1e-5
  - Epochs: 1 (500 steps)
  - Optimizer: AdamW 8-bit
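The configuration above implies an effective batch size and a total number of examples processed, which the short calculation below works out. It assumes that the listed batch size is per device and that training ran on a single GPU; neither is stated explicitly in this card.

```python
batch_size = 8          # from the configuration above (assumed to be per device)
grad_accum_steps = 4    # from the configuration above
max_steps = 500         # from the configuration above
num_gpus = 1            # assumption: a single A100

# Gradients are accumulated over several mini-batches before each optimizer step.
effective_batch = batch_size * grad_accum_steps * num_gpus
examples_seen = effective_batch * max_steps

print(f"Effective batch size: {effective_batch}")
print(f"Examples processed in {max_steps} steps: {examples_seen}")
```

Under these assumptions, 500 steps process 16,000 examples, roughly half of the 30,000+ pairs in the dataset. This is an inference from the listed values rather than a statement from the training logs.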

### Model Performance

The model was evaluated based on its ability to accurately redact PII from text while maintaining the original context and meaning. The fine-tuning process resulted in a model that effectively identifies and replaces PII with the appropriate tags in various text scenarios.
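This card does not report a specific metric. As one possible way to quantify redaction quality, the sketch below compares the tags in a model output against a reference redaction and computes tag-level precision and recall. The function names and the example strings are hypothetical, and this is not necessarily how the model was evaluated.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"\[[A-Z0-9]+\]")  # matches tags such as [EMAIL] or [IPV4]


def tag_counts(text: str) -> Counter:
    """Count each redaction tag that appears in the text."""
    return Counter(TAG_RE.findall(text))


def tag_precision_recall(predicted: str, reference: str) -> tuple[float, float]:
    """Compare tag multisets; ignores tag positions, so it is only a coarse check."""
    pred, ref = tag_counts(predicted), tag_counts(reference)
    overlap = sum((pred & ref).values())  # multiset intersection
    precision = overlap / sum(pred.values()) if pred else 1.0
    recall = overlap / sum(ref.values()) if ref else 1.0
    return precision, recall


reference = "Email [FIRSTNAME] [LASTNAME] at [EMAIL]."
predicted = "Email [FIRSTNAME] Harvey at [EMAIL]."  # the last name was missed
precision, recall = tag_precision_recall(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For redaction, recall is usually the more important number, since a missed tag means PII leaked into the output.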

### Use Cases

- **Data Anonymization:** Useful for redacting PII in datasets before sharing or analysis.
- **Email and Document Redaction:** Can be integrated into email processing systems or document management workflows to automatically redact sensitive information.
- **Customer Support:** Can be used in customer support systems to automatically redact PII in customer communications.

### Limitations

- **Tag Set:** The model relies on a predefined set of tags for redaction. It may not recognize PII types outside of this set.
- **Context Dependence:** While the model performs well in most scenarios, its accuracy may decrease with highly complex or ambiguous input contexts.
- **Inference Speed:** Depending on the hardware, the model's inference speed may vary, especially for long sequences.
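Because the model depends on a fixed tag set, it can be useful to verify that an output contains only allowed tags. The sketch below is one way to do that; the tag list is copied from the instruction in the usage example, while the helper name `unknown_tags` is hypothetical.

```python
import re

# Tag set taken from the instruction in the usage example of this card.
ALLOWED_TAGS = {
    "FULLNAME", "NAME", "EMAIL", "CITY", "JOBAREA", "FIRSTNAME", "STATE",
    "STREETADDRESS", "URL", "USERNAME", "NUMBER", "JOBTITLE", "LASTNAME",
    "ACCOUNTNUMBER", "AMOUNT", "BUILDINGNUMBER", "ZIPCODE", "CURRENCY",
    "STREET", "PASSWORD", "IPV4", "CURRENCYNAME", "ACCOUNTNAME", "GENDER",
    "COUNTY", "CREDITCARDNUMBER", "DISPLAYNAME", "IPV6", "USERAGENT",
    "BITCOINADDRESS", "CURRENCYCODE", "JOBTYPE", "IBAN", "ETHEREUMADDRESS",
    "MAC", "IP", "CREDITCARDISSUER", "CREDITCARDCVV", "MASKEDNUMBER", "SEX",
    "JOBDESCRIPTOR",
}

TAG_RE = re.compile(r"\[([A-Z0-9]+)\]")


def unknown_tags(text: str) -> list[str]:
    """Return the sorted, de-duplicated tags in the text that are not in the allowed set."""
    return sorted(set(TAG_RE.findall(text)) - ALLOWED_TAGS)


output = "Hi [FIRSTNAME], your [SSN] is on file under [EMAIL]."
print(unknown_tags(output))
```

An unexpected tag does not necessarily mean the redaction failed, but it signals that the output falls outside what the model was trained to produce and may deserve a manual look.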

### Ethical Considerations

The model is intended to support responsible data management by anonymizing sensitive information. However, users should be aware of its limitations and should not rely solely on automated redaction for highly sensitive data.
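One way to avoid relying solely on the model is to run a simple rule-based check over its output and flag anything that still looks like PII. The sketch below is an assumption-laden illustration: the two patterns are deliberately simple, cover only email addresses and long digit runs, and are not part of the model or its training pipeline.

```python
import re

# Deliberately simple patterns; real deployments would need broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
DIGITS_RE = re.compile(r"\b\d(?:[ -]?\d){6,}\b")  # 7+ digits, optional separators


def residual_pii(text: str) -> list[tuple[str, str]]:
    """Return (kind, match) pairs for strings that still look like PII."""
    findings = [("email", m) for m in EMAIL_RE.findall(text)]
    findings += [("digits", m) for m in DIGITS_RE.findall(text)]
    return findings


redacted = "Contact [FIRSTNAME] at jane@example.com or 555-123-4567."
for kind, match in residual_pii(redacted):
    print(f"possible leak ({kind}): {match}")
```

Any finding can then be routed to manual review instead of being released automatically.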

### How to Use

Load the model from the Hugging Face Hub and integrate it into your Python or API-based applications. The example below loads the model with Unsloth and runs a single redaction prompt; it moves the inputs to `cuda`, so it requires a CUDA-capable GPU:

```python
# Install necessary packages. The leading "!" is notebook (Colab/Jupyter) syntax;
# in a terminal, run the same commands without it.
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

from unsloth import FastLanguageModel

# Load the fine-tuned model from Hugging Face Hub
model_name = "King-Harry/Ninja-Masker-2-PII-Redaction"
model, tokenizer = FastLanguageModel.from_pretrained(model_name, load_in_4bit=True)

# Ensure the model is ready for inference
FastLanguageModel.for_inference(model)

# Define the Alpaca-style prompt
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Define the input text using the Alpaca prompt
inputs = tokenizer(
    [
        alpaca_prompt.format(
            "Replace all the PII from this text and use only the following tags: [FULLNAME], [NAME], [EMAIL], [CITY], [JOBAREA], [FIRSTNAME], [STATE], [STREETADDRESS], [URL], [USERNAME], [NUMBER], [JOBTITLE], [LASTNAME], [ACCOUNTNUMBER], [AMOUNT], [BUILDINGNUMBER], [ZIPCODE], [CURRENCY], [STREET], [PASSWORD], [IPV4], [CURRENCYNAME], [ACCOUNTNAME], [GENDER], [COUNTY], [CREDITCARDNUMBER], [DISPLAYNAME], [IPV6], [USERAGENT], [BITCOINADDRESS], [CURRENCYCODE], [JOBTYPE], [IBAN], [ETHEREUMADDRESS], [MAC], [IP], [CREDITCARDISSUER], [CREDITCARDCVV], [MASKEDNUMBER], [SEX], [JOBDESCRIPTOR]", # instruction
            "Write an email to Kendra Harvey at [email protected] summarizing the key findings from a recent cognitive therapy conference they attended.", # input
            ""  # output - leave this blank for generation!
        )
    ], 
    return_tensors="pt"
).to("cuda")

# Generate the redacted output (increase max_new_tokens for longer inputs)
outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)

# Decode and print the output; the decoded text contains the prompt followed by the response
redacted_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(redacted_text[0])
```
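The decoded string from `model.generate` contains the full prompt followed by the generated response. To keep only the redacted text, the output can be split on the response marker from the Alpaca template. The helper below is a minimal sketch; the name `extract_response` and the shortened example string are hypothetical.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded: str) -> str:
    """Return the text after the last response marker, or the whole string if it is absent."""
    _, sep, tail = decoded.rpartition(RESPONSE_MARKER)
    return tail.strip() if sep else decoded.strip()


# Shortened stand-in for a decoded generation (prompt + response).
decoded = (
    "### Instruction:\nReplace all the PII from this text...\n\n"
    "### Input:\nWrite an email to Kendra Harvey...\n\n"
    "### Response:\nWrite an email to [FULLNAME]..."
)
print(extract_response(decoded))
```

In the example above, this would be applied as `extract_response(redacted_text[0])`.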

### Citation

If you use this model, please consider citing the model repository:

```bibtex
@misc{ninja_masker_2024,
  author = {King Harry (Roy)},
  title = {Ninja-Masker-2-PII-Redaction},
  year = {2024},
  url = {https://huggingface.co/King-Harry/Ninja-Masker-2-PII-Redaction},
}
```