---
license: mit
language:
- en
pipeline_tag: text-generation
tags:
- LLM
- token classification
- nlp
- safetensor
base_model: microsoft/Phi-3-mini-4k-instruct
library_name: transformers

widget:
- text: "My name is Sylvain and I live in Paris"
  example_title: "Parisian"
- text: "My name is Sarah and I live in London"
  example_title: "Londoner"
---


# PII Detection Model - Phi3 Mini Fine-Tuned

This repository contains a fine-tuned version of [Phi-3 Mini](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) for detecting personally identifiable information (PII). The model is trained to recognize the PII entity types listed below, which makes it useful for data redaction, privacy protection, and compliance with data protection regulations.

## Model Overview

### Model Architecture

- **Base Model**: Phi-3 Mini (`microsoft/Phi-3-mini-4k-instruct`)
- **Fine-Tuned For**: PII detection
- **Framework**: [Hugging Face Transformers](https://huggingface.co/transformers/)

### Detected PII Entities

The model can detect the following PII entities:

- **Personal Information**:
  - `firstname`
  - `middlename`
  - `lastname`
  - `sex`
  - `dob` (Date of Birth)
  - `age`
  - `gender`
  - `height`
  - `eyecolor`
  
- **Contact Information**:
  - `email`
  - `phonenumber`
  - `url`
  - `username`
  - `useragent`
  
- **Address Information**:
  - `street`
  - `city`
  - `state`
  - `county`
  - `zipcode`
  - `country`
  - `secondaryaddress`
  - `buildingnumber`
  - `ordinaldirection`
  
- **Geographical Information**:
  - `nearbygpscoordinate`
  
- **Organizational Information**:
  - `companyname`
  - `jobtitle`
  - `jobarea`
  - `jobtype`
  
- **Financial Information**:
  - `accountname`
  - `accountnumber`
  - `creditcardnumber`
  - `creditcardcvv`
  - `creditcardissuer`
  - `iban`
  - `bic`
  - `currency`
  - `currencyname`
  - `currencysymbol`
  - `currencycode`
  - `amount`
  
- **Unique Identifiers**:
  - `pin`
  - `ssn`
  - `imei` (Phone IMEI)
  - `mac` (MAC Address)
  - `vehiclevin` (Vehicle VIN)
  - `vehiclevrm` (Vehicle VRM)
  
- **Cryptocurrency Information**:
  - `bitcoinaddress`
  - `litecoinaddress`
  - `ethereumaddress`
  
- **Other Information**:
  - `ip` (IP Address)
  - `ipv4`
  - `ipv6`
  - `maskednumber`
  - `password`
  - `time`
  - `prefix`

## Usage

### Installation

To use this model, install the `transformers` library along with PyTorch, which the inference example below relies on:

```bash
pip install transformers torch
```

### Run Inference
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use a GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer and model. The model is loaded as a causal LM because
# the inference step below calls generate().
tokenizer = AutoTokenizer.from_pretrained("ab-ai/PII-Model-Phi3-Mini")
model = AutoModelForCausalLM.from_pretrained("ab-ai/PII-Model-Phi3-Mini").to(device)

input_text = "Hi Abner, just a reminder that your next primary care appointment is on 23/03/1926. Please confirm by replying to this email [email protected]."

model_prompt = f"""### Instruction:
    Identify and extract the following PII entities from the text, if present: companyname, pin, currencyname, email, phoneimei, litecoinaddress, currency, eyecolor, street, mac, state, time, vehiclevin, jobarea, date, bic, currencysymbol, currencycode, age, nearbygpscoordinate, amount, ssn, ethereumaddress, zipcode, buildingnumber, dob, firstname, middlename, ordinaldirection, jobtitle, bitcoinaddress, jobtype, phonenumber, height, password, ip, useragent, accountname, city, gender, secondaryaddress, iban, sex, prefix, ipv4, maskednumber, url, username, lastname, creditcardcvv, county, vehiclevrm, ipv6, creditcardissuer, accountnumber, creditcardnumber. Return the output in JSON format.

    ### Input:
    {input_text}

    ### Output: """

# Tokenize the prompt and move the tensors to the same device as the model
inputs = tokenizer(model_prompt, return_tensors="pt").to(device)
# Adjust max_new_tokens to fit the amount of PII you expect in the output
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=120)
# The decoded response contains the prompt followed by the generated text
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
# Example of the generated portion (sampling is enabled, so results can vary):
# {'middlename': ['Abner'], 'dob': ['23/03/1926'], 'email': ['[email protected]']}
```
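
Because the decoded response is plain text, the entity dictionary has to be pulled out of it before it can be used for redaction. The sketch below shows one way to do that. It is not part of the model or of `transformers`: `extract_entities` and `redact` are hypothetical helpers, and the code assumes the generated text after the `### Output:` marker is a single dict of `label: [values]`, in either JSON or Python-literal form (the example output above uses single quotes, which `json.loads` rejects).

```python
import ast
import json


def extract_entities(response, marker="### Output:"):
    """Return the dict found after the last marker, or {} if none can be parsed."""
    tail = response.rsplit(marker, 1)[-1].strip()
    start, end = tail.find("{"), tail.rfind("}")
    if start == -1 or end <= start:
        return {}
    candidate = tail[start:end + 1]
    # Try strict JSON first, then a Python literal (single-quoted strings).
    for parser in (json.loads, ast.literal_eval):
        try:
            parsed = parser(candidate)
        except (ValueError, SyntaxError, TypeError):
            continue
        if isinstance(parsed, dict):
            return parsed
    return {}


def redact(text, entities):
    """Replace every detected value in text with its upper-cased label."""
    pairs = []
    for label, values in entities.items():
        if isinstance(values, str):  # tolerate a bare string instead of a list
            values = [values]
        pairs.extend((v, label) for v in values if isinstance(v, str) and v)
    # Replace longer values first so a short value cannot break up a longer one.
    for value, label in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        text = text.replace(value, f"[{label.upper()}]")
    return text


# Stand-in for a decoded model response; no model call is made here.
sample_response = "### Output: {'middlename': ['Abner'], 'dob': ['23/03/1926']}"
entities = extract_entities(sample_response)
print(redact("Hi Abner, your appointment is on 23/03/1926.", entities))
# Hi [MIDDLENAME], your appointment is on [DOB].
```

Plain string replacement is the simplest option and also masks repeated occurrences of a value. It does not track character offsets, so a short value such as a two-letter state code may also match inside unrelated words.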