---
license: apache-2.0
language:
  - en
library_name: gliner
datasets:
  - gretelai/gretel-pii-masking-en-v1
pipeline_tag: token-classification
tags:
  - PII
  - PHI
  - GLiNER
  - information extraction
  - encoder
  - entity recognition
  - privacy
---

# Gretel GLiNER: Fine-Tuned Models for PII/PHI Detection
This **Gretel GLiNER** model is a fine-tuned version of the GLiNER base model `knowledgator/gliner-bi-large-v1.0`, trained specifically to detect Personally Identifiable Information (PII) and Protected Health Information (PHI).
Gretel GLiNER supports privacy-compliant entity recognition across industries and document types.
For more information about the base GLiNER model, including its architecture and general capabilities, refer to the [GLiNER Model Card](https://huggingface.co/knowledgator/gliner-bi-large-v1.0).

The model was fine-tuned on the `gretelai/gretel-pii-masking-en-v1` dataset, which provides a rich and diverse collection of synthetic document snippets containing PII and PHI entities.

1. **Training:** Utilized the training split of the synthetic dataset.
2. **Validation:** Monitored performance using the validation set to adjust training parameters.
3. **Evaluation:** Assessed final performance on the test set using PII/PHI entity annotations as ground truth.

For detailed statistics on the dataset, including domain and entity type distributions, visit the [dataset documentation on Hugging Face](https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1).

### Model Performance

All fine-tuned Gretel GLiNER models demonstrate substantial improvements over their base counterparts in accuracy, precision, recall, and F1 score. The scores of the fine-tuned models are:

| Model                                 | Accuracy | Precision | Recall | F1 Score |
|---------------------------------------|----------|-----------|--------|----------|
| gretelai/gretel-gliner-bi-small-v1.0   | 0.89     | 0.98      | 0.91   | 0.94     |
| gretelai/gretel-gliner-bi-base-v1.0    | 0.91     | 0.98      | 0.92   | 0.95     |
| gretelai/gretel-gliner-bi-large-v1.0   | 0.91     | 0.99      | 0.93   | 0.95     |
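
As a quick sanity check on the table above, F1 can be recomputed from precision and recall. The sketch below assumes the standard definition of F1 as the harmonic mean of the two; the model card does not say how its scores were aggregated, so treat this as illustrative only. The helper name `f1_score` is hypothetical and not part of the `gliner` package.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall values copied from the table above
reported = {
    "gretelai/gretel-gliner-bi-small-v1.0": (0.98, 0.91),
    "gretelai/gretel-gliner-bi-base-v1.0": (0.98, 0.92),
    "gretelai/gretel-gliner-bi-large-v1.0": (0.99, 0.93),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 ~ {f1_score(p, r):.3f}")
```

Because the table values are already rounded to two decimals, a recomputed F1 can differ from the reported one by 0.01, as it does for the large model.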


## Installation & Usage

Ensure you have Python installed. Then, install or update the `gliner` package:

```bash
pip install gliner -U
```

Load the fine-tuned Gretel GLiNER model using the `GLiNER` class and its `from_pretrained` method. Below is an example using the `gretelai/gretel-gliner-bi-large-v1.0` model for PII/PHI detection:

```python
from gliner import GLiNER

# Load the fine-tuned GLiNER model
model = GLiNER.from_pretrained("gretelai/gretel-gliner-bi-large-v1.0")

# Sample text containing PII/PHI entities
text = """
Purchase Order
----------------
Date: 10/05/2023
----------------
Customer Name: CID-982305
Billing Address: 1234 Oak Street, Suite 400, Springfield, IL, 62704
Phone: (312) 555-7890 (555-876-5432)
Email: [email protected]
"""

# Define the labels for PII/PHI entities
labels = [
    "medical_record_number",
    "date_of_birth",
    "ssn",
    "date",
    "first_name",
    "email",
    "last_name",
    "customer_id",
    "employee_id",
    "name",
    "street_address",
    "phone_number",
    "ipv4",
    "credit_card_number",
    "license_plate",
    "address",
    "user_name",
    "device_identifier",
    "bank_routing_number",
    "date_time",
    "company_name",
    "unique_identifier",
    "biometric_identifier",
    "account_number",
    "city",
    "certificate_license_number",
    "time",
    "postcode",
    "vehicle_identifier",
    "coordinate",
    "country",
    "api_key",
    "ipv6",
    "password",
    "health_plan_beneficiary_number",
    "national_id",
    "tax_id",
    "url",
    "state",
    "swift_bic",
    "cvv",
    "pin"
]

# Predict entities with a confidence threshold of 0.7
entities = model.predict_entities(text, labels, threshold=0.7)

# Display the detected entities
for entity in entities:
    print(f"{entity['text']} => {entity['label']}")
```

Expected output:

```
CID-982305 => customer_id
1234 Oak Street, Suite 400 => street_address
Springfield => city
IL => state
62704 => postcode
(312) 555-7890 => phone_number
555-876-5432 => phone_number
[email protected] => email
```
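
The predictions are plain Python dictionaries, so they can be post-processed without further calls to the model. The sketch below groups detections by label and applies a stricter score cut-off. It assumes each entity dictionary carries `text`, `label`, and `score` keys; the sample list and its scores are made up for illustration, and `group_by_label` is a hypothetical helper rather than part of the `gliner` package.

```python
from collections import defaultdict


def group_by_label(entities, min_score=0.0):
    """Collect entity texts per label, keeping only scores >= min_score."""
    grouped = defaultdict(list)
    for ent in entities:
        # Entities without a score are kept (treated as score 1.0)
        if ent.get("score", 1.0) >= min_score:
            grouped[ent["label"]].append(ent["text"])
    return dict(grouped)


# Hypothetical sample shaped like the model's predictions (scores are invented)
sample_entities = [
    {"text": "CID-982305", "label": "customer_id", "score": 0.95},
    {"text": "(312) 555-7890", "label": "phone_number", "score": 0.92},
    {"text": "555-876-5432", "label": "phone_number", "score": 0.81},
    {"text": "Springfield", "label": "city", "score": 0.74},
]

grouped = group_by_label(sample_entities, min_score=0.8)
print(grouped)
```

With the sample above, the `city` entry is dropped because its invented score falls below the 0.8 cut-off.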

## Use Cases

Gretel GLiNER is ideal for applications requiring detection and redaction of sensitive information:

- Healthcare: Automating the extraction and redaction of patient information from medical records.
- Finance: Identifying and securing financial data such as account numbers and transaction details.
- Cybersecurity: Detecting sensitive information in logs and security reports.
- Legal: Processing contracts and legal documents to protect client information.
- Data Privacy Compliance: Ensuring data handling processes adhere to regulations like GDPR and HIPAA by accurately identifying PII/PHI.
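
Redaction, common to the use cases above, can be sketched as a post-processing step over the detected spans. The example assumes each entity carries character offsets in `start` and `end` such that `text[start:end]` is the entity text, and that spans do not overlap. The `redact` and `span` helpers and the sample text are hypothetical, written for illustration only.

```python
def redact(text, entities):
    """Replace each detected span with a [LABEL] placeholder."""
    redacted = text
    # Work right to left so earlier offsets stay valid after each replacement
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['label'].upper()}]"
        redacted = redacted[: ent["start"]] + placeholder + redacted[ent["end"] :]
    return redacted


def span(text, value, label):
    """Build a hypothetical entity dict by locating value inside text."""
    start = text.index(value)
    return {"start": start, "end": start + len(value), "text": value, "label": label}


sample_text = "Customer Name: CID-982305, Phone: 555-876-5432"
sample_entities = [
    span(sample_text, "CID-982305", "customer_id"),
    span(sample_text, "555-876-5432", "phone_number"),
]

print(redact(sample_text, sample_entities))
# prints: Customer Name: [CUSTOMER_ID], Phone: [PHONE_NUMBER]
```

In practice the entity list would come from the model's predictions rather than from `span`, and overlapping spans would need to be merged or prioritized before redacting.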

## Citation and Usage

If you use this model or the underlying dataset in your research or applications, please cite it as:

```bibtex
@dataset{gretel-pii-masking-en-v1,
  author       = {Gretel AI},
  title        = {GLiNER Models for PII Detection through Fine-Tuning on Gretel-Generated Synthetic Documents},
  year         = {2024},
  month        = {10},
  publisher    = {Gretel},
  howpublished = {https://huggingface.co/gretelai/gretel-pii-masking-en-v1}
}
```

For questions, issues, or additional information, please visit our [Synthetic Data Discord](https://gretel.ai/discord) community or reach out to [gretel.ai](https://gretel.ai/).