mvansegbroeck committed on
Commit 4781986 · verified · 1 Parent(s): f78b721

Update README.md

Files changed (1)
  1. README.md +147 -4
README.md CHANGED
@@ -1,10 +1,10 @@
- --
+ ---
  license: apache-2.0
  language:
- - english
+ - en
  library_name: gliner
  datasets:
- - gretel/synthetic_pii_docs_multidomain_en
+ - gretelai/gretel-pii-masking-en-v1
  pipeline_tag: token-classification
  tags:
  - PII
@@ -14,4 +14,147 @@ tags:
  - encoder
  - entity recognition
  - privacy
- ---
+ ---
+
+ # Gretel GLiNER: Fine-Tuned Models for PII/PHI Detection
+ **Gretel GLiNER** is a version of the GLiNER model fine-tuned to detect Personally Identifiable Information (PII) and Protected Health Information (PHI).
+ It supports privacy-compliant entity recognition across industries and document types.
+ We fine-tuned GLiNER base models to create the following specialized models for PII/PHI detection.
+
+ - **Fine-Tuned Models:**
+ - `gretelai/gretel-gliner-bi-small-v1.0`
+ - `gretelai/gretel-gliner-bi-base-v1.0`
+ - `gretelai/gretel-gliner-bi-large-v1.0`
+
+ For more information about the base GLiNER model, including its architecture and general capabilities, refer to the [GLiNER Model Card](https://huggingface.co/knowledgator/gliner-bi-base-v1.0).
+
+ The models were fine-tuned on the `gretelai/gretel-pii-masking-en-v1` dataset, a diverse collection of synthetic document snippets containing PII and PHI entities.
+
+ 1. **Training:** Used the training split of the synthetic dataset.
+ 2. **Validation:** Monitored performance on the validation split to adjust training parameters.
+ 3. **Evaluation:** Assessed final performance on the test split, using the PII/PHI entity annotations as ground truth.
+
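As a rough illustration of the evaluation step, the sketch below scores predicted entities against gold annotations by exact match on `(start, end, label)`. The exact-match criterion, the `span_prf` helper name, and the sample spans are assumptions made for illustration; the README does not state how its reported metrics were computed.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def span_prf(gold: Iterable[Span], pred: Iterable[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact span-and-label matching."""
    gold_set, pred_set = set(gold), set(pred)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical annotations for a single document.
gold = [(0, 8, "name"), (15, 26, "ssn"), (40, 50, "date")]
pred = [(0, 8, "name"), (15, 26, "ssn"), (60, 70, "email")]

precision, recall, f1 = span_prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is the strictest common choice; a partial-overlap criterion would give higher scores on the same predictions.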
+ For detailed statistics on the dataset, including domain and entity type distributions, see the [dataset documentation on Hugging Face](https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1).
+
+ ## Model Performance
+
+ The fine-tuned Gretel GLiNER models improve substantially on their base counterparts in accuracy, precision, recall, and F1 score. The table reports the scores of the fine-tuned models:
+
+ | Model | Accuracy | Precision | Recall | F1 Score |
+ |---------------------------------------|----------|-----------|--------|----------|
+ | gretelai/gretel-gliner-bi-small-v1.0 | 0.89 | 0.98 | 0.91 | 0.94 |
+ | gretelai/gretel-gliner-bi-base-v1.0 | 0.91 | 0.98 | 0.92 | 0.95 |
+ | gretelai/gretel-gliner-bi-large-v1.0 | 0.91 | 0.99 | 0.93 | 0.95 |
+
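F1 is the harmonic mean of precision and recall, so the table's columns can be cross-checked against each other. The sketch below recomputes F1 from the precision and recall values in the table; the `f1_score` helper is a name chosen here, not part of any library. Because the table values are rounded to two decimals, a recomputed F1 can differ from the reported one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the performance table.
reported = {
    "gretelai/gretel-gliner-bi-small-v1.0": (0.98, 0.91, 0.94),
    "gretelai/gretel-gliner-bi-base-v1.0": (0.98, 0.92, 0.95),
    "gretelai/gretel-gliner-bi-large-v1.0": (0.99, 0.93, 0.95),
}

for model_name, (p, r, f1) in reported.items():
    print(f"{model_name}: recomputed F1 = {f1_score(p, r):.3f}, reported = {f1:.2f}")
```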
+
+ ## Installation & Usage
+
+ Ensure you have Python installed. Then install or update the `gliner` package:
+
+ ```bash
+ pip install gliner -U
+ ```
+
+ Load the fine-tuned Gretel GLiNER model with the `GLiNER` class and its `from_pretrained` method. The example below uses the `gretelai/gretel-gliner-bi-base-v1.0` model for PII/PHI detection:
+
+ ```python
+ from gliner import GLiNER
+
+ # Load the fine-tuned GLiNER model
+ model = GLiNER.from_pretrained("gretelai/gretel-gliner-bi-base-v1.0")
+
+ # Sample text containing PII/PHI entities
+ text = """
+ """
+
+ # Define the labels for PII/PHI entities
+ labels = [
+     "medical_record_number",
+     "date_of_birth",
+     "ssn",
+     "date",
+     "first_name",
+     "email",
+     "last_name",
+     "customer_id",
+     "employee_id",
+     "name",
+     "street_address",
+     "phone_number",
+     "ipv4",
+     "credit_card_number",
+     "license_plate",
+     "address",
+     "user_name",
+     "device_identifier",
+     "bank_routing_number",
+     "date_time",
+     "company_name",
+     "unique_identifier",
+     "biometric_identifier",
+     "account_number",
+     "city",
+     "certificate_license_number",
+     "time",
+     "postcode",
+     "vehicle_identifier",
+     "coordinate",
+     "country",
+     "api_key",
+     "ipv6",
+     "password",
+     "health_plan_beneficiary_number",
+     "national_id",
+     "tax_id",
+     "url",
+     "state",
+     "swift_bic",
+     "cvv",
+     "pin"
+ ]
+
+ # Predict entities with a confidence threshold of 0.3
+ entities = model.predict_entities(text, labels, threshold=0.3)
+
+ # Display the detected entities
+ for entity in entities:
+     print(f"{entity['text']} => {entity['label']}")
+ ```
+
+ Expected output:
+
+
+ ```
+ John Doe => first_name
+ 123-45-6789 => ssn
+ 2023-04-15 => date
+ MRN-987654321 => medical_record_number
+ [email protected] => email
+ ```
+
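Beyond printing, predictions are often filtered by score and grouped by label before further processing. The sketch below assumes each dictionary returned by `predict_entities` carries a `score` key alongside `text` and `label`; verify this against the installed `gliner` version. The `group_by_label` helper and the sample scores are hypothetical, written by hand so the snippet runs without loading a model.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_label(entities: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Keep entities scoring at least min_score and collect their texts per label."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entity in entities:
        # Entities without a score are treated as score 0.0 and dropped.
        if entity.get("score", 0.0) >= min_score:
            grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


# Hypothetical predictions; the scores are invented for illustration.
entities = [
    {"text": "John Doe", "label": "first_name", "score": 0.91},
    {"text": "123-45-6789", "label": "ssn", "score": 0.97},
    {"text": "2023-04-15", "label": "date", "score": 0.42},
]

print(group_by_label(entities, min_score=0.5))
```

A stricter `min_score` trades recall for precision, so it is worth tuning on a labeled sample rather than fixing it up front.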
+ ## Use Cases
+
+ Gretel GLiNER is ideal for applications requiring precise detection and redaction of sensitive information:
+
+ - Healthcare: Automating the extraction and redaction of patient information from medical records.
+ - Finance: Identifying and securing financial data such as account numbers and transaction details.
+ - Cybersecurity: Detecting sensitive information in logs and security reports.
+ - Legal: Processing contracts and legal documents to protect client information.
+ - Data Privacy Compliance: Ensuring data handling processes adhere to regulations like GDPR and HIPAA by accurately identifying PII/PHI.
+
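For the redaction use cases listed above, a minimal sketch of turning predictions into masked text might look like the following. It assumes each entity dictionary includes `start` and `end` character offsets (end exclusive) and that spans do not overlap; both are assumptions to check against the actual `predict_entities` output. The `redact` helper, the sample sentence, and the offsets are hypothetical.

```python
from typing import List


def redact(text: str, entities: List[dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    redacted = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{entity['label'].upper()}]"
        redacted = redacted[: entity["start"]] + placeholder + redacted[entity["end"]:]
    return redacted


# Hypothetical input and predictions, written by hand for illustration.
text = "Patient John Doe, SSN 123-45-6789."
entities = [
    {"start": 8, "end": 16, "label": "name"},
    {"start": 22, "end": 33, "label": "ssn"},
]

print(redact(text, entities))
```

Replacing from the end of the string backwards avoids having to recompute offsets after each substitution changes the text length.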
+ ## Citation and Usage
+
+ If you use this dataset in your research or applications, please cite it as:
+
+ ```bibtex
+ @dataset{gretel-pii-masking-en-v1,
+   author = {Gretel AI},
+   title = {GLiNER Models for PII Detection through Fine-Tuning on Gretel-Generated Synthetic Documents},
+   year = {2024},
+   month = {10},
+   publisher = {Gretel},
+   howpublished = {https://huggingface.co/gretelai/gretel-pii-masking-en-v1}
+ }
+ ```
+
+ For questions, issues, or additional information, please visit our [Synthetic Data Discord](https://gretel.ai/discord) community or reach out to [gretel.ai](https://gretel.ai/).