metadata
license: apache-2.0
language:
  - en
library_name: gliner
datasets:
  - gretelai/gretel-pii-masking-en-v1
pipeline_tag: token-classification
tags:
  - PII
  - PHI
  - GLiNER
  - information extraction
  - encoder
  - entity recognition
  - privacy

Gretel GLiNER: Fine-Tuned Models for PII/PHI Detection

Gretel GLiNER is a specialized version of the GLiNER model, fine-tuned to detect Personally Identifiable Information (PII) and Protected Health Information (PHI). It supports privacy-compliant entity recognition across industries and document types. We fine-tuned several GLiNER base models on a synthetic PII/PHI dataset to create the specialized models listed below.

  • Fine-Tuned Models:
    • gretelai/gretel-gliner-bi-small-v1.0
    • gretelai/gretel-gliner-bi-base-v1.0
    • gretelai/gretel-gliner-bi-large-v1.0

For more information about the base GLiNER model, including its architecture and general capabilities, please refer to the GLiNER Model Card.

The models were fine-tuned on the gretelai/gretel-pii-masking-en-v1 dataset, a diverse collection of synthetic document snippets containing PII and PHI entities. The dataset's three splits were used as follows:

  1. Training: Utilized the training split of the synthetic dataset.
  2. Validation: Monitored performance using the validation set to adjust training parameters.
  3. Evaluation: Assessed final performance on the test set using PII/PHI entity annotations as ground truth.

For detailed statistics on the dataset, including domain and entity type distributions, visit the dataset documentation on Hugging Face.

Model Performance

The fine-tuned Gretel GLiNER models demonstrate substantial improvements over their base counterparts in accuracy, precision, recall, and F1 score:

| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| gretelai/gretel-gliner-bi-small-v1.0 | 0.89 | 0.98 | 0.91 | 0.94 |
| gretelai/gretel-gliner-bi-base-v1.0 | 0.91 | 0.98 | 0.92 | 0.95 |
| gretelai/gretel-gliner-bi-large-v1.0 | 0.91 | 0.99 | 0.93 | 0.95 |
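
As a quick sanity check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values reported above. The helper name `f1_score` is illustrative and not part of the gliner package.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded (precision, recall) pairs copied from the table above
reported = {
    "gretelai/gretel-gliner-bi-small-v1.0": (0.98, 0.91),
    "gretelai/gretel-gliner-bi-base-v1.0": (0.98, 0.92),
    "gretelai/gretel-gliner-bi-large-v1.0": (0.99, 0.93),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}")
```

Because the inputs are already rounded to two decimals, a recomputed F1 can differ from the reported one in the last digit. For the large model, for instance, the recomputed value rounds to 0.96 rather than 0.95, which is still consistent with unrounded inputs slightly below 0.99 and 0.93.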

Installation & Usage

Ensure you have Python installed. Then, install or update the gliner package:

pip install gliner -U

Load the fine-tuned Gretel GLiNER model using the GLiNER class and the from_pretrained method. Below is an example using the gretelai/gretel-gliner-bi-base-v1.0 model for PII/PHI detection:

from gliner import GLiNER

# Load the fine-tuned GLiNER model
model = GLiNER.from_pretrained("gretelai/gretel-gliner-bi-base-v1.0")

# Text to scan for PII/PHI entities; place your document between the triple quotes
text = """
"""

# Define the labels for PII/PHI entities
labels = [
    "medical_record_number",
    "date_of_birth",
    "ssn",
    "date",
    "first_name",
    "email",
    "last_name",
    "customer_id",
    "employee_id",
    "name",
    "street_address",
    "phone_number",
    "ipv4",
    "credit_card_number",
    "license_plate",
    "address",
    "user_name",
    "device_identifier",
    "bank_routing_number",
    "date_time",
    "company_name",
    "unique_identifier",
    "biometric_identifier",
    "account_number",
    "city",
    "certificate_license_number",
    "time",
    "postcode",
    "vehicle_identifier",
    "coordinate",
    "country",
    "api_key",
    "ipv6",
    "password",
    "health_plan_beneficiary_number",
    "national_id",
    "tax_id",
    "url",
    "state",
    "swift_bic",
    "cvv",
    "pin"
]

# Predict entities with a confidence threshold of 0.3
entities = model.predict_entities(text, labels, threshold=0.3)

# Display the detected entities
for entity in entities:
    print(f"{entity['text']} => {entity['label']}")

Expected Output:

John Doe => first_name
123-45-6789 => ssn
2023-04-15 => date
MRN-987654321 => medical_record_number
[email protected] => email
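
Each item returned by predict_entities is a dictionary. Besides the text and label keys printed above, gliner also reports start and end character offsets and a score. The following is a minimal sketch of post-processing under that layout: it drops low-confidence predictions and groups the rest by label. The helper `group_by_label` and the `entities` list are hypothetical; the list is hand-written with made-up scores so the snippet runs without downloading a model.

```python
from collections import defaultdict


def group_by_label(entities, min_score=0.5):
    """Group entity texts by label, keeping only predictions at or above min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


# Hand-written stand-in for the output of model.predict_entities (scores are made up)
entities = [
    {"start": 8, "end": 16, "text": "John Doe", "label": "first_name", "score": 0.97},
    {"start": 22, "end": 33, "text": "123-45-6789", "label": "ssn", "score": 0.95},
    {"start": 43, "end": 53, "text": "2023-04-15", "label": "date", "score": 0.42},
]

for label, texts in group_by_label(entities, min_score=0.5).items():
    print(f"{label}: {texts}")
```

Filtering after prediction lets you call the model once with a permissive threshold (such as the 0.3 used above) and then tighten the cutoff per use case without rerunning inference.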

Use Cases

Gretel GLiNER is ideal for applications requiring precise detection and redaction of sensitive information:

  • Healthcare: Automating the extraction and redaction of patient information from medical records.
  • Finance: Identifying and securing financial data such as account numbers and transaction details.
  • Cybersecurity: Detecting sensitive information in logs and security reports.
  • Legal: Processing contracts and legal documents to protect client information.
  • Data Privacy Compliance: Ensuring data handling processes adhere to regulations like GDPR and HIPAA by accurately identifying PII/PHI.
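
The redaction step mentioned above can be sketched as follows, assuming each detected entity carries the start and end character offsets that gliner returns and that spans do not overlap. The `redact` helper, the sample sentence, and the hand-written `entities` list are hypothetical stand-ins for real model output.

```python
def redact(text, entities):
    """Replace each detected span with a [LABEL] placeholder."""
    redacted = text
    # Work from the end of the text so that earlier offsets stay valid
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{entity['label'].upper()}]"
        redacted = redacted[:entity["start"]] + placeholder + redacted[entity["end"]:]
    return redacted


# Illustrative input; in practice `entities` would come from model.predict_entities
text = "Patient John Doe, SSN 123-45-6789, seen on 2023-04-15."
entities = [
    {"start": 8, "end": 16, "text": "John Doe", "label": "first_name"},
    {"start": 22, "end": 33, "text": "123-45-6789", "label": "ssn"},
    {"start": 43, "end": 53, "text": "2023-04-15", "label": "date"},
]

print(redact(text, entities))
```

Replacing spans from right to left avoids recomputing offsets after each substitution. Overlapping spans would need to be merged or deduplicated before this step.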

Citation and Usage

If you use this dataset in your research or applications, please cite it as:

@dataset{gretel-pii-masking-en-v1,
  author       = {Gretel AI},
  title        = {GLiNER Models for PII Detection through Fine-Tuning on Gretel-Generated Synthetic Documents},
  year         = {2024},
  month        = {10},
  publisher    = {Gretel},
  howpublished = {https://huggingface.co/gretelai/gretel-pii-masking-en-v1}
}

For questions, issues, or additional information, please visit our Synthetic Data Discord community or reach out to gretel.ai.