metadata
license: apache-2.0
language:
  - en
library_name: gliner
datasets:
  - gretelai/gretel-pii-masking-en-v1
pipeline_tag: token-classification
tags:
  - PII
  - PHI
  - GLiNER
  - information extraction
  - encoder
  - entity recognition
  - privacy

Gretel GLiNER: Fine-Tuned Models for PII/PHI Detection

This Gretel GLiNER model is a fine-tuned version of the GLiNER base model knowledgator/gliner-bi-large-v1.0, trained to detect Personally Identifiable Information (PII) and Protected Health Information (PHI). It is intended to support privacy-compliant entity recognition across industries and document types. For more information about the base GLiNER model, including its architecture and general capabilities, please refer to the GLiNER Model Card.

The model was fine-tuned on the gretelai/gretel-pii-masking-en-v1 dataset, which provides a rich and diverse collection of synthetic document snippets containing PII and PHI entities.

  1. Training: The model was trained on the training split of the synthetic dataset.
  2. Validation: Performance on the validation split was monitored to adjust training parameters.
  3. Evaluation: Final performance was assessed on the test split, using its PII/PHI entity annotations as ground truth.

For detailed statistics on the dataset, including domain and entity type distributions, visit the dataset documentation on Hugging Face.

Model Performance

All fine-tuned Gretel GLiNER models demonstrate substantial improvements over their base counterparts in accuracy, precision, recall, and F1 score:

| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| gretelai/gretel-gliner-bi-small-v1.0 | 0.89 | 0.98 | 0.91 | 0.94 |
| gretelai/gretel-gliner-bi-base-v1.0 | 0.91 | 0.98 | 0.92 | 0.95 |
| gretelai/gretel-gliner-bi-large-v1.0 | 0.91 | 0.99 | 0.93 | 0.95 |
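For readers who want to compute comparable scores on their own annotated data, the following is a minimal sketch of entity-level precision, recall, and F1. It assumes exact matching on (start, end, label) spans, which may differ from the scoring protocol behind the reported numbers; the helper `span_prf` and the sample spans are hypothetical, and accuracy is not covered here.

```python
from typing import Iterable, Set, Tuple

# A span is (start_offset, end_offset, label).
# Exact-match scoring is an assumption, not the documented evaluation protocol.
Span = Tuple[int, int, str]


def span_prf(gold: Iterable[Span], predicted: Iterable[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) under exact span-and-label matching."""
    gold_set: Set[Span] = set(gold)
    pred_set: Set[Span] = set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical annotations for a single document.
gold = [(0, 8, "name"), (15, 26, "ssn"), (40, 50, "date")]
predicted = [(0, 8, "name"), (15, 26, "ssn"), (60, 70, "email")]

precision, recall, f1 = span_prf(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Two of the three predicted spans match the gold annotations, so all three scores come out at roughly 0.67 in this toy case.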

Installation & Usage

Ensure you have Python installed. Then, install or update the gliner package:

pip install gliner -U

Load the fine-tuned Gretel GLiNER model using the GLiNER class and the from_pretrained method. Below is an example using the gretelai/gretel-gliner-bi-large-v1.0 model for PII/PHI detection:

from gliner import GLiNER

# Load the fine-tuned GLiNER model
model = GLiNER.from_pretrained("gretelai/gretel-gliner-bi-large-v1.0")

# Sample text containing PII/PHI entities
text = """
"""

# Define the labels for PII/PHI entities
labels = [
    "medical_record_number",
    "date_of_birth",
    "ssn",
    "date",
    "first_name",
    "email",
    "last_name",
    "customer_id",
    "employee_id",
    "name",
    "street_address",
    "phone_number",
    "ipv4",
    "credit_card_number",
    "license_plate",
    "address",
    "user_name",
    "device_identifier",
    "bank_routing_number",
    "date_time",
    "company_name",
    "unique_identifier",
    "biometric_identifier",
    "account_number",
    "city",
    "certificate_license_number",
    "time",
    "postcode",
    "vehicle_identifier",
    "coordinate",
    "country",
    "api_key",
    "ipv6",
    "password",
    "health_plan_beneficiary_number",
    "national_id",
    "tax_id",
    "url",
    "state",
    "swift_bic",
    "cvv",
    "pin"
]

# Predict entities with a confidence threshold of 0.3
entities = model.predict_entities(text, labels, threshold=0.3)

# Display the detected entities
for entity in entities:
    print(f"{entity['text']} => {entity['label']}")

Expected Output:

John Doe => first_name
123-45-6789 => ssn
2023-04-15 => date
MRN-987654321 => medical_record_number
[email protected] => email
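The threshold of 0.3 used above is fairly permissive, so it can be useful to filter the results again with a stricter cutoff and collect them per label. The sketch below assumes each entity is a dict with "text", "label", and "score" keys, as in the loop above; the helper `group_by_label` and the sample predictions are hypothetical and do not come from running the model.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_label(entities: List[dict], min_score: float = 0.0) -> Dict[str, List[str]]:
    """Group entity texts by label, keeping only entities scoring at least min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        # Entities without a score are kept (treated as fully confident).
        if entity.get("score", 1.0) >= min_score:
            grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


# Hypothetical predictions, shaped like the dicts printed in the loop above.
sample_entities = [
    {"text": "John Doe", "label": "name", "score": 0.93},
    {"text": "123-45-6789", "label": "ssn", "score": 0.88},
    {"text": "2023-04-15", "label": "date", "score": 0.41},
]

grouped = group_by_label(sample_entities, min_score=0.5)
print(grouped)
```

Raising `min_score` generally trades recall for precision, so a suitable value depends on whether missed entities or false alarms are more costly in a given application.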

Use Cases

Gretel GLiNER is ideal for applications requiring precise detection and redaction of sensitive information:

  • Healthcare: Automating the extraction and redaction of patient information from medical records.
  • Finance: Identifying and securing financial data such as account numbers and transaction details.
  • Cybersecurity: Detecting sensitive information in logs and security reports.
  • Legal: Processing contracts and legal documents to protect client information.
  • Data Privacy Compliance: Ensuring data handling processes adhere to regulations like GDPR and HIPAA by accurately identifying PII/PHI.
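As a rough illustration of the redaction use cases above, detected entities can be replaced with label placeholders using their character offsets. This sketch assumes each entity dict carries "start" and "end" offsets alongside "label", and that spans do not overlap; the `redact` and `make_entity` helpers and the sample text are hypothetical, and the entities are built by hand rather than produced by the model.

```python
from typing import List


def redact(text: str, entities: List[dict]) -> str:
    """Replace each entity span with a [LABEL] placeholder (spans assumed non-overlapping)."""
    redacted = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{entity['label'].upper()}]"
        redacted = redacted[: entity["start"]] + placeholder + redacted[entity["end"]:]
    return redacted


def make_entity(text: str, value: str, label: str) -> dict:
    """Build a hand-made entity dict for the first occurrence of value in text."""
    start = text.index(value)
    return {"start": start, "end": start + len(value), "text": value, "label": label}


# Hypothetical input and hand-built entities standing in for model predictions.
sample_text = "Patient John Doe, SSN 123-45-6789, was seen on 2023-04-15."
detected = [
    make_entity(sample_text, "John Doe", "name"),
    make_entity(sample_text, "123-45-6789", "ssn"),
    make_entity(sample_text, "2023-04-15", "date"),
]

redacted_text = redact(sample_text, detected)
print(redacted_text)
```

In a real pipeline, `detected` would be the list returned by the model's entity prediction call, and overlapping spans would need to be resolved before redaction.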

Citation and Usage

If you use this model or its training dataset in your research or applications, please cite it as:

@dataset{gretel-pii-masking-en-v1,
  author       = {Gretel AI},
  title        = {GLiNER Models for PII Detection through Fine-Tuning on Gretel-Generated Synthetic Documents},
  year         = {2024},
  month        = {10},
  publisher    = {Gretel},
  howpublished = {https://huggingface.co/gretelai/gretel-pii-masking-en-v1}
}

For questions, issues, or additional information, please visit our Synthetic Data Discord community or reach out to gretel.ai.