Update README.md
README.md
CHANGED
This model was trained by fine-tuning `urchade/gliner_multi_pii-v1` on a synthetic dataset covering PII in the domains `healthcare`, `finance`, `legal`, `banking`, and `general`.

This model is capable of recognizing various types of *personally identifiable information* (PII), including but not limited to these entity types: `person`, `organization`, `phone number`, `address`, `passport number`, `email`, `credit card number`, `social security number`, `health insurance id number`, `date of birth`, `mobile phone number`, `bank account number`, `medication`, `cpf`, `driver's license number`, `tax identification number`, `medical condition`, `identity card number`, `national id number`, `ip address`, `email address`, `iban`, `credit card expiration date`, `username`, `health insurance number`, `registration number`, `student id number`, `insurance number`, `flight number`, `landline phone number`, `blood type`, `cvv`, `reservation number`, `digital signature`, `social media handle`, `license plate number`, `cnpj`, `postal code`, `serial number`, `vehicle registration number`, `credit card brand`, `fax number`, `visa number`, `insurance company`, `identity document number`, `transaction number`, `national health insurance number`, `cvc`, `birth certificate number`, `train ticket number`, and `passport expiration date`.

## Usage

The model requires the [GLiNER](https://github.com/urchade/GLiNER) library. Once the library is installed, load the model and use it to extract entities from text.

```bash
pip install gliner
```

The following examples show the intended use.

### Extract entities from English medical text

```python
from gliner import GLiNER

# initialize the GLiNER using this model
model = GLiNER.from_pretrained("E3-JSI/gliner-multi-pii-domains-v1")

# prepare the text for entity extraction
text = """
Medical Record

[...]
15-11-2024
"""

# prepare the labels/entities to be extracted
# this model should work best when entity types are in lowercase
labels = ["name", "social security number", "date of birth", "date"]

# perform entity extraction
entities = model.predict_entities(text, labels, threshold=0.5)

# display predicted entities and their labels
for entity in entities:
    print(entity["text"], "=>", entity["label"])
```

**Expected output**

```text
John Doe => name
15-01-1985 => date of birth
[...]
15-11-2024 => date
```
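
A common next step after extraction is masking the detected spans. The sketch below assumes that each predicted entity is a dictionary carrying `start` and `end` character offsets (end exclusive) alongside `text` and `label`; this is the shape GLiNER predictions are understood to have, but verify it against the installed version. The `redact` helper and the sample data are hypothetical and not part of this model card.

```python
def redact(text, entities):
    """Replace each entity span with its upper-cased label in square brackets.

    Assumption: every entity dict has integer "start"/"end" character
    offsets (end exclusive) and a "label" string. Hypothetical helper.
    """
    redacted = text
    # Work from the end of the text backwards so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = "[" + entity["label"].upper() + "]"
        redacted = redacted[: entity["start"]] + placeholder + redacted[entity["end"]:]
    return redacted


# Hand-written sample standing in for the output of model.predict_entities(...).
sample_text = "Patient: John Doe, born 15-01-1985."
sample_entities = [
    {"start": 9, "end": 17, "text": "John Doe", "label": "name"},
    {"start": 24, "end": 34, "text": "15-01-1985", "label": "date of birth"},
]

print(redact(sample_text, sample_entities))
# prints: Patient: [NAME], born [DATE OF BIRTH].
```

This sketch does not handle overlapping spans; if two predictions overlap, resolve them (for example by keeping the higher-scoring one) before redacting.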

### Extract entities from Dutch medical text

```python
from gliner import GLiNER

# initialize the GLiNER using this model
model = GLiNER.from_pretrained("E3-JSI/gliner-multi-pii-domains-v1")

# prepare the text for entity extraction
text = """
Medisch dossier

[...]
15-11-2024
"""

# prepare the labels/entities to be extracted
# this model should work best when entity types are in lowercase
labels = ["naam", "bmurgerservicenummer", "geboortedatum", "datum"]

# perform entity extraction
entities = model.predict_entities(text, labels, threshold=0.2)

# display predicted entities and their labels
for entity in entities:
    print(entity["text"], "=>", entity["label"])
```

**Expected output**

```text
Jan de Vries => naam
15-01-1985 => geboortedatum
[...]
987-65-4321 => bmurgerservicenummer
Jan de Vries => naam
15-11-2024 => datum
```
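
The Dutch example uses a lower threshold (0.2) than the English one (0.5), and the same value can appear more than once in the output. One possible way to post-process such predictions is sketched below: keep only entities at or above a minimum score and collapse repeated (text, label) pairs. It assumes each entity dictionary carries a `score` float, as GLiNER predictions are understood to do; the helper name `unique_entities`, the scores, and the sample data are made up for illustration.

```python
def unique_entities(entities, min_score=0.0):
    """Return (text, label) pairs scoring at least min_score, without repeats.

    Order of first appearance is preserved. Hypothetical helper; entities
    without a "score" key are treated as having score 0.0.
    """
    seen = set()
    result = []
    for entity in entities:
        if entity.get("score", 0.0) < min_score:
            continue
        key = (entity["text"], entity["label"])
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


# Hand-written sample; the scores are invented, not real model output.
sample_entities = [
    {"text": "Jan de Vries", "label": "naam", "score": 0.91},
    {"text": "15-01-1985", "label": "geboortedatum", "score": 0.84},
    {"text": "Jan de Vries", "label": "naam", "score": 0.77},
    {"text": "15-11-2024", "label": "datum", "score": 0.25},
]

for text, label in unique_entities(sample_entities, min_score=0.5):
    print(text, "=>", label)
```

Filtering after prediction lets one low-threshold call serve several downstream uses with different precision requirements.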

## Acknowledgements

Funded by the European Union. UK participants in Horizon Europe Project PREPARE are supported by UKRI grant number 10086219 (Trilateral Research). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or European Health and Digital Executive Agency (HADEA) or UKRI. Neither the European Union nor the granting authority nor UKRI can be held responsible for them. Grant Agreement 101080288 PREPARE HORIZON-HLTH-2022-TOOL-12-01.