---
language:
- fr
pipeline_tag: token-classification
tags:
- medical
- ner
- nlp
- pseudonymisation
license: bsd-3-clause
library_name: edsnlp
model-index:
- name: AP-HP/eds-pseudo-public
  results:
  - task:
      type: token-classification
    dataset:
      name: AP-HP Pseudo Test
      type: private
    metrics:
    - type: precision
      name: Token Scores / ADRESSE / Precision
      value: 0.981694715087097
    - type: recall
      name: Token Scores / ADRESSE / Recall
      value: 0.9693877551020401
    - type: f1
      name: Token Scores / ADRESSE / F1
      value: 0.975502420419539
    - type: recall
      name: Token Scores / ADRESSE / Redact
      value: 0.9763848396501451
    - type: accuracy
      name: Token Scores / ADRESSE / Redact Full
      value: 0.9665697674418601
    - type: precision
      name: Token Scores / DATE / Precision
      value: 0.9899177066870131
    - type: recall
      name: Token Scores / DATE / Recall
      value: 0.984285249810339
    - type: f1
      name: Token Scores / DATE / F1
      value: 0.9870934434692821
    - type: recall
      name: Token Scores / DATE / Redact
      value: 0.9884035981359051
    - type: accuracy
      name: Token Scores / DATE / Redact Full
      value: 0.859011627906976
    - type: precision
      name: Token Scores / DATE_NAISSANCE / Precision
      value: 0.9753867791842471
    - type: recall
      name: Token Scores / DATE_NAISSANCE / Recall
      value: 0.968913726859937
    - type: f1
      name: Token Scores / DATE_NAISSANCE / F1
      value: 0.972139477834238
    - type: recall
      name: Token Scores / DATE_NAISSANCE / Redact
      value: 0.9933636046105481
    - type: accuracy
      name: Token Scores / DATE_NAISSANCE / Redact Full
      value: 0.9941860465116271
    - type: precision
      name: Token Scores / IPP / Precision
      value: 0.918987341772151
    - type: recall
      name: Token Scores / IPP / Recall
      value: 0.9075000000000001
    - type: f1
      name: Token Scores / IPP / F1
      value: 0.9132075471698111
    - type: recall
      name: Token Scores / IPP / Redact
      value: 0.985
    - type: accuracy
      name: Token Scores / IPP / Redact Full
      value: 0.9927325581395341
    - type: precision
      name: Token Scores / MAIL / Precision
      value: 0.9609144542772861
    - type: recall
      name: Token Scores / MAIL / Recall
      value: 0.9977029096477791
    - type: f1
      name: Token Scores / MAIL / F1
      value: 0.978963185574755
    - type: recall
      name: Token Scores / MAIL / Redact
      value: 0.9977029096477791
    - type: accuracy
      name: Token Scores / MAIL / Redact Full
      value: 0.9970930232558141
    - type: precision
      name: Token Scores / NDA / Precision
      value: 0.921428571428571
    - type: recall
      name: Token Scores / NDA / Recall
      value: 0.834951456310679
    - type: f1
      name: Token Scores / NDA / F1
      value: 0.8760611205432931
    - type: recall
      name: Token Scores / NDA / Redact
      value: 0.87378640776699
    - type: accuracy
      name: Token Scores / NDA / Redact Full
      value: 0.9723837209302321
    - type: precision
      name: Token Scores / NOM / Precision
      value: 0.9439770896724531
    - type: recall
      name: Token Scores / NOM / Recall
      value: 0.9525013545241101
    - type: f1
      name: Token Scores / NOM / F1
      value: 0.948220064724919
    - type: recall
      name: Token Scores / NOM / Redact
      value: 0.981578472096803
    - type: accuracy
      name: Token Scores / NOM / Redact Full
      value: 0.895348837209302
    - type: precision
      name: Token Scores / PRENOM / Precision
      value: 0.9348837209302321
    - type: recall
      name: Token Scores / PRENOM / Recall
      value: 0.9663461538461531
    - type: f1
      name: Token Scores / PRENOM / F1
      value: 0.950354609929078
    - type: recall
      name: Token Scores / PRENOM / Redact
      value: 0.99002849002849
    - type: accuracy
      name: Token Scores / PRENOM / Redact Full
      value: 0.9316860465116271
    - type: precision
      name: Token Scores / SECU / Precision
      value: 0.882838283828382
    - type: recall
      name: Token Scores / SECU / Recall
      value: 1
    - type: f1
      name: Token Scores / SECU / F1
      value: 0.9377738825591581
    - type: recall
      name: Token Scores / SECU / Redact
      value: 1
    - type: accuracy
      name: Token Scores / SECU / Redact Full
      value: 1.0
    - type: precision
      name: Token Scores / TEL / Precision
      value: 0.9746407438715131
    - type: recall
      name: Token Scores / TEL / Recall
      value: 0.9993932564791541
    - type: f1
      name: Token Scores / TEL / F1
      value: 0.9868618136688491
    - type: recall
      name: Token Scores / TEL / Redact
      value: 0.999479934124989
    - type: accuracy
      name: Token Scores / TEL / Redact Full
      value: 0.99563953488372
    - type: precision
      name: Token Scores / VILLE / Precision
      value: 0.96684350132626
    - type: recall
      name: Token Scores / VILLE / Recall
      value: 0.9376205787781351
    - type: f1
      name: Token Scores / VILLE / F1
      value: 0.9520078354554351
    - type: recall
      name: Token Scores / VILLE / Redact
      value: 0.9511254019292601
    - type: accuracy
      name: Token Scores / VILLE / Redact Full
      value: 0.9113372093023251
    - type: precision
      name: Token Scores / ZIP / Precision
      value: 0.9675036927621861
    - type: recall
      name: Token Scores / ZIP / Recall
      value: 1
    - type: f1
      name: Token Scores / ZIP / F1
      value: 0.983483483483483
    - type: recall
      name: Token Scores / ZIP / Redact
      value: 1
    - type: accuracy
      name: Token Scores / ZIP / Redact Full
      value: 1.0
    - type: precision
      name: Token Scores / micro / Precision
      value: 0.970393736698084
    - type: recall
      name: Token Scores / micro / Recall
      value: 0.9783320880510371
    - type: f1
      name: Token Scores / micro / F1
      value: 0.9743467434960551
    - type: recall
      name: Token Scores / micro / Redact
      value: 0.9884667701208881
    - type: accuracy
      name: Token Scores / micro / Redact Full
      value: 0.6308139534883721
extra_gated_fields:
  Organisation: text
  Intended use of the model:
    type: select
    options:
    - NLP Research
    - Education
    - Commercial Product
    - Clinical Data Warehouse
    - label: Other
      value: other
---

<div>

[<img style="display: inline" src="https://img.shields.io/github/actions/workflow/status/aphp/eds-pseudo/tests.yml?branch=main&label=tests&style=flat-square" alt="Tests">]()
[<img style="display: inline" src="https://img.shields.io/github/actions/workflow/status/aphp/eds-pseudo/documentation.yml?branch=main&label=docs&style=flat-square" alt="Documentation">](https://aphp.github.io/eds-pseudo/latest/)
[<img style="display: inline" src="https://img.shields.io/codecov/c/github/aphp/eds-pseudo?logo=codecov&style=flat-square" alt="Codecov">](https://codecov.io/gh/aphp/eds-pseudo)
[<img style="display: inline" src="https://img.shields.io/badge/repro-poetry-blue?style=flat-square" alt="Poetry">](https://python-poetry.org)
[<img style="display: inline" src="https://img.shields.io/badge/repro-dvc-blue?style=flat-square" alt="DVC">](https://dvc.org)
[<img style="display: inline" src="https://img.shields.io/badge/demo%20%F0%9F%9A%80-streamlit-purple?style=flat-square" alt="Demo">](https://eds-pseudo-public.streamlit.app/)

</div>

# EDS-Pseudo

This project aims at detecting identifying entities in documents, and was primarily tested
on clinical reports at AP-HP's Clinical Data Warehouse (EDS).

The model is built on top of [edsnlp](https://github.com/aphp/edsnlp), and consists of a
hybrid model (rule-based + deep learning) for which we provide
rules ([`eds_pseudo/pipes`](https://github.com/aphp/eds-pseudo/tree/main/eds_pseudo/pipes))
and a training recipe ([`train.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/train.py)).

We also provide some fictitious
templates ([`templates.txt`](https://github.com/aphp/eds-pseudo/blob/main/data/templates.txt)) and a script to
generate a synthetic
dataset ([`generate_dataset.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/generate_dataset.py)).

The entities that are detected are listed below.

| Label            | Description                                                   |
|------------------|---------------------------------------------------------------|
| `ADRESSE`        | Street address, e.g. `33 boulevard de Picpus`                 |
| `DATE`           | Any absolute date other than a birthdate                      |
| `DATE_NAISSANCE` | Birthdate                                                     |
| `HOPITAL`        | Hospital name, e.g. `Hôpital Rothschild`                      |
| `IPP`            | Internal AP-HP identifier for patients, displayed as a number |
| `MAIL`           | Email address                                                 |
| `NDA`            | Internal AP-HP identifier for visits, displayed as a number   |
| `NOM`            | Any last name (patients, doctors, third parties)              |
| `PRENOM`         | Any first name (patients, doctors, etc.)                      |
| `SECU`           | Social security number                                        |
| `TEL`            | Any phone number                                              |
| `VILLE`          | Any city                                                      |
| `ZIP`            | Any zip code                                                  |

## Downloading the public pre-trained model

The public pre-trained model is available on the Hugging Face model hub at
[AP-HP/eds-pseudo-public](https://hf.co/AP-HP/eds-pseudo-public) and was trained on synthetic data
(see [`generate_dataset.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/generate_dataset.py)). You can also
test it directly on the **[demo](https://eds-pseudo-public.streamlit.app/)**.

1. Install the latest version of edsnlp:

```shell
pip install "edsnlp[ml]" -U
```

2. Get access to the model at [AP-HP/eds-pseudo-public](https://hf.co/AP-HP/eds-pseudo-public).
3. Create and copy a Hugging Face token with permission **"READ"** at https://huggingface.co/settings/tokens?new_token=true.
4. Register the token (only once) on your machine:

```python
import huggingface_hub

# Replace the placeholder with the token created in step 3
YOUR_TOKEN = "..."

huggingface_hub.login(token=YOUR_TOKEN, new_session=False, add_to_git_credential=True)
```

5. Load the model:

```python
import edsnlp

nlp = edsnlp.load("AP-HP/eds-pseudo-public", auto_update=True)
doc = nlp(
    "En 2015, M. Charles-François-Bienvenu "
    "Myriel était évêque de Digne. C’était un vieillard "
    "d’environ soixante-quinze ans ; il occupait le "
    "siège de Digne depuis 2006."
)

# Print each detected entity with its label and its parsed date (if any)
for ent in doc.ents:
    print(ent, ent.label_, str(ent._.date))
```

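Once entities are detected, a common next step is to mask them in the text. The sketch below is not part of eds-pseudo: `redact` is a hypothetical helper that assumes entities are available as `(start, end, label)` character offsets, which can be built from spaCy-style spans with `[(e.start_char, e.end_char, e.label_) for e in doc.ents]`. The spans are hard-coded here so that the example runs without downloading the model.

```python
from typing import List, Tuple


def redact(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Replace each (start, end, label) character span with a <LABEL> placeholder.

    Hypothetical helper for illustration; not an eds-pseudo function.
    """
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Skip a span that overlaps one already replaced
            continue
        out.append(text[cursor:start])
        out.append(f"<{label}>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "En 2015, M. Myriel était évêque de Digne."
# Offsets written by hand for this sentence: "2015" and "Myriel"
spans = [(3, 7, "DATE"), (12, 18, "NOM")]
print(redact(text, spans))
# En <DATE>, M. <NOM> était évêque de Digne.
```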
To apply the model on many documents using one or more GPUs, refer to the documentation
of [edsnlp](https://aphp.github.io/edsnlp/latest/tutorials/multiple-texts/).

## Metrics

| AP-HP Pseudo Test Token Scores | Precision | Recall |   F1 | Redact | Redact Full |
|:-------------------------------|----------:|-------:|-----:|-------:|------------:|
| ADRESSE                        |      98.2 |   96.9 | 97.6 |   97.6 |        96.7 |
| DATE                           |      99   |   98.4 | 98.7 |   98.8 |        85.9 |
| DATE_NAISSANCE                 |      97.5 |   96.9 | 97.2 |   99.3 |        99.4 |
| IPP                            |      91.9 |   90.8 | 91.3 |   98.5 |        99.3 |
| MAIL                           |      96.1 |   99.8 | 97.9 |   99.8 |        99.7 |
| NDA                            |      92.1 |   83.5 | 87.6 |   87.4 |        97.2 |
| NOM                            |      94.4 |   95.3 | 94.8 |   98.2 |        89.5 |
| PRENOM                         |      93.5 |   96.6 | 95   |   99   |        93.2 |
| SECU                           |      88.3 |  100   | 93.8 |  100   |       100   |
| TEL                            |      97.5 |   99.9 | 98.7 |   99.9 |        99.6 |
| VILLE                          |      96.7 |   93.8 | 95.2 |   95.1 |        91.1 |
| ZIP                            |      96.8 |  100   | 98.3 |  100   |       100   |
| micro                          |      97   |   97.8 | 97.4 |   98.8 |        63.1 |

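As a quick consistency check, the F1 column can be recomputed from the Precision and Recall columns as their harmonic mean. The sketch below uses the unrounded ADRESSE precision and recall from the model card metadata; `f1_score` is an illustrative helper, not the project's evaluation code, and it says nothing about how the Redact columns are computed.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Unrounded ADRESSE precision and recall from the model card metadata
adresse_f1 = f1_score(0.981694715087097, 0.9693877551020401)
print(round(100 * adresse_f1, 1))  # 97.6, as in the ADRESSE row
```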
## Installation to reproduce

If you'd like to reproduce eds-pseudo's training or contribute to its development, you should first clone it:

```shell
git clone https://github.com/aphp/eds-pseudo.git
cd eds-pseudo
```

Then install the dependencies. We recommend pinning the library version in your projects, or using a strict package manager
like [Poetry](https://python-poetry.org/).

```shell
poetry install
```

## How to use without machine learning

```python
import edsnlp

nlp = edsnlp.blank("eds")

# Some text cleaning
nlp.add_pipe("eds.normalizer")

# Various simple rules
nlp.add_pipe(
    "eds_pseudo.simple_rules",
    config={"pattern_keys": ["TEL", "MAIL", "SECU", "PERSON"]},
)

# Address detection
nlp.add_pipe("eds_pseudo.addresses")

# Date detection
nlp.add_pipe("eds_pseudo.dates")

# Contextual rules (requires a dict of info about the patient)
nlp.add_pipe("eds_pseudo.context")

# Apply it to a text
doc = nlp(
    "En 2015, M. Charles-François-Bienvenu "
    "Myriel était évêque de Digne. C’était un vieillard "
    "d’environ soixante-quinze ans ; il occupait le "
    "siège de Digne depuis 2006."
)

for ent in doc.ents:
    print(ent, ent.label_)

# 2015 DATE
# Charles-François-Bienvenu NOM
# Myriel PRENOM
# 2006 DATE
```

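To give an intuition of what a rule-based detector does, here is a minimal standalone sketch using regular expressions. The two patterns are deliberately simplified illustrations written for this example; they are not the patterns shipped in `eds_pseudo`, and `find_entities` is a hypothetical helper. Only the `MAIL` and `TEL` label names come from the entity table above.

```python
import re

# Illustrative patterns only; NOT the rules used by eds_pseudo
PATTERNS = {
    # Something that looks like an email address
    "MAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # Ten digits starting with 0, optionally grouped by pairs with spaces or dots
    "TEL": re.compile(r"(?<!\d)0\d(?:[ .]?\d{2}){4}(?!\d)"),
}


def find_entities(text):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), label, match.group()))
    return sorted(found)


text = "Contact : jean.dupont@example.org ou 01 23 45 67 89."
for start, end, label, value in find_entities(text):
    print(value, label)

# jean.dupont@example.org MAIL
# 01 23 45 67 89 TEL
```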
## How to train

Before training a model, you should update the
[configs/config.cfg](https://github.com/aphp/eds-pseudo/blob/main/configs/config.cfg) and
[pyproject.toml](https://github.com/aphp/eds-pseudo/blob/main/pyproject.toml) files to
fit your needs.

Put your data in the `data/dataset` folder (or edit the paths in the `configs/config.cfg` file to point
to `data/gen_dataset/train.jsonl`).

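Before launching a training run, it can help to check that a `.jsonl` file is well formed, i.e. that every non-empty line parses as JSON. The sketch below is a generic check and not part of eds-pseudo: `check_jsonl` is a hypothetical helper, it assumes each record is a JSON object, and it does not validate the fields the training script expects. The sample file and its `id` key are placeholders created only for the demonstration.

```python
import json
import tempfile
from pathlib import Path


def check_jsonl(path):
    """Count the records of a JSONL file, raising ValueError on the first bad line."""
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # Ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
            if not isinstance(record, dict):
                # Assumption: one JSON object per line
                raise ValueError(f"{path}:{line_no}: expected a JSON object")
            count += 1
    return count


# Demonstration on a throwaway file with placeholder records
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "train.jsonl"
    sample.write_text('{"id": 1}\n{"id": 2}\n', encoding="utf-8")
    print(check_jsonl(sample))  # prints 2
```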
Then, run the training script:

```shell
python scripts/train.py --config configs/config.cfg --seed 43
```

This will train a model and save it in `artifacts/model-last`. You can evaluate it on the test set (defaults
to `data/dataset/test.jsonl`) with:

```shell
python scripts/evaluate.py --config configs/config.cfg
```

To package it, run:

```shell
python scripts/package.py
```

This will create a `dist/eds-pseudo-aphp-***.whl` file that you can install with `pip install dist/eds-pseudo-aphp-***.whl`.

You can then use it in your code:

```python
import edsnlp

# Either from the model path directly
nlp = edsnlp.load("artifacts/model-last")

# Or from the installed wheel
import eds_pseudo_aphp

nlp = eds_pseudo_aphp.load()
```

## Documentation

Visit the [documentation](https://aphp.github.io/eds-pseudo/) for more information!

## Publication

Please find our publication at the following link: https://doi.org/mkfv.

If you use EDS-Pseudo, please cite us as below:

```
@article{eds_pseudo,
  title={Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse},
  author={Tannier, Xavier and Wajsb{\"u}rt, Perceval and Calliger, Alice and Dura, Basile and Mouchet, Alexandre and Hilka, Martin and Bey, Romain},
  journal={Methods of Information in Medicine},
  year={2024},
  publisher={Georg Thieme Verlag KG}
}
```

## Acknowledgement

We would like to thank [Assistance Publique – Hôpitaux de Paris](https://www.aphp.fr/)
and [AP-HP Foundation](https://fondationrechercheaphp.fr/) for funding this project.