---
pipeline_tag: fill-mask
tags:
- medical
- clinical
datasets:
- rntc/biomed-fr
---

<a href="https://camembert-model.fr">
<img width="300px" src="https://camembert-model.fr/authors/admin/avatar_huac8a9374dbd7d6a2cb77224540858ab4_463389_270x270_fill_lanczos_center_3.png">
</a>

# CamemBERT-bio: a Tasty French Language Model Better for your Health

CamemBERT-bio is a state-of-the-art French biomedical language model built by continual pre-training of [camembert-base](https://huggingface.co/camembert-base).
It was trained on a public French biomedical corpus of 413M words containing scientific documents, drug leaflets, and clinical cases extracted from theses and articles.
It achieves an average improvement of 2.54 points of F1 score over [camembert-base](https://huggingface.co/camembert-base) on 5 different biomedical named entity recognition tasks.
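
Since the model is published for the fill-mask task, it can be tried directly with the `transformers` pipeline. A minimal sketch, assuming the Hub id `almanach/camembert-bio-base` (check this repository's actual name) and a made-up example sentence:

```python
from transformers import pipeline

# Hub id is an assumption; use the actual repository name of this model.
fill_mask = pipeline("fill-mask", model="almanach/camembert-bio-base")

# CamemBERT models use "<mask>" as the mask token.
for pred in fill_mask("Le patient présente une <mask> aiguë."):
    print(f"{pred['token_str']}\t{pred['score']:.3f}")
```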

## Abstract

Clinical data in hospitals are increasingly accessible for research through clinical data warehouses; however, these documents are unstructured. It is therefore necessary to extract information from medical reports to conduct clinical studies. Transfer learning with BERT-like models such as CamemBERT has allowed major advances, especially for named entity recognition. However, these models are trained for plain language and are less effective on biomedical data. This is why we propose a new French public biomedical dataset on which we have continued the pre-training of CamemBERT. Thus, we introduce a first version of CamemBERT-bio, a specialized public model for the French biomedical domain that shows an average improvement of 2.54 points of F1 score on different biomedical named entity recognition tasks.

- **Developed by:** Rian Touchent, Eric Villemonte de La Clergerie
- **License:** MIT

### Model Sources

<!-- Provide the basic links for the model. -->

<!-- - **Website:** camembert-bio-model.fr -->
<!-- - **Paper [optional]:** [More Information Needed] -->
<!-- - **Demo [optional]:** [More Information Needed] -->

## Training Details

### Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

| **Corpus** | **Details**                                                        | **Size (words)** |
|------------|--------------------------------------------------------------------|------------------|
| ISTEX      | Various scientific literature documents indexed on ISTEX           | 276M             |
| CLEAR      | Drug leaflets                                                      | 73M              |
| E3C        | Various documents from journals, drug leaflets, and clinical cases | 64M              |
| Total      |                                                                    | 413M             |
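
The pre-training corpus is declared as `rntc/biomed-fr` in this card's metadata. A minimal sketch for inspecting it with the `datasets` library; the split name and schema are assumptions, so check the dataset card for details:

```python
from datasets import load_dataset

# The "train" split is an assumption; see the dataset card for the actual configuration.
corpus = load_dataset("rntc/biomed-fr", split="train")

print(corpus)     # number of rows and column names
print(corpus[0])  # first document
```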

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

We used continual pre-training from [camembert-base](https://huggingface.co/camembert-base).
The model was trained with the Masked Language Modeling (MLM) objective and Whole Word Masking for 50k steps, which took 39 hours on 2 Tesla V100 GPUs.
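
A rough sketch of what this continual pre-training setup looks like with the `transformers` `Trainer`. The batch size, learning rate, sequence length, and text column name are assumptions, and a plain token-level MLM collator stands in for the Whole Word Masking actually used, which needs a collator adapted to CamemBERT's SentencePiece tokenizer:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Continue pre-training from the general-domain checkpoint.
tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModelForMaskedLM.from_pretrained("camembert-base")

# Text column name is an assumption; adjust to the dataset's actual schema.
raw = load_dataset("rntc/biomed-fr", split="train")
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=raw.column_names,
)

# Plain MLM shown for simplicity; the authors use Whole Word Masking.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="camembert-bio-pretraining",
    max_steps=50_000,                # 50k steps, as reported above
    per_device_train_batch_size=8,   # assumption; not reported in this card
    learning_rate=5e-5,              # assumption; not reported in this card
    save_steps=5_000,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```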

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Fine-tuning

For fine-tuning, we used Optuna to select the hyperparameters.
The learning rate was set to 5e-5, with a warmup ratio of 0.224 and a batch size of 16.
Fine-tuning was carried out for 2000 steps.
For prediction, a simple linear layer was added on top of the model.
Notably, none of the CamemBERT layers were frozen during fine-tuning.
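
A minimal sketch of how these hyperparameters map onto a standard `transformers` token-classification setup. The Hub id, the label set, and the tokenization/label-alignment step are placeholders, not the authors' exact script:

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    TrainingArguments,
)

# Placeholder label set; the real labels depend on the NER corpus (CAS, E3C, EMEA, MEDLINE, ...).
labels = ["O", "B-ANAT", "I-ANAT", "B-CHEM", "I-CHEM"]

# Hub id is an assumption; use the actual repository name of this model.
model_id = "almanach/camembert-bio-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# A single linear classification layer is added on top of the encoder;
# no encoder layers are frozen, matching the setup described above.
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=len(labels))

args = TrainingArguments(
    output_dir="camembert-bio-ner",
    learning_rate=5e-5,              # selected with Optuna
    warmup_ratio=0.224,              # selected with Optuna
    per_device_train_batch_size=16,
    max_steps=2000,
)

# train_dataset / eval_dataset must be tokenized and label-aligned beforehand (omitted here):
# Trainer(model=model, args=args, train_dataset=train_dataset, eval_dataset=eval_dataset).train()
```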

### Scoring

To evaluate the performance of the model, we used the seqeval tool in strict mode with the IOB2 scheme.
For each evaluation, the best fine-tuned model on the validation set was selected to calculate the final score on the test set.
To ensure reliability, we averaged over 10 evaluations with different seeds.
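
A minimal sketch of this scoring setup with the `seqeval` package, using made-up gold and predicted tag sequences:

```python
from seqeval.metrics import f1_score, precision_score, recall_score
from seqeval.scheme import IOB2

# Toy IOB2 tag sequences, for illustration only.
y_true = [["B-DISO", "I-DISO", "O", "B-CHEM"], ["O", "B-ANAT", "O"]]
y_pred = [["B-DISO", "I-DISO", "O", "O"],      ["O", "B-ANAT", "O"]]

# Strict mode with the IOB2 scheme, as used for the results below.
print("P :", precision_score(y_true, y_pred, mode="strict", scheme=IOB2))
print("R :", recall_score(y_true, y_pred, mode="strict", scheme=IOB2))
print("F1:", f1_score(y_true, y_pred, mode="strict", scheme=IOB2))
```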

### Results

| Style        | Dataset | Score |   CamemBERT    |  CamemBERT-bio   |
| :----------- | :------ | :---- | :------------: | :--------------: |
| Clinical     | CAS1    | F1    | 70.50 ± 1.75   | **73.03 ± 1.29** |
|              |         | P     | 70.12 ± 1.93   | 71.71 ± 1.61     |
|              |         | R     | 70.89 ± 1.78   | **74.42 ± 1.49** |
|              | CAS2    | F1    | 79.02 ± 0.92   | **81.66 ± 0.59** |
|              |         | P     | 77.3 ± 1.36    | **80.96 ± 0.91** |
|              |         | R     | 80.83 ± 0.96   | **82.37 ± 0.69** |
|              | E3C     | F1    | 67.63 ± 1.45   | **69.85 ± 1.58** |
|              |         | P     | 78.19 ± 0.72   | **79.11 ± 0.42** |
|              |         | R     | 59.61 ± 2.25   | **62.56 ± 2.50** |
| Leaflets     | EMEA    | F1    | 74.14 ± 1.95   | **76.71 ± 1.50** |
|              |         | P     | 74.62 ± 1.97   | **76.92 ± 1.96** |
|              |         | R     | 73.68 ± 2.22   | **76.52 ± 1.62** |
| Scientific   | MEDLINE | F1    | 65.73 ± 0.40   | **68.47 ± 0.54** |
|              |         | P     | 64.94 ± 0.82   | **67.77 ± 0.88** |
|              |         | R     | 66.56 ± 0.56   | **69.21 ± 1.32** |

## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** 2 x Tesla V100
- **Hours used:** 39 hours
- **Provider:** INRIA clusters
- **Compute Region:** Paris, France
- **Carbon Emitted:** 0.84 kg CO2 eq.

<!-- ## Citation [optional] -->

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

<!-- **BibTeX:** -->

<!-- [More Information Needed] -->

<!-- **APA:** -->

<!-- [More Information Needed] -->