rntc committed on
Commit
a15a057
·
1 Parent(s): 4876b1f

Update README.md

  1. README.md +121 -1
README.md CHANGED
@@ -6,4 +6,124 @@ pipeline_tag: fill-mask
6
  tags:
7
  - medical
8
  - clinical
9
- ---
6
  tags:
7
  - medical
8
  - clinical
9
+ datasets:
10
+ - rntc/biomed-fr
11
+ ---
12
+
13
+ <a href=https://camembert-model.fr>
14
+ <img width="300px" src="https://camembert-model.fr/authors/admin/avatar_huac8a9374dbd7d6a2cb77224540858ab4_463389_270x270_fill_lanczos_center_3.png">
15
+ </a>
16
+
17
+ # CamemBERT-bio : a Tasty French Language Model Better for your Health
18
+
19
+ CamemBERT-bio is a state-of-the-art French biomedical language model built by continual pre-training from [camembert-base](https://huggingface.co/camembert-base).
20
+ It was trained on a public French biomedical corpus of 413M words containing scientific documents, drug leaflets, and clinical cases extracted from theses and articles.
21
+ On average, it improves the F1 score by 2.54 points over [camembert-base](https://huggingface.co/camembert-base) on 5 different biomedical named entity recognition tasks.
22
+
23
+ ## Abstract
24
+
25
+ Clinical data in hospitals are increasingly accessible for research through clinical data warehouses; however, these documents are unstructured. It is therefore necessary to extract information from medical
26
+ reports to conduct clinical studies. Transfer learning with BERT-like models such as CamemBERT
27
+ has allowed major advances, especially for named entity recognition. However, these models are
28
+ trained on general-domain language and are less effective on biomedical data. This is why we propose a new
29
+ public French biomedical dataset on which we have continued the pre-training of CamemBERT. Thus,
30
+ we introduce a first version of CamemBERT-bio, a specialized public model for the French biomedical
31
+ domain that shows 2.54 points of F1 score improvement on average on different biomedical named
32
+ entity recognition tasks.
33
+
34
+ - **Developed by:** Rian Touchent, Eric Villemonte de La Clergerie
35
+ - **License:** MIT
36
+
37
+ <!-- ### Model Sources [optional] -->
38
+
39
+ <!-- Provide the basic links for the model. -->
40
+
41
+ <!-- - **Website:** camembert-bio-model.fr -->
42
+ <!-- - **Paper [optional]:** [More Information Needed] -->
43
+ <!-- - **Demo [optional]:** [More Information Needed] -->
44
+
45
+
46
+ ## Training Details
47
+
48
+ ### Training Data
49
+
50
+ <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
51
+
52
+ | **Corpus** | **Details**                                                        | **Size (words)** |
53
+ |------------|--------------------------------------------------------------------|------------------|
54
+ | ISTEX      | Various scientific literature documents indexed on ISTEX           | 276 M            |
55
+ | CLEAR      | Drug leaflets                                                      | 73 M             |
56
+ | E3C        | Various documents from journals, drug leaflets, and clinical cases | 64 M             |
57
+ | Total      |                                                                    | 413 M            |
58
+
59
+
60
+ ### Training Procedure
61
+
62
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
63
+
64
+ We continued pre-training from [camembert-base](https://huggingface.co/camembert-base).
65
+ We trained the model with the Masked Language Modeling (MLM) objective and Whole Word Masking for 50k steps over 39 hours
66
+ on 2 Tesla V100 GPUs.
67
+
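To illustrate Whole Word Masking, the sketch below masks all sub-word pieces of a selected word together. It is a simplified example under stated assumptions, not the training code used for this model: the 15% masking budget is the usual BERT default rather than a value given here, the sub-word pieces are hypothetical, and every selected piece is replaced by the mask token (the common 80/10/10 replacement rule is left out).

```python
import random


def whole_word_mask(words, mask_prob=0.15, mask_token="<mask>", seed=0):
    """Mask whole words in a sentence that is already split into sub-word pieces.

    words: list of lists, one inner list of pieces per word.
    Returns (masked_tokens, labels); labels hold the original piece at
    masked positions and None elsewhere.
    mask_prob=0.15 is an assumed default, not a value from the model card.
    """
    rng = random.Random(seed)
    n_tokens = sum(len(pieces) for pieces in words)
    budget = max(1, round(n_tokens * mask_prob))  # number of tokens to cover

    # Pick words in random order until the token budget is reached.
    order = list(range(len(words)))
    rng.shuffle(order)
    chosen, covered = set(), 0
    for idx in order:
        if covered >= budget:
            break
        chosen.add(idx)
        covered += len(words[idx])

    # A chosen word has every one of its pieces masked, never a subset.
    masked, labels = [], []
    for idx, pieces in enumerate(words):
        for piece in pieces:
            if idx in chosen:
                masked.append(mask_token)
                labels.append(piece)
            else:
                masked.append(piece)
                labels.append(None)
    return masked, labels


# Hypothetical segmentation of "le paracetamol soulage la douleur".
example = [["le"], ["para", "ce", "tamol"], ["soulage"], ["la"], ["douleur"]]
masked_tokens, target_labels = whole_word_mask(example, seed=1)
print(masked_tokens)
```

Masking at the word level prevents the model from recovering a masked piece simply by reading the neighbouring pieces of the same word.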
68
+ ## Evaluation
69
+
70
+ <!-- This section describes the evaluation protocols and provides the results. -->
71
+
72
+ ### Fine-tuning
73
+
74
+ For fine-tuning, we used Optuna to select the hyperparameters.
75
+ The learning rate was set to 5e-5, with a warmup ratio of 0.224 and a batch size of 16.
76
+ The fine-tuning process was carried out for 2000 steps.
77
+ For prediction, a simple linear layer was added on top of the model.
78
+ Notably, none of the CamemBERT layers were frozen during the fine-tuning process.
79
+
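The fine-tuning hyperparameters above can be turned into a per-step learning rate as sketched below. The peak learning rate (5e-5), warmup ratio (0.224), and step count (2000) come from this section; the linear warmup followed by linear decay is an assumption, since the schedule shape is not stated here, and the function name is hypothetical.

```python
def learning_rate(step, peak_lr=5e-5, total_steps=2000, warmup_ratio=0.224):
    """Learning rate at a given step under an ASSUMED linear warmup/decay schedule."""
    warmup_steps = round(total_steps * warmup_ratio)  # 448 steps with the values above
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Linear decay from the peak down to 0 at the final step.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 224, 448, 1224, 2000):
    print(step, learning_rate(step))
```

With a warmup ratio of 0.224, the first 448 of the 2000 steps are spent ramping up, so the peak rate of 5e-5 is reached a little under a quarter of the way through fine-tuning.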
80
+ ### Scoring
81
+
82
+ To evaluate the performance of the model, we used the seqeval tool in strict mode with the IOB2 scheme.
83
+ For each evaluation, the best fine-tuned model on the validation set was selected to calculate the final score on the test set.
84
+ To ensure reliability, we averaged over 10 evaluations with different seeds.
85
+
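The scoring protocol can be sketched in plain Python as follows. This is a simplified stand-in for strict entity-level scoring with IOB2 tags, not seqeval's own implementation: an entity counts as correct only if its type and both boundaries match exactly. The entity labels (`DISO`, `CHEM`) and the per-seed scores are hypothetical, and using the sample standard deviation for the ± value is an assumption.

```python
from statistics import mean, stdev


def iob2_entities(tags):
    """Return the set of (type, start, end) spans found in one IOB2 tag sequence."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and kind == tag[2:]
        if continues:
            continue
        if kind is not None:
            spans.add((kind, start, i))
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
        else:
            # "O", or an "I-" tag that does not continue an open entity, opens nothing.
            start, kind = None, None
    return spans


def strict_f1(gold_sentences, pred_sentences):
    """Micro-averaged F1 over exact (sentence, type, start, end) matches."""
    gold = {(i, *s) for i, sent in enumerate(gold_sentences) for s in iob2_entities(sent)}
    pred = {(i, *s) for i, sent in enumerate(pred_sentences) for s in iob2_entities(sent)}
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = [["B-DISO", "I-DISO", "O", "B-CHEM"]]
pred = [["B-DISO", "O", "O", "B-CHEM"]]  # boundary error on the first entity
print(strict_f1(gold, pred))

# Aggregating hypothetical test scores from runs with different seeds.
seed_scores = [0.72, 0.74, 0.73]
print(f"{mean(seed_scores):.4f} ± {stdev(seed_scores):.4f}")
```

In strict mode, a prediction that finds the right entity type but the wrong span earns no partial credit, which is why the boundary error in the example costs both precision and recall.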
86
+ ### Results
87
+
88
+ | Style         | Dataset | Score | CamemBERT    | CamemBERT-bio    |
89
+ | :------------ | :------ | :---- | :----------: | :--------------: |
90
+ | Clinical      | CAS1    | F1    | 70.50 ± 1.75 | **73.03 ± 1.29** |
91
+ |               |         | P     | 70.12 ± 1.93 | 71.71 ± 1.61     |
92
+ |               |         | R     | 70.89 ± 1.78 | **74.42 ± 1.49** |
93
+ |               | CAS2    | F1    | 79.02 ± 0.92 | **81.66 ± 0.59** |
94
+ |               |         | P     | 77.3 ± 1.36  | **80.96 ± 0.91** |
95
+ |               |         | R     | 80.83 ± 0.96 | **82.37 ± 0.69** |
96
+ |               | E3C     | F1    | 67.63 ± 1.45 | **69.85 ± 1.58** |
97
+ |               |         | P     | 78.19 ± 0.72 | **79.11 ± 0.42** |
98
+ |               |         | R     | 59.61 ± 2.25 | **62.56 ± 2.50** |
99
+ | Drug leaflets | EMEA    | F1    | 74.14 ± 1.95 | **76.71 ± 1.50** |
100
+ |               |         | P     | 74.62 ± 1.97 | **76.92 ± 1.96** |
101
+ |               |         | R     | 73.68 ± 2.22 | **76.52 ± 1.62** |
102
+ | Scientific    | MEDLINE | F1    | 65.73 ± 0.40 | **68.47 ± 0.54** |
103
+ |               |         | P     | 64.94 ± 0.82 | **67.77 ± 0.88** |
104
+ |               |         | R     | 66.56 ± 0.56 | **69.21 ± 1.32** |
105
+
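The average gain of 2.54 F1 points quoted in the introduction can be checked against the F1 rows of the table. The short calculation below assumes the first score column is the camembert-base baseline and the second (bolded) column is CamemBERT-bio; the variable names are illustrative.

```python
# Mean F1 per dataset, copied from the results table: (baseline, CamemBERT-bio).
f1_scores = {
    "CAS1": (70.50, 73.03),
    "CAS2": (79.02, 81.66),
    "E3C": (67.63, 69.85),
    "EMEA": (74.14, 76.71),
    "MEDLINE": (65.73, 68.47),
}

# Per-dataset improvement in F1 points.
gains = {name: bio - base for name, (base, bio) in f1_scores.items()}
average_gain = sum(gains.values()) / len(gains)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
print(f"average: +{average_gain:.2f}")  # average: +2.54
```

The per-dataset gains range from 2.22 points (E3C) to 2.74 points (MEDLINE), so the improvement is fairly uniform across clinical, drug leaflet, and scientific text.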
106
+
107
+ ## Environmental Impact
108
+
109
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
110
+
111
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
112
+
113
+ - **Hardware Type:** 2 x Tesla V100
114
+ - **Hours used:** 39 hours
115
+ - **Provider:** INRIA clusters
116
+ - **Compute Region:** Paris, France
117
+ - **Carbon Emitted:** 0.84 kg CO2 eq.
118
+
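As a rough illustration of how such an estimate is formed, the sketch below multiplies GPU energy use by a grid carbon intensity. Only the GPU count (2), the duration (39 hours), and the reported 0.84 kg CO2 eq. come from the list above. The 300 W power draw per GPU is an assumption (it is not stated here), data-centre overhead is ignored, and the function names are hypothetical, so the implied carbon intensity is indicative only.

```python
def energy_kwh(n_gpus, watts_per_gpu, hours):
    """Energy drawn by the GPUs alone, in kWh (no data-centre overhead)."""
    return n_gpus * watts_per_gpu * hours / 1000.0


def emissions_kg(energy, kg_co2_per_kwh):
    """Emissions in kg CO2 eq. for a given energy use and grid carbon intensity."""
    return energy * kg_co2_per_kwh


ASSUMED_WATTS_PER_GPU = 300  # assumption, not a figure from the model card

energy = energy_kwh(n_gpus=2, watts_per_gpu=ASSUMED_WATTS_PER_GPU, hours=39)
implied_intensity = 0.84 / energy  # kg CO2 eq. per kWh implied by the reported total

print(f"energy: {energy:.1f} kWh")
print(f"implied intensity: {implied_intensity:.4f} kg CO2 eq./kWh")
```

Under these assumptions the run uses about 23.4 kWh, and the reported total corresponds to roughly 0.036 kg CO2 eq. per kWh; a different power draw would change that implied figure.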
119
+ <!-- ## Citation [optional] -->
120
+
121
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
122
+
123
+ <!-- **BibTeX:** -->
124
+
125
+ <!-- [More Information Needed] -->
126
+
127
+ <!-- **APA:** -->
128
+
129
+ <!-- [More Information Needed] -->