mmarimon committed on
Commit 1b80546 · 1 Parent(s): 0fad095

Update README.md

Files changed (1)
  1. README.md +18 -32
README.md CHANGED
@@ -20,45 +20,40 @@ widget:
  <details>
  <summary>Click to expand</summary>
 
- - [Model Description](#model-description)
- - [Intended Uses and Limitations](#intended-use)
- - [How to Use](#how-to-use)
  - [Limitations and bias](#limitations-and-bias)
  - [Training](#training)
- - [Training Data](#training-data)
- - [Training Procedure](#training-procedure)
  - [Evaluation](#evaluation)
- - [Additional Information](#additional-information)
- - [Contact Information](#contact-information)
  - [Copyright](#copyright)
- - [Licensing Information](#licensing-information)
  - [Funding](#funding)
- - [Citation Information](#citation-information)
- - [Contributions](#contributions)
  - [Disclaimer](#disclaimer)
 
  </details>
 
 
  ## Model description
-
  Biomedical pretrained language model for Spanish. For more details about the corpus, the pretraining and the evaluation, check the official [repository](https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es).
 
- ## Intended uses & limitations
-
- The model is ready-to-use only for masked language modelling to perform the Fill Mask task (try the inference API or read the next section)
 
- However, the is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.
 
- ## How to Use
 
 
  ## Limitations and bias
-
 
  ## Training
 
-
  ### Tokenization and model pretraining
 
  This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a
@@ -96,8 +91,7 @@ The result is a medium-size biomedical corpus for Spanish composed of about 963M
  | PubMed | 1,858,966 | Open-access articles from the PubMed repository crawled in 2017. |
 
 
-
- ## Evaluation and results
 
  The model has been fine-tuned on three Named Entity Recognition (NER) tasks using three clinical NER datasets:
 
@@ -122,23 +116,22 @@ The fine-tuning scripts can be found in the official GitHub [repository](https:/
 
  ## Additional information
 
- ### Contact Information
 
  For further information, send an email to <[email protected]>
 
  ### Copyright
-
  Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)
 
  ### Licensing information
-
  [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
 
  ### Funding
-
  This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
 
- ### Cite
  If you use these models, please cite our work:
 
  ```bibtext
@@ -164,13 +157,6 @@ If you use these models, please cite our work:
  abstract = "This work presents the first large-scale biomedical Spanish language models trained from scratch, using large biomedical corpora consisting of a total of 1.1B tokens and an EHR corpus of 95M tokens. We compared them against general-domain and other domain-specific models for Spanish on three clinical NER tasks. As main results, our models are superior across the NER tasks, rendering them more convenient for clinical NLP applications. Furthermore, our findings indicate that when enough data is available, pre-training from scratch is better than continual pre-training when tested on clinical tasks, raising an exciting research question about which approach is optimal. Our models and fine-tuning scripts are publicly available at HuggingFace and GitHub.",
  }
  ```
- ---
-
-
- ### Contributions
-
- [N/A]
-
 
  ### Disclaimer
 
 
  <details>
  <summary>Click to expand</summary>
 
+ - [Model description](#model-description)
+ - [Intended uses and limitations](#intended-use)
+ - [How to use](#how-to-use)
  - [Limitations and bias](#limitations-and-bias)
  - [Training](#training)
+ - [Tokenization and model pretraining](#Tokenization-modelpretraining)
+ - [Training corpora and preprocessing](#Trainingcorpora-preprocessing)
  - [Evaluation](#evaluation)
+ - [Additional information](#additional-information)
+ - [Contact information](#contact-information)
  - [Copyright](#copyright)
+ - [Licensing information](#licensing-information)
  - [Funding](#funding)
+ - [Citation information](#citation-information)
  - [Disclaimer](#disclaimer)
 
  </details>
 
 
  ## Model description
  Biomedical pretrained language model for Spanish. For more details about the corpus, the pretraining and the evaluation, check the official [repository](https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es).
 
+ ## Intended uses and limitations
 
+ The model is ready-to-use only for masked language modelling to perform the Fill Mask task (try the inference API or read the next section). However, it is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.
 
+ ## How to use
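
The "How to use" section is left empty in this revision. As a minimal sketch of the Fill Mask usage the card describes, the snippet below assumes the Hugging Face `transformers` library and a RoBERTa-style `<mask>` token. The helper names `mask_word` and `fill_mask` are hypothetical, and the model identifier is deliberately left as a parameter because this diff does not state it.

```python
MASK_TOKEN = "<mask>"  # RoBERTa-style mask token (assumed for this model)


def mask_word(sentence: str, target: str, mask_token: str = MASK_TOKEN) -> str:
    """Replace the first occurrence of `target` in `sentence` with the mask token."""
    if target not in sentence:
        raise ValueError(f"{target!r} not found in sentence")
    return sentence.replace(target, mask_token, 1)


def fill_mask(model_id: str, masked_sentence: str, top_k: int = 5):
    """Return (token, score) pairs predicted for the masked position.

    Requires `transformers` to be installed and network access to download
    the model; `model_id` is the model's identifier on the Hugging Face Hub.
    """
    # Imported lazily so mask_word() stays usable without transformers.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_id)
    predictions = unmasker(masked_sentence, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]


# Build a masked Spanish sentence; pass it to fill_mask() with the model's Hub id.
masked = mask_word("El paciente presenta fiebre alta.", "fiebre")
```

Calling `fill_mask(model_id, masked)` with this model's Hub identifier would then list candidate tokens for the masked position, ranked by score.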
 
 
  ## Limitations and bias
+ At the time of submission, no measures have been taken to estimate the bias embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
 
  ## Training
 
  ### Tokenization and model pretraining
 
  This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a
 
  | PubMed | 1,858,966 | Open-access articles from the PubMed repository crawled in 2017. |
 
 
+ ## Evaluation
 
  The model has been fine-tuned on three Named Entity Recognition (NER) tasks using three clinical NER datasets:
 
 
 
  ## Additional information
 
+ ### Author
+ Text Mining Unit (TeMU) at the Barcelona Supercomputing Center ([email protected])
 
+ ### Contact information
  For further information, send an email to <[email protected]>
 
  ### Copyright
  Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)
 
  ### Licensing information
  [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
 
  ### Funding
  This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
 
+ ### Citation information
  If you use these models, please cite our work:
 
  ```bibtext
 
  abstract = "This work presents the first large-scale biomedical Spanish language models trained from scratch, using large biomedical corpora consisting of a total of 1.1B tokens and an EHR corpus of 95M tokens. We compared them against general-domain and other domain-specific models for Spanish on three clinical NER tasks. As main results, our models are superior across the NER tasks, rendering them more convenient for clinical NLP applications. Furthermore, our findings indicate that when enough data is available, pre-training from scratch is better than continual pre-training when tested on clinical tasks, raising an exciting research question about which approach is optimal. Our models and fine-tuning scripts are publicly available at HuggingFace and GitHub.",
  }
  ```
 
  ### Disclaimer