CoronaCentral BERT Model for Topic / Article Type Classification
This is the multi-label topic / article type classification model for the CoronaCentral website. It forms part of the pipeline for downloading and processing coronavirus literature described in the corona-ml repo, which includes step-by-step descriptions. The method is described in the preprint, and detailed performance results can be found in the machine learning details document.
This model was derived by fine-tuning the microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract model on this coronavirus sequence (document) classification task.
Usage
Below are two Google Colab notebooks with example usage of this sequence classification model using HuggingFace transformers and KTrain.
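For readers who want a quick local test without opening a notebook, the following is a minimal sketch of the HuggingFace transformers route. It is not taken from the notebooks. It assumes that `model_id` is the Hub identifier or local path of this model (not stated here, so it is left as a parameter), that the checkpoint's config carries the category names in `id2label`, and that a per-label sigmoid with a 0.5 cutoff is a reasonable way to read multi-label outputs. The function name `classify` and the `threshold` parameter are illustrative.

```python
def classify(text, model_id, threshold=0.5):
    """Return {category: probability} for categories scoring at or above threshold.

    Assumptions (not confirmed by the model card): model_id points at this
    model, config.id2label holds the category names, and 0.5 is a sensible cutoff.
    """
    # Imported inside the function so the heavy dependencies load only when used.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    # BERT-style models accept at most 512 tokens, so longer abstracts are truncated.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]

    # Multi-label: each category gets its own sigmoid, not a shared softmax.
    probs = torch.sigmoid(logits).tolist()
    return {
        model.config.id2label[i]: p
        for i, p in enumerate(probs)
        if p >= threshold
    }

# Example call (model_id is a placeholder you must fill in):
# classify("Some title\nSome abstract text.", model_id="<hub-id-or-local-path>")
```

Reloading the tokenizer and model on every call keeps the sketch short; for batch use you would load them once and reuse them.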
Training Data
The model was trained on ~3200 manually curated articles sampled at various stages of the coronavirus pandemic. The training code is available in the category_prediction directory of the main GitHub repo. The data is available in the annotated_documents.json.gz file.
Inputs and Outputs
The model takes in a tokenized title and abstract (combined into a single string and separated by a newline). The outputs are topics and article types, broadly called categories in the pipeline code. The article types and topics predicted by the model are listed below. Some other categories are assigned by hand-coded rules described in the step-by-step descriptions.
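The input and output conventions described above can be sketched in plain Python. This is an illustration under assumptions, not the pipeline's code: the helper names `build_input` and `decode_multilabel`, the 0.5 threshold, and the three-label example ordering are all hypothetical. The real label order comes from the trained model's configuration.

```python
import math

def build_input(title, abstract):
    """Combine title and abstract into one string separated by a newline."""
    return f"{title}\n{abstract}"

def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def decode_multilabel(logits, labels, threshold=0.5):
    """Map raw logits to (label, probability) pairs at or above threshold.

    Each label is scored independently, so a document can receive several
    categories (for example an article type plus one or more topics) or none.
    """
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    scored = ((label, _sigmoid(z)) for label, z in zip(labels, logits))
    return [(label, p) for label, p in scored if p >= threshold]

# Hypothetical example: three categories in a made-up order with made-up logits.
example_labels = ["Review", "Therapeutics", "Vaccines"]
example_logits = [2.0, -1.0, 0.3]
text = build_input("A title", "An abstract.")
predicted = decode_multilabel(example_logits, example_labels)
```

Because the labels are independent, raising the threshold trades recall for precision on a per-category basis.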
List of Article Types
- Comment/Editorial
- Meta-analysis
- News
- Review
List of Topics
- Clinical Reports
- Communication
- Contact Tracing
- Diagnostics
- Drug Targets
- Education
- Effect on Medical Specialties
- Forecasting & Modelling
- Health Policy
- Healthcare Workers
- Imaging
- Immunology
- Inequality
- Infection Reports
- Long Haul
- Medical Devices
- Misinformation
- Model Systems & Tools
- Molecular Biology
- Non-human
- Non-medical
- Pediatrics
- Prevalence
- Prevention
- Psychology
- Recommendations
- Risk Factors
- Surveillance
- Therapeutics
- Transmission
- Vaccines