VirusT5: Harnessing Large Language Models to Predict SARS-CoV-2 Evolution

GitHub Link - https://github.com/vrmarathe/VirusT5

Overview

VirusT5 is a transformer-based language model built on the T5 architecture, designed to predict SARS-CoV-2 evolution. By modeling viral mutations as a "mutation-as-translation" process, VirusT5 captures mutation patterns in the Receptor-Binding Domain (RBD) of the spike protein, identifies mutation hotspots, and forecasts future viral strains.

Features

  • Variant Classification: Accurately classifies SARS-CoV-2 variants based on RBD sequences.
  • Mutation Prediction: Translates parental RBD sequences into evolved child sequences.
  • Generative Evolution: Simulates multi-generational viral evolution.

How It Works

VirusT5 is pretrained on 100,000 SARS-CoV-2 genome sequences from the GISAID database. Fine-tuning involves tasks like:

  1. Classifying RBD variant types.
  2. Translating parent-child mutation pairs to predict evolutionary changes.
  3. Simulating mutations across multiple viral generations.

How To Use The Pretrained Model

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer for the VirusT5 model
tokenizer = AutoTokenizer.from_pretrained("vrmarathe/VirusT5", trust_remote_code=True)

# Load the pre-trained VirusT5 model (T5-based)
model = AutoModelForSeq2SeqLM.from_pretrained("vrmarathe/VirusT5", trust_remote_code=True, from_flax=True)
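Once a checkpoint fine-tuned for mutation translation is available, generating a child sequence might look like the sketch below. The helper names `predict_child` and `simulate_generations`, the `max_new_tokens` value, and the assumption that a raw nucleotide string can be passed to the tokenizer without a task prefix are illustrative and are not taken from the repository. The pretrained checkpoint loaded above would first need fine-tuning on the parent-child task.

```python
def predict_child(model, tokenizer, parent_rbd, max_new_tokens=512):
    """Hypothetical helper: translate a parent RBD sequence into a predicted child."""
    # Tokenize the parent sequence into PyTorch tensors (assumes no task prefix is needed).
    inputs = tokenizer(parent_rbd, return_tensors="pt", truncation=True)
    # Generate the child sequence with the checkpoint's default generation settings.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Convert the generated token IDs back to a string, dropping special tokens.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


def simulate_generations(model, tokenizer, seed_rbd, n_generations=3):
    """Hypothetical helper: feed each predicted child back in as the next parent."""
    lineage = [seed_rbd]
    for _ in range(n_generations):
        lineage.append(predict_child(model, tokenizer, lineage[-1]))
    return lineage
```

Calling `simulate_generations(model, tokenizer, seed_rbd)` returns the seed followed by one predicted sequence per generation.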

Performance Highlights

  • Variant Classification Accuracy: 97.29%
  • Mutation Translation BLEU Score: 0.999
  • Multi-Generational Evolution Simulation Accuracy: 100%

Installation

Clone the repository and set up the required dependencies:

git clone https://github.com/vrmarathe/VirusT5.git  
cd VirusT5
cd environment
conda env create -f flax2_environment.yml

Datasets

VirusT5 was trained and fine-tuned using the following datasets:

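1. Genome Dataset

  • Description: Contains 100,000 SARS-CoV-2 genome sequences from the GISAID database.
  • Usage: Pretraining.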

2. Receptor Binding Domain (RBD) Dataset

  • Description: Contains genetic sequences encoding the receptor-binding domain of the SARS-CoV-2 spike protein.
  • Usage: Fine-tuning for variant classification.
  • Details:
    • Codon-aware multiple sequence alignment (MSA) performed using MUSCLE.
    • Mapped to reference genome (NCBI: NC_004718.3).

3. Parent-Child Dataset

  • Description: Contains pairs of RBD sequences where one sequence acts as the evolutionary parent of the other.
  • Usage: Fine-tuning for "mutation-as-translation" tasks, where the model predicts the child sequence from the parent sequence.
  • Details:
    • Constructed from RBD sequences divided into 10 temporal bins.
    • Includes 500,000 parent-child pairs sampled across Alpha, Delta, Omicron, and non-VOC variants.
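
One way to build such pairs is sketched below. It assumes that sequences are sorted by collection date, split into contiguous temporal bins, and that each parent drawn from one bin is paired with a randomly chosen child from the next bin. The pairing rule, the function name `make_parent_child_pairs`, and the `(date, sequence)` record layout are assumptions for illustration, not the authors' documented procedure.

```python
import random


def make_parent_child_pairs(records, n_bins=10, pairs_per_bin=2, seed=0):
    """records: list of (collection_date, rbd_sequence) tuples with sortable dates."""
    rng = random.Random(seed)
    # Order sequences by collection date, oldest first.
    ordered = [seq for _, seq in sorted(records, key=lambda r: r[0])]
    # Split the date-ordered sequences into n_bins contiguous temporal bins.
    bin_size = max(1, len(ordered) // n_bins)
    bins = [ordered[i * bin_size:(i + 1) * bin_size] for i in range(n_bins - 1)]
    bins.append(ordered[(n_bins - 1) * bin_size:])  # last bin takes the remainder
    pairs = []
    # Pair parents from each bin with children from the following bin (assumed rule).
    for parents, children in zip(bins, bins[1:]):
        if not parents or not children:
            continue
        for _ in range(pairs_per_bin):
            pairs.append((rng.choice(parents), rng.choice(children)))
    return pairs
```

Each returned tuple is `(parent_sequence, child_sequence)`, with the parent always drawn from the earlier bin.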

Notes

  • Access: While the datasets rely on public resources like GISAID, access may require registration or compliance with their terms of use.
  • Preprocessing: Preprocessing scripts for dataset preparation are available in the Preprocessing in Pretaining and FineTuning Datasets directory.
  • Datasets will be provided on request.

Pretraining and Fine-Tuning

Pretraining

VirusT5 was pretrained on a large corpus of SARS-CoV-2 genome sequences to learn the underlying syntax and grammar of genomic data.

  • Dataset: Genome Dataset comprising 100,000 SARS-CoV-2 genome sequences from GISAID.
  • Objective: Masked Language Modeling (MLM) with 15% token masking using sentinel tokens.
  • Sequence Length: Genomes are segmented into sequences of up to 512 base pairs.
  • Optimization:
    • Inverse square root learning rate schedule.
    • Initial learning rate: 0.005 for 2,000 steps, followed by exponential decay.
  • Training Hardware:
    • NDSU CCAST HPC clusters with 32 CPU cores, 100 GB RAM, and two NVIDIA A40 GPUs (40 GB each).
  • Duration: Pretrained for 12,000 steps.
  • Scripts: The pretraining scripts can be found in the pretraining folder.
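
T5-style masked language modeling replaces masked spans in the input with sentinel tokens and asks the model to reproduce the masked content in the target. The toy sketch below illustrates that input/target format on a nucleotide string. The character-level tokens, the fixed span length of 3, and the `<extra_id_N>` sentinel names (the convention used by the Hugging Face T5 tokenizers) are assumptions; VirusT5's actual tokenizer and span sampling may differ.

```python
import random


def span_corrupt(tokens, mask_rate=0.15, span_len=3, seed=0):
    """Toy T5-style span corruption over a list of tokens."""
    rng = random.Random(seed)
    n_spans = max(1, round(len(tokens) * mask_rate / span_len))
    # Candidate starts are spaced 2 * span_len apart so masked spans never touch.
    slots = list(range(0, len(tokens) - span_len + 1, 2 * span_len))
    starts = sorted(rng.sample(slots, min(n_spans, len(slots))))
    inputs, targets, pos = [], [], 0
    for i, start in enumerate(starts):
        sentinel = f"<extra_id_{i}>"
        # Input keeps the unmasked context and a sentinel in place of the span.
        inputs += tokens[pos:start] + [sentinel]
        # Target lists each sentinel followed by the tokens it replaced.
        targets += [sentinel] + tokens[start:start + span_len]
        pos = start + span_len
    inputs += tokens[pos:]
    targets.append(f"<extra_id_{len(starts)}>")  # closing sentinel, as in T5
    return inputs, targets
```

Substituting each sentinel in the input with the tokens that follow it in the target recovers the original sequence.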

Fine-Tuning

Fine-tuning tailored the pretrained VirusT5 model for specific downstream tasks, such as classification and mutation prediction.

Tasks

  1. Variant Classification:

    • Dataset: RBD Dataset, divided into training (60%), validation (20%), and test (20%) sets.
    • Objective: Predict variant types (e.g., Alpha, Delta, Omicron, non-VOC) from RBD sequences.
    • Result: Achieved 97.29% accuracy.
    • The original fine-tuning script for RBD classification can be found in the rbd-classification folder (rbd-classifier).
    • The general classifier script, which can be used for other classification experiments, can be found in General Classification.
  2. Mutation Translation:

    • Dataset: Parent-Child Dataset with 500,000 RBD sequence pairs representing evolutionary parent-child relationships.
    • Objective: Predict how an RBD sequence evolves from one generation to the next.
    • The original fine-tuning script for RBD translation/evolution prediction can be found in RBD-translation.
    • The general mutation translation script, which can be used for other experiments, can be found in Translation-general.
    • Evaluation:
      • BLEU Score: 0.999
      • Sequence Identity: 99.97% ± 0.1%
  3. For Other Tasks

    • The model is based on the T5 architecture, so it can be fine-tuned for DNA-, genome-, or virus-related analogues of the tasks T5 was fine-tuned on, such as summarization and question answering.
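
The sequence identity reported for mutation translation can be computed position by position once the predicted and reference sequences are aligned. The sketch below assumes equal-length, already-aligned sequences and counts every position, gaps included. The function name `sequence_identity` is illustrative, and the authors' exact definition may differ.

```python
def sequence_identity(predicted, reference):
    """Percentage of aligned positions where the two sequences agree."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    if not reference:
        raise ValueError("sequences must be non-empty")
    # Count positions where the predicted and reference characters match.
    matches = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * matches / len(reference)
```

For example, two aligned four-base sequences that differ at one position have an identity of 75%.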

Fine-Tuning Process

  • The model was trained and validated over multiple epochs until convergence, stopping when both training and validation losses stabilized.
  • The following split was used for all datasets:
    • Training: 60%
    • Validation: 20%
    • Testing: 20%
  • Fine-tuning used hardware similar to that used for pretraining.
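
A 60/20/20 split like the one above can be produced with a seeded shuffle. This sketch uses only the Python standard library; the function name `split_dataset`, the seed, and the purely random (rather than stratified or temporal) assignment are assumptions.

```python
import random


def split_dataset(items, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle items reproducibly and cut them into train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test
```

Fixing the seed keeps the three subsets identical across runs, which makes reported metrics easier to reproduce.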

Citation

If you use VirusT5 in your research, please cite the following paper:

@misc{marathe2024virust5harnessinglargelanguage,
      title={VirusT5: Harnessing Large Language Models to Predicting SARS-CoV-2 Evolution}, 
      author={Vishwajeet Marathe and Deewan Bajracharya and Changhui Yan},
      year={2024},
      eprint={2412.16262},
      archivePrefix={arXiv},
      primaryClass={q-bio.QM},
      url={https://arxiv.org/abs/2412.16262}, 
}