yurakuratov committed · commit 26ac5d6 · parent ce4721e

docs: update usage example

README.md:

tags:
- human_genome
---

# GENA-LM (gena-lm-bigbird-base-sparse-t2t)

GENA-LM is a family of open-source foundational models for long DNA sequences.

GENA-LM models are transformer masked language models trained on human DNA sequences.

`gena-lm-bigbird-base-sparse-t2t` follows the BigBird architecture and uses sparse attention from DeepSpeed.

Differences between GENA-LM (`gena-lm-bigbird-base-sparse-t2t`) and DNABERT:
- BPE tokenization instead of k-mers (see the tokenizer sketch after this list);
- input sequence size is about 36000 nucleotides (4096 BPE tokens) compared to 512 nucleotides of DNABERT;
- pre-training on the T2T vs. GRCh38.p13 human genome assembly.
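
To make the tokenization difference concrete, here is a minimal sketch of inspecting the BPE tokens for a DNA string (the sequence is illustrative; the exact tokens depend on the learned vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t')

seq = 'ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTC'  # illustrative DNA string
tokens = tokenizer.tokenize(seq)
print(tokens)                 # variable-length BPE tokens rather than fixed k-mers
print(len(seq), len(tokens))  # one token covers roughly 9 bp on average
```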

Source code and data: https://github.com/AIRI-Institute/GENA_LM

Paper: https://www.biorxiv.org/content/10.1101/2023.06.12.544594v1

## Installation

`gena-lm-bigbird-base-sparse-t2t` sparse ops require DeepSpeed.

### DeepSpeed

DeepSpeed installation is needed to work with the SparseAttention versions of the language models. DeepSpeed sparse attention supports only GPUs with compute capability >= 7 (V100, T4, A100).
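
A quick way to check whether your GPU meets this requirement (a small sketch using PyTorch, which `transformers` already depends on; it assumes a CUDA device is visible):

```python
import torch

# DeepSpeed sparse attention needs compute capability >= 7.0 (e.g. V100, T4, A100)
major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (7, 0), f'compute capability {major}.{minor} is too low for sparse ops'
```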

## Examples

### How to load the pre-trained model for Masked Language Modeling

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t')
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t', trust_remote_code=True)
```
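
A minimal usage sketch, continuing from the snippet above. The DNA string is illustrative, and padding to the full 4096-token input is an assumption made because sparse attention ops typically expect a block-aligned sequence length. Note that `AutoModel` returns the base encoder (token embeddings); a MaskedLM head can be obtained with the `importlib` approach shown further below.

```python
import torch

seq = 'ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCC'  # illustrative DNA fragment
# assumption: pad to the model's full 4096-token input for the sparse ops
inputs = tokenizer(seq, padding='max_length', max_length=4096, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, 4096, hidden_size)
```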

### How to load the pre-trained model to fine-tune it on a classification task

Get the model class from the GENA-LM repository:

```bash
git clone https://github.com/AIRI-Institute/GENA_LM.git
```

```python
from GENA_LM.src.gena_lm.modeling_bert import BertForSequenceClassification
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t')
model = BertForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t')
```

Alternatively, you can just download [modeling_bert.py](https://github.com/AIRI-Institute/GENA_LM/tree/main/src/gena_lm) and put it next to your code.

Or you can get the model class from HuggingFace `AutoModel`:

```python
import importlib

from transformers import AutoTokenizer, AutoModel

model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t', trust_remote_code=True)
gena_module_name = model.__class__.__module__
print(gena_module_name)

# available class names:
# - BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
# - BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification,
# - BertForQuestionAnswering
# check https://huggingface.co/docs/transformers/model_doc/bert
cls = getattr(importlib.import_module(gena_module_name), 'BertForSequenceClassification')
print(cls)
model = cls.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t', num_labels=2)
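
A minimal fine-tuning sketch for either loading path above. The sequences, labels, and single optimizer step are illustrative; a real run would use a proper dataset, batching, and the padding scheme the sparse ops expect.

```python
import torch

seqs = ['ATGGTGAGCAAGGGC', 'TTAGGCATCGATCGA']  # illustrative DNA strings
labels = torch.tensor([0, 1])                  # illustrative binary labels

batch = tokenizer(seqs, padding=True, return_tensors='pt')
outputs = model(**batch, labels=labels)        # returns loss and logits when labels are given

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(outputs.loss.item(), outputs.logits.shape)  # logits: (batch_size, num_labels)
```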

## Model description