yurakuratov committed
Commit 26ac5d6 · Parent: ce4721e

docs: update usage example

Files changed (1):
  1. README.md (+34 -10)
README.md CHANGED
@@ -4,15 +4,15 @@ tags:
 - human_genome
 ---
 
-# GENA-LM (gena-lm-bigbird-base-sparse-t2t)
+# GENA-LM (gena-lm-bigbird-base-sparse-t2t-t2t)
 
 GENA-LM is a Family of Open-Source Foundational Models for Long DNA Sequences.
 
 GENA-LM models are transformer masked language models trained on human DNA sequence.
 
-`gena-lm-bigbird-base-sparse-t2t` follows the BigBird architecture and uses sparse attention from DeepSpeed.
+`gena-lm-bigbird-base-sparse-t2t-t2t` follows the BigBird architecture and uses sparse attention from DeepSpeed.
 
-Differences between GENA-LM (`gena-lm-bigbird-base-sparse-t2t`) and DNABERT:
+Differences between GENA-LM (`gena-lm-bigbird-base-sparse-t2t-t2t`) and DNABERT:
 - BPE tokenization instead of k-mers;
 - input sequence size is about 36000 nucleotides (4096 BPE tokens) compared to 512 nucleotides of DNABERT;
 - pre-training on T2T vs. GRCh38.p13 human genome assembly.
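The tokenizer claims in this hunk (BPE tokens spanning several nucleotides, a roughly 4096-token input window) can be checked directly; a small illustrative sketch, using an arbitrary DNA string rather than anything from the model card:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t')

# Arbitrary DNA fragment; GENA-LM's BPE merges several nucleotides into one token.
seq = 'ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAG'
tokens = tokenizer.tokenize(seq)
print(tokens)                      # BPE tokens, typically several bp each
print(len(seq), len(tokens))       # nucleotides vs. tokens
print(tokenizer.model_max_length)  # expected to be on the order of 4096 for this checkpoint
```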
@@ -22,7 +22,7 @@ Source code and data: https://github.com/AIRI-Institute/GENA_LM
 Paper: https://www.biorxiv.org/content/10.1101/2023.06.12.544594v1
 
 ## Installation
-`gena-lm-bigbird-base-sparse-t2t` sparse ops require DeepSpeed.
+`gena-lm-bigbird-base-sparse-t2t-t2t` sparse ops require DeepSpeed.
 
 ### DeepSpeed
 DeepSpeed installation is needed to work with SparseAttention versions of language models. DeepSpeed Sparse attention supports only GPUs with compute compatibility >= 7 (V100, T4, A100).
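DeepSpeed sparse attention only runs on GPUs with compute capability >= 7, so a pre-flight hardware check can save a failed install; a minimal sketch using PyTorch only (it does not install or verify DeepSpeed itself):

```python
import torch

# DeepSpeed sparse attention needs an NVIDIA GPU with compute capability >= 7
# (e.g. V100, T4, A100); this snippet only inspects the hardware.
if not torch.cuda.is_available():
    raise SystemExit('No CUDA device visible: sparse attention ops will not run.')

major, minor = torch.cuda.get_device_capability(0)
print(f'GPU: {torch.cuda.get_device_name(0)}, compute capability {major}.{minor}')
if major < 7:
    print('Compute capability < 7: DeepSpeed sparse attention is not supported on this GPU.')
```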
@@ -45,21 +45,45 @@ pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp
 
 ## Examples
 
-### Load pre-trained model
+### How to load pre-trained model for Masked Language Modeling
 ```python
-from transformers import AutoTokenizer, BigBirdForMaskedLM
+from transformers import AutoTokenizer, AutoModel
 
 tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t')
-model = BigBirdForMaskedLM.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t')
+model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t', trust_remote_code=True)
+
 ```
 
+### How to load pre-trained model to fine-tune it on classification task
+Get model class from GENA-LM repository:
+```bash
+git clone https://github.com/AIRI-Institute/GENA_LM.git
+```
 
-### How to load the model to fine-tune it on classification task
 ```python
-from transformers import AutoTokenizer, BigBirdForSequenceClassification
+from GENA_LM.src.gena_lm.modeling_bert import BertForSequenceClassification
+from transformers import AutoTokenizer
 
 tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t')
-model = BigBirdForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t')
+model = BertForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t')
+```
+or you can just download [modeling_bert.py](https://github.com/AIRI-Institute/GENA_LM/tree/main/src/gena_lm) and put it close to your code.
+
+OR you can get model class from HuggingFace AutoModel:
+```python
+from transformers import AutoTokenizer, AutoModel
+model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t', trust_remote_code=True)
+gena_module_name = model.__class__.__module__
+print(gena_module_name)
+import importlib
+# available class names:
+# - BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
+# - BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification,
+# - BertForQuestionAnswering
+# check https://huggingface.co/docs/transformers/model_doc/bert
+cls = getattr(importlib.import_module(gena_module_name), 'BertForSequenceClassification')
+print(cls)
+model = cls.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t', num_labels=2)
 ```
 
 ## Model description
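A short inference sketch for the updated Masked Language Modeling example above. It assumes the checkpoint's remote code follows the usual BERT forward interface (returning `last_hidden_state`), that a CUDA device with the DeepSpeed sparse ops is available, and that padding to the full 4096-token window satisfies the sparse block layout; the sequence itself is arbitrary:

```python
import torch
from transformers import AutoTokenizer, AutoModel

device = 'cuda'  # sparse attention kernels target CUDA devices (see Installation)

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t')
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t',
                                  trust_remote_code=True).to(device).eval()

seq = 'ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCC'  # arbitrary DNA fragment
# Assumption: pad to the full window so the sparse attention block layout is satisfied.
inputs = tokenizer(seq, padding='max_length', max_length=4096,
                   truncation=True, return_tensors='pt').to(device)

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden_size) token embeddings
```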
 
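And a minimal fine-tuning step for the classification route, resolving `BertForSequenceClassification` through the AutoModel/importlib pattern shown in the hunk; the sequences, labels, and optimizer settings are placeholders, and the same GPU and padding caveats as above apply:

```python
import importlib

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t')

# Resolve the classification class from the checkpoint's remote code, as in the hunk above.
base = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t', trust_remote_code=True)
cls = getattr(importlib.import_module(base.__class__.__module__), 'BertForSequenceClassification')
model = cls.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-sparse-t2t', num_labels=2)

# Toy batch: two arbitrary DNA fragments with made-up binary labels.
seqs = ['ATGGTGCACCTGACTCCTGA', 'GGGCGGCGACTAGGGGTCTC']
batch = tokenizer(seqs, padding=True, return_tensors='pt')
labels = torch.tensor([0, 1])

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
optimizer.zero_grad()
outputs = model(**batch, labels=labels)  # HF-style heads return .loss when labels are passed
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```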