liangxz committed on
Commit
2af9c28
·
1 Parent(s): c86a96f

Upload README.md

  1. README.md +129 -0
README.md ADDED
@@ -0,0 +1,129 @@
1
+ ---
2
+ license: apache-2.0
3
+ tags:
4
+ - chemistry
5
+ - biology
6
+ - protein
7
+ - instructions
8
+ ---
9
+
10
+
11
+
12
+ This repo contains a fully fine-tuned [LLaMA-7b](https://huggingface.co/decapoda-research/llama-7b-hf), trained on the 🧬 **protein-oriented instructions** from the 🧪 [Mol-Instructions](https://huggingface.co/datasets/zjunlp/Mol-Instructions) dataset.
13
+
14
+
15
+ Instructions for running it can be found at https://github.com/zjunlp/Mol-Instructions.
16
+
17
+ > Please refer to our [paper](https://arxiv.org/abs/2306.08018) for more details.
18
+
19
+ ![image.png](logo.png)
20
+
21
+ <h3> 🧬 Tasks</h3>
22
+
23
+ <details>
24
+ <summary><b>Protein design</b></summary>
25
+
26
+ - *Design a protein sequence that exhibits the desired activity and specificity.*
27
+ 1. The presence of Mg(2+) is necessary for the protein to function in the desired environment.
28
+ 2. The AMP, (6S)-NADPHX binding site should be located in a region of the protein that is accessible to the ligand.
29
+ 3. The designed protein should have ATP binding, NADPHX epimerase activity, metal ion binding, ADP-dependent NAD(P)H-hydrate dehydratase activity to facilitate nicotinamide nucleotide metabolic process.
30
+ 4. For general function, the protein need meet that Catalyzes the epimerization of the S- and R-forms of NAD(P)HX, a damaged form of NAD(P)H that is a result of enzymatic or heat-dependent hydration
31
+
32
+ ```
33
+ MSNELVLSREQVRRVDQRAIEAYGVPGIVLMENAGRGAAEIIRAACPSAQRVLIACGPGNNGGDGFVIARHLANAGWMVELLLACPADRITGDAQGNHEIIRRMNLPCAVMADARDLEAANDRFATADVIVDALLGTGASGPPREPIASLIRAINEAHRRVSAQPAPSVFAVDIPSGLDCDTGEAANPTVRADHTITFVARKIGFRNPAARDLLGRVHVVDIGAPRAAIQDALTGKSG
34
+ ```
35
+ </details>
36
+
37
+
38
+ <details>
39
+ <summary><b>Catalytic activity prediction</b></summary>
40
+
41
+ - *Please evaluate the following protein sequence and provide an explanation of the enzyme's catalytic activity, including the chemical reaction it facilitates:* MDKVAVAGFLPEELCASLSLSPSFRGNQIFQWIGKGVDSFDAMTNLSAELRASLAEKAILRSTRVSDVLKADDGTVKLQIQTEDDLAVETVLLTDKAARKTACVSCQAGCAMGCAFCKTGTLGLARNLSAAEIVEQFLYLEKHAGALDNIVFMGMGEPLLNLDALRKAIAVLTDKRGRNLSSRRITVSTVGIVSGIYDLANNGPDVRLAVSLTTADETLRRELMPASLTNPLSDLRQAISYYIEKTGKRVTLEAVLLSGKNTSEKNADSLIAFAKGLDVHVNLIPWNPVEGLSFVTPDPEETAQFVSRLEKGGLNVTLRMHRGKSISGACGQLGKTNPYA
42
+
43
+ ```
44
+ Based on the provided protein sequence, the enzyme appears to facilitate the chemical reaction: adenosine(37) in tRNA + 2 reduced [2Fe-2S]-[ferredoxin] + 2 S- adenosyl-L-methionine = 2-methyladenosine(37) in tRNA + 5'- deoxyadenosine + L-methionine + 2 oxidized [2Fe-2S]-[ferredoxin] + S- adenosyl-L-homocysteine.
45
+ ```
46
+ </details>
47
+
48
+
49
+ <details>
50
+ <summary><b>Protein function prediction</b></summary>
51
+
52
+ - *Analyze the following amino acid sequence, and determine the function of the resulting protein, its subcellular localization, and any biological processes it may be part of:*
53
+ MNGTVNASAPSKMSEVAVERLSNDKALKVIFVLGGPGSGKGTQCAKIAKHFGFTHLSVGDLLRAEINSGSKNGTMIESMINEGKIVRSEVTIKLLQRAMHESGNDKFLIDGFPRNEENRAAFENLEKIEPEFVLFFDCPMEEMERRILNRNQGRDDDKMETIRKRFKVFIESTLPVIEFYNLKGKLYKIDACKPADEVFEDVKAIFSRFRAKEDSSQQTNICTAKRFELVMCLIKRLFREIKRMWSSFFCKAL
54
+
55
+ ```
56
+ The protein characterized by the amino acid sequence demonstrates ATP binding, cytidylate kinase activity, uridylate kinase activity and is implicated in the 'de novo' pyrimidine nucleobase biosynthetic process, phosphorylation, pyrimidine nucleotide biosynthetic process. Its subcellular localization is primarily within the cytoplasm, nucleus.
57
+ ```
58
+ </details>
59
+
60
+
61
+ <details>
62
+ <summary><b>Functional description generation</b></summary>
63
+
64
+ - *Examine the given protein sequence and share a brief overview of its attributes:*
65
+ MKIVLASNNQGKLAELKAMLAPLGVQLLRQAELGIPEAAEPFRTFVENALAKARHASALSGLPALADDAGLCVEAFGGLPGVDTAFYAVQFGYAKGDANNVKALLAQMAGITDRRAALVSTLVAVRSAEDPEPLIACGRVAGEVALEPMGSNGFGFDPVMFIPEFGQTFAQLPVEVKNANSHRGKATRQMMALMRERWIT
66
+
67
+ ```
68
+ A concise description of the protein with the specified amino acid sequence includes: Pyrophosphatase that catalyzes the hydrolysis of nucleoside triphosphates to their monophosphate derivatives, with a high preference for the non-canonical purine nucleotides XTP (xanthosine triphosphate), dITP (deoxyinosine triphosphate) and ITP. Seems to function as a house-cleaning enzyme that removes non-canonical purine nucleotides from the nucleotide pool, thus preventing their incorporation into DNA/RNA and avoiding chromosomal lesions.
69
+ ```
70
+ </details>
71
+
72
+ <details>
73
+ <summary><b>Domain/Motif prediction</b></summary>
74
+
75
+ - *Given this protein sequence, can you identify any common protein motifs or domains that it may contain?*
76
+ MANTKYIFITGGVVSSLGKGIAAASIGALLESRGLSVSLIKVDPYINVDPGTMSPFQHGEVFVTEDGTETDLDLGHYERFVRFKASKKNNFTAGKVYETVIRNERKGNYLGGTVQVIPHITNEIKKRIKKGGQNKDIAIVEVGGTVGDIESQPFVEALRQMALELPNSSWAFVHLTLVPFINASGELKTKPTQHSVKELRSLGISPDVLVCRSEQELPKDEKNKIALFCSVPAKSVISMHDVDTVYSIPILLNKQKVDDTILKKLNLKIKKPNLNDWKRVVKAKLLPEKEVNVSFVGKYTELKDSYKSINEALEHAGIQNKAKVNINFVEAEQITSQNVRKVLKKSDAILVPGGFGERGIEGMILACKYARENNVPYLGICLGMQIAIIEYARNVLKLKSANSTEFDSSTKFPVIGLITEWSDISGKKEKRTKNSDLGGTMRLGGQVCKLKKKSNSYKMYKKSEIIERHRHRYEVNPNYKDKMIEQGLDVVGTSIDGKLVEMIELPSHKWFLACQFHPEFTSNPRDGHPIFNSYIKSTITK
77
+
78
+ ```
79
+ Our predictive analysis of the given protein sequence reveals possible domains or motifs. These include: Glutamine amidotransferase, CTP synthase N-terminal domains.
80
+ ```
81
+ </details>
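
The task examples above all take or produce amino-acid sequences in one-letter code. As a minimal sketch (not part of the Mol-Instructions tooling), the hypothetical helpers below check that a string uses only the 20 standard residue letters and report its residue composition. This can serve as a quick sanity check on a generated sequence before any further analysis.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def invalid_residues(sequence: str) -> set:
    """Return the characters in `sequence` that are not standard amino acids."""
    return set(sequence.upper()) - STANDARD_AMINO_ACIDS


def composition(sequence: str) -> dict:
    """Return the fraction of the sequence made up by each residue."""
    counts = Counter(sequence.upper())
    total = len(sequence)
    return {residue: count / total for residue, count in counts.items()}


# First 20 residues of the protein-design output shown above.
prefix = "MSNELVLSREQVRRVDQRAI"
print(invalid_residues(prefix))   # an empty set means every letter is standard
print(invalid_residues("MKT*X"))  # non-standard symbols are reported
print(composition(prefix))
```

A check like this only validates the alphabet. It says nothing about whether the sequence folds or has the requested function.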
82
+
83
+
84
+ <h3> 📝 Demo</h3>
85
+
86
+ Our [repository](https://github.com/zjunlp/Mol-Instructions/tree/main/demo) provides an example of how to perform generation.
87
+
88
+ For the model fine-tuned on **protein-oriented** instructions, you can recover the model weights we trained with the following command.
89
+
90
+ Please download [llama-7b-hf](https://huggingface.co/decapoda-research/llama-7b-hf/tree/main) to obtain the pre-trained weights of LLaMA-7B, then set `$BASE_MODEL_PATH` to the location where those weights are saved.
91
+
92
+ Then replace `$DIFF_WEIGHT_PATH` with the path to our provided [diff weights](https://huggingface.co/zjunlp/llama-molinst-protein-7b), and replace `$RECOVER_WEIGHT_PATH` with the path where the recovered weights should be saved. If the directory of recovered weights lacks required files (e.g., tokenizer configuration files), you can copy them from `$DIFF_WEIGHT_PATH`.
93
+
94
+ ```shell
95
+ python weight_diff.py recover \
96
+ --path_raw $BASE_MODEL_PATH \
97
+ --path_diff $DIFF_WEIGHT_PATH \
98
+ --path_tuned $RECOVER_WEIGHT_PATH
99
+ ```
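
To make the recovery step less opaque, here is a minimal sketch of the idea behind diff weights. It assumes the diff stores, for each parameter, the tuned value minus the raw base value, so recovery is an element-wise addition. This README does not confirm that convention, and the function names and toy values are purely illustrative. The real script operates on model checkpoints, not Python lists.

```python
def make_diff(raw: dict, tuned: dict) -> dict:
    """Build a diff as tuned minus raw, parameter by parameter (assumed convention)."""
    return {
        name: [t - r for r, t in zip(raw[name], tuned[name])]
        for name in raw
    }


def recover(raw: dict, diff: dict) -> dict:
    """Rebuild tuned weights by adding the diff back onto the raw base weights."""
    if raw.keys() != diff.keys():
        raise ValueError("raw and diff must contain the same parameter names")
    return {
        name: [r + d for r, d in zip(raw[name], diff[name])]
        for name in raw
    }


# Toy stand-ins for real checkpoints: one parameter with three values.
raw = {"layer.weight": [1.0, -2.0, 0.5]}
tuned = {"layer.weight": [1.25, -1.5, 0.5]}

diff = make_diff(raw, tuned)
print(diff)                # {'layer.weight': [0.25, 0.5, 0.0]}
print(recover(raw, diff))  # {'layer.weight': [1.25, -1.5, 0.5]}
```

Under this assumption, the diff is only useful together with the exact base weights it was computed against.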
100
+
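
If the recovered directory turns out to be missing auxiliary files, the copy step mentioned above can be scripted. The sketch below is one possible way to do it. The helper name is hypothetical, and the file names in the usage comment are typical Hugging Face tokenizer files that we assume may be present, not a list taken from this repository.

```python
import shutil
from pathlib import Path


def copy_missing_files(diff_dir, recovered_dir, filenames):
    """Copy each named file from diff_dir to recovered_dir unless it already exists.

    Returns the list of file names that were actually copied.
    """
    diff_dir, recovered_dir = Path(diff_dir), Path(recovered_dir)
    copied = []
    for name in filenames:
        source = diff_dir / name
        target = recovered_dir / name
        # Skip files that are absent in the source or already present in the target.
        if source.is_file() and not target.exists():
            shutil.copy2(source, target)
            copied.append(name)
    return copied


# Example usage (paths and file names are placeholders):
# copy_missing_files(
#     "path/to/diff_weights",
#     "path/to/recovered_weights",
#     ["tokenizer_config.json", "tokenizer.model", "special_tokens_map.json"],
# )
```

Existing files in the recovered directory are never overwritten, so the helper is safe to run more than once.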
101
+ After that, you can execute the following command to generate outputs with the fine-tuned LLaMA model.
102
+
103
+ ```shell
104
+ python generate.py \
105
+ --CLI True \
106
+ --protein True \
107
+ --base_model $RECOVER_WEIGHT_PATH
108
+ ```
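
Each task example in this README pairs an instruction with a protein sequence. The sketch below shows one way such a pair could be assembled into a single prompt string. It uses the prompt template published with Stanford Alpaca as an assumption. This README does not state which template `generate.py` applies, so treat `ALPACA_TEMPLATE` and `build_prompt` as illustrative names and not as the script's actual behavior.

```python
# Assumed template: the instruction-plus-input format published with Stanford Alpaca.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def build_prompt(instruction: str, protein_sequence: str) -> str:
    """Combine an instruction and a protein sequence into one prompt string."""
    return ALPACA_TEMPLATE.format(
        instruction=instruction.strip(),
        input=protein_sequence.strip(),
    )


# Instruction from the domain/motif task above; the input is only the
# first 20 residues of that example, to keep the sketch short.
prompt = build_prompt(
    "Given this protein sequence, can you identify any common protein "
    "motifs or domains that it may contain?",
    "MANTKYIFITGGVVSSLGKG",
)
print(prompt)
```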
109
+
110
+
111
+ <h3> 🚨 Limitations</h3>
112
+
113
+ The current state of the model, obtained via instruction tuning, is a preliminary demonstration. Its capacity to handle real-world, production-grade tasks remains limited.
114
+
115
+ <h3> 📚 References</h3>
116
+ If you use our repository, please cite the following related paper:
117
+
118
+ ```
119
+ @article{molinst,
120
+ title={Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models},
121
+ author={Fang, Yin and Liang, Xiaozhuan and Zhang, Ningyu and Liu, Kangwei and Huang, Rui and Chen, Zhuo and Fan, Xiaohui and Chen, Huajun},
122
+ journal={arXiv preprint arXiv:2306.08018},
123
+ year={2023}
124
+ }
125
+ ```
126
+
127
+ <h3> 🫱🏻‍🫲 Acknowledgements</h3>
128
+
129
+ We appreciate [LLaMA](https://github.com/facebookresearch/llama), [Huggingface Transformers Llama](https://github.com/huggingface/transformers/tree/main/src/transformers/models/llama), [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html), [Alpaca-LoRA](https://github.com/tloen/alpaca-lora), [Chatbot Service](https://github.com/deep-diver/LLM-As-Chatbot) and many other related works for their open-source contributions.