gbyuvd committed on
Commit
025e005
·
verified ·
1 Parent(s): ba054f8

Update model

Files changed (3)
  1. README.md +129 -34
  2. model.safetensors +1 -1
  3. pytorch_model.bin +1 -1
README.md CHANGED
@@ -29,7 +29,7 @@ model-index:
29
  type: fill-mask
30
  name: Fill-Mask
31
  dataset:
32
- name: main-eval-uniform
33
  type: main-eval-uniform
34
  metrics:
35
  - type: perplexity
@@ -42,7 +42,7 @@ model-index:
42
  type: fill-mask
43
  name: Fill-Mask
44
  dataset:
45
- name: main-eval-varied
46
  type: main-eval-varied
47
  metrics:
48
  - type: perplexity
@@ -51,15 +51,41 @@ model-index:
51
  - type: accuracy
52
  value: 0.876
53
  name: MLM Accuracy
54
  license: cc-by-nc-sa-4.0
55
  ---
56
  # ChemFIE Base - A Lightweight Model Pre-trained on Molecular SELFIES
57
 
58
  ChemFIE Base is a lightweight model pre-trained on SELFIES (Self-Referencing Embedded Strings) representations of molecules. It was trained on 2.7M unique, valid molecules taken from COCONUTDB and ChEMBL 34, which yielded 7.3M generated masked examples. With only 11M parameters, it reaches the following results:
59
  - On varied masking:
60
- - Perplexity of 1.4759, MLM Accuracy of 87.60%
61
  - On uniform 15% masking:
62
- - Perplexity of 1.3978, MLM Accuracy of 89.29%
63
 
64
  Pretraining used a dynamic masking strategy, with masking ratios ranging from 15% to 45% depending on a simple score that gauges molecular complexity.
65
 
@@ -137,6 +163,55 @@ mask_filler(text, top_k=5)
137
  """
138
  ```
139
 
140
  ## Background
141
  Three weeks ago, I had an idea to train a sentence transformer on a chemical "language", which, as far as I could find at the time, did not yet exist. While working on it, I came across SELFIES, a wonderful and human-readable new molecular representation developed by the [Aspuru-Guzik group](https://github.com/aspuru-guzik-group/selfies). I found this representation fascinating and worth exploring because of its robustness and because, so far at least, it has proven versatile and easy to train models on. For more information on SELFIES, you can read this [blog post](https://aspuru.substack.com/p/molecular-graph-representations-and) or check out [their GitHub](https://github.com/aspuru-guzik-group/selfies).
142
 
@@ -170,21 +245,21 @@ The dataset combines two sources of molecular data:
170
  - These validation sets were combined into a main test set, totaling 810,108 examples.
171
 
172
  | Dataset | Number of Valid Unique Molecules | Generated Training Examples |
173
- | ---------- | -------------------------------- | --------------------------- |
174
- | Chunk I | 207,727 | 560,859 |
175
- | Chunk II | 207,727 | 560,859 |
176
- | Chunk III | 207,727 | 560,859 |
177
- | Chunk IV | 207,727 | 560,859 |
178
- | Chunk V | 207,727 | 560,859 |
179
- | Chunk VI | 207,727 | 560,859 |
180
- | Chunk VII | 207,727 | 560,859 |
181
- | Chunk VIII | 207,727 | 560,859 |
182
- | Chunk IX | 207,727 | 560,859 |
183
- | Chunk X | 207,727 | 560,859 |
184
- | Chunk XI | 207,727 | 560,859 |
185
- | Chunk XII | 207,727 | 560,859 |
186
- | Chunk XIII | 207,738 | 560,889 |
187
- | Total | 2,700,462 | 7,291,197 |
188
 
189
  ### Training Procedure
190
 
@@ -274,8 +349,10 @@ This methodology aims to create a diverse and challenging dataset for masked lan
274
  #### Training Hyperparameters
275
 
276
  - Batch size = 128
277
- - Num of Epoch = 1
278
- - Total steps on all chunks = 56,966
279
  - Training time on each chunk = 03h:24m / ~205 mins
280
 
281
  I am using Ranger21 optimizer with these settings:
@@ -308,25 +385,44 @@ For more information about Ranger21, you could check out [this repository](https
308
  * Number of test examples: 810,108
309
 
310
  #### Varied Masking Test
 
311
 
312
  | Chunk | Avg Loss | Perplexity | MLM Accuracy |
313
- | ------- | -------- | ---------- | ------------ |
314
- | I-IV | 0.4547 | 1.5758 | 0.851 |
315
- | V-VIII | 0.4224 | 1.5257 | 0.864 |
316
- | IX-XIII | 0.3893 | 1.4759 | 0.876 |
317
 
318
  #### Uniform 15% Masking Test (80%:10%:10%)
319
 
 
 
320
  | Chunk | Avg Loss | Perplexity | MLM Accuracy |
321
- | ----- | -------- | ---------- | ------------ |
322
- | XII | 0.3349 | 1.3978 | 0.8929 |
323
 
324
  ## Interpretability
325
 
 
 
326
  ##### Attention Head Visualization
327
  (coming soon)
328
 
329
- ##### Neural Stacks Visualization
330
  (coming soon)
331
 
332
  ##### Attributions in Determining Masked Tokens
@@ -344,13 +440,12 @@ For more information about Ranger21, you could check out [this repository](https
344
 
345
  ### Compute Infrastructure
346
 
347
- ###### Hardware
348
-
349
- Platform: Paperspace Gradient
350
 
351
- Compute: Free-P5000 (16 GB GPU, 30 GB RAM, 8 vCPU)
 
352
 
353
- ###### Software
354
 
355
  - Python: 3.9.13
356
  - Transformers: 4.42.4
@@ -423,6 +518,6 @@ G Bayu ([email protected])
423
 
424
  This project has been quite a journey for me. I’ve dedicated hours to it, and I would like to keep improving myself, this model, and future projects. However, financial and computational constraints can be challenging.
425
 
426
- If you find my work valuable and would like to support my journey, please consider supporting me [here](ko-fi.com/gbyuvd). Your support will help me cover the costs of computational resources, data acquisition, and further development of this project. Any amount, big or small, is greatly appreciated and will enable me to continue learning and exploring.
427
 
428
  Thank you for checking out this model. I am more than happy to receive any feedback so that I can improve myself and the future models and projects I will be working on.
 
29
  type: fill-mask
30
  name: Fill-Mask
31
  dataset:
32
+ name: main-eval-uniform (Epoch 1)
33
  type: main-eval-uniform
34
  metrics:
35
  - type: perplexity
 
42
  type: fill-mask
43
  name: Fill-Mask
44
  dataset:
45
+ name: main-eval-varied (Epoch 1)
46
  type: main-eval-varied
47
  metrics:
48
  - type: perplexity
 
51
  - type: accuracy
52
  value: 0.876
53
  name: MLM Accuracy
54
+ - task:
55
+ type: fill-mask
56
+ name: Fill-Mask
57
+ dataset:
58
+ name: main-eval-varied (Epoch 2)
59
+ type: main-eval-varied
60
+ metrics:
61
+ - type: perplexity
62
+ value: 1.4029
63
+ name: Perplexity
64
+ - type: accuracy
65
+ value: 0.8883
66
+ name: MLM Accuracy
67
+ - task:
68
+ type: fill-mask
69
+ name: Fill-Mask
70
+ dataset:
71
+ name: main-eval-uniform (Epoch 2)
72
+ type: main-eval-uniform
73
+ metrics:
74
+ - type: perplexity
75
+ value: 1.3276
76
+ name: Perplexity
77
+ - type: accuracy
78
+ value: 0.9055
79
+ name: MLM Accuracy
80
  license: cc-by-nc-sa-4.0
81
  ---
82
  # ChemFIE Base - A Lightweight Model Pre-trained on Molecular SELFIES
83
 
84
  ChemFIE Base is a lightweight model pre-trained on SELFIES (Self-Referencing Embedded Strings) representations of molecules. It was trained on 2.7M unique, valid molecules taken from COCONUTDB and ChEMBL 34, which yielded 7.3M generated masked examples. With only 11M parameters, it reaches the following results:
85
  - On varied masking:
86
+ - Perplexity of 1.4029, MLM Accuracy of 88.83%
87
  - On uniform 15% masking:
88
+ - Perplexity of 1.3276, MLM Accuracy of 90.55%
89
 
90
  Pretraining used a dynamic masking strategy, with masking ratios ranging from 15% to 45% depending on a simple score that gauges molecular complexity.
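As a rough illustration of the sentence above, a complexity score can be mapped linearly onto the 15% to 45% range. The scoring function used for this model is not shown in this section, so `complexity_score` below is a hypothetical stand-in (token count plus a bonus for branch and ring tokens), and the normalisation constant of 200 is likewise an assumption, not the method used in training.

```python
def complexity_score(selfies_sentence: str) -> float:
    """Hypothetical complexity score in [0, 1]; NOT the scoring used for this model."""
    tokens = selfies_sentence.split()
    # Count tokens that open a branch or close a ring as "structural".
    structural = sum(("Branch" in t) or ("Ring" in t) for t in tokens)
    raw = len(tokens) + 2 * structural
    # Assumed normalisation constant; anything at or above 200 saturates at 1.0.
    return min(raw / 200.0, 1.0)


def masking_ratio(score: float, low: float = 0.15, high: float = 0.45) -> float:
    """Linearly interpolate between the lowest and highest masking ratio."""
    clamped = max(0.0, min(score, 1.0))
    return low + (high - low) * clamped


sentence = "[C] [N] [Branch1] [C] [C] [C] [C] [=O]"
print(round(masking_ratio(complexity_score(sentence)), 4))
```

A linear map is only one possible choice; a stepped schedule (for example three fixed ratios) would satisfy the same 15% to 45% description.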
91
 
 
163
  """
164
  ```
165
 
166
+ If you have SMILES instead, you can convert them to SELFIES first.
167
+ First install the selfies library:
168
+
169
+ ```bash
170
+ pip install selfies
171
+ ```
172
+ Then convert using:
173
+
174
+ ```python
175
+ import selfies as sf
176
+
177
+ def smiles_to_selfies_sentence(smiles):
178
+     # Encode SMILES into SELFIES
179
+     try:
180
+         selfies = sf.encoder(smiles)
181
+     except sf.EncoderError as e:
182
+         print(f"Encoder Error: {e}")
183
+         return None  # Encoding failed, so there is nothing to tokenize
184
+
185
+
186
+     # Split SELFIES into individual tokens
187
+     selfies_tokens = list(sf.split_selfies(selfies))
188
+
189
+     # Join each dot with the token that follows it
190
+     joined_tokens = []
191
+     i = 0
192
+     while i < len(selfies_tokens):
193
+         if selfies_tokens[i] == '.' and i + 1 < len(selfies_tokens):
194
+             joined_tokens.append(f".{selfies_tokens[i+1]}")
195
+             i += 2
196
+         else:
197
+             joined_tokens.append(selfies_tokens[i])
198
+             i += 1
199
+
200
+     # Join tokens with a whitespace to form a sentence
201
+     selfies_sentence = ' '.join(joined_tokens)
202
+
203
+     return selfies_sentence
204
+
205
+ # Example usage:
206
+ in_smi = "CN(C)CCC(C1=CC=C(C=C1)Cl)C2=CC=CC=N2.C(=CC(=O)O)C(=O)O" # Chlorphenamine maleate
207
+ selfies_sentence = smiles_to_selfies_sentence(in_smi)
208
+ print(selfies_sentence)
209
+
210
+ """
211
+ [C] [N] [Branch1] [C] [C] [C] [C] [C] [Branch1] [N] [C] [=C] [C] [=C] [Branch1] [Branch1] [C] [=C] [Ring1] [=Branch1] [Cl] [C] [=C] [C] [=C] [C] [=N] [Ring1] [=Branch1] .[C] [=Branch1] [#Branch1] [=C] [C] [=Branch1] [C] [=O] [O] [C] [=Branch1] [C] [=O] [O]
212
+ """
213
+ ```
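Going the other way can be useful when inspecting predictions. The sketch below assumes the whitespace-separated sentence format produced above, where a dot is attached to the token that follows it; `selfies_sentence_to_string` is a hypothetical helper name, and the final decoding step only runs if the `selfies` library is installed.

```python
def selfies_sentence_to_string(sentence: str) -> str:
    """Drop the whitespace between tokens to recover a plain SELFIES string."""
    return "".join(sentence.split())


sentence = "[C] [=C] [Branch1] [C] [O] .[C] [=O]"
selfies_string = selfies_sentence_to_string(sentence)
print(selfies_string)  # [C][=C][Branch1][C][O].[C][=O]

try:
    import selfies as sf

    # sf.decoder turns a SELFIES string back into a SMILES string.
    print(sf.decoder(selfies_string))
except ImportError:
    # The selfies library is optional for this sketch.
    pass
```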
214
+
215
  ## Background
216
  Three weeks ago, I had an idea to train a sentence transformer on a chemical "language", which, as far as I could find at the time, did not yet exist. While working on it, I came across SELFIES, a wonderful and human-readable new molecular representation developed by the [Aspuru-Guzik group](https://github.com/aspuru-guzik-group/selfies). I found this representation fascinating and worth exploring because of its robustness and because, so far at least, it has proven versatile and easy to train models on. For more information on SELFIES, you can read this [blog post](https://aspuru.substack.com/p/molecular-graph-representations-and) or check out [their GitHub](https://github.com/aspuru-guzik-group/selfies).
217
 
 
245
  - These validation sets were combined into a main test set, totaling 810,108 examples.
246
 
247
  | Dataset | Number of Valid Unique Molecules | Generated Training Examples |
248
+ | ---------- | :------------------------------: | :-------------------------: |
249
+ | Chunk I | 207,727 | 560,859 |
250
+ | Chunk II | 207,727 | 560,859 |
251
+ | Chunk III | 207,727 | 560,859 |
252
+ | Chunk IV | 207,727 | 560,859 |
253
+ | Chunk V | 207,727 | 560,859 |
254
+ | Chunk VI | 207,727 | 560,859 |
255
+ | Chunk VII | 207,727 | 560,859 |
256
+ | Chunk VIII | 207,727 | 560,859 |
257
+ | Chunk IX | 207,727 | 560,859 |
258
+ | Chunk X | 207,727 | 560,859 |
259
+ | Chunk XI | 207,727 | 560,859 |
260
+ | Chunk XII | 207,727 | 560,859 |
261
+ | Chunk XIII | 207,738 | 560,889 |
262
+ | Total | 2,700,462 | 7,291,197 |
263
 
264
  ### Training Procedure
265
 
 
349
  #### Training Hyperparameters
350
 
351
  - Batch size = 128
352
+ - Number of epochs:
353
+   - 1 epoch for all chunks
354
+   - a second epoch on selected chunks (these also contain some samples from the chunks excluded due to overfitting tendencies)
355
+ - Total steps on all chunks = 70,619
356
  - Training time on each chunk = 03h:24m / ~205 mins
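The step counts can be sanity-checked from the dataset table and the batch size. The sketch below assumes each chunk is trained separately and that the last partial batch of a chunk is kept; under those assumptions the first epoch comes to 56,966 steps, the figure this README reported before the second epoch was added, which leaves 13,653 of the 70,619 total steps for the second pass.

```python
import math

BATCH_SIZE = 128
# Generated training examples per chunk, taken from the dataset table:
# twelve chunks of 560,859 examples and Chunk XIII with 560,889.
chunk_examples = [560_859] * 12 + [560_889]

# Assumption: one optimizer step per batch, last partial batch kept.
steps_per_chunk = [math.ceil(n / BATCH_SIZE) for n in chunk_examples]
first_epoch_steps = sum(steps_per_chunk)
print(first_epoch_steps)  # 56966

TOTAL_STEPS = 70_619  # reported total after the second epoch
second_epoch_steps = TOTAL_STEPS - first_epoch_steps
print(second_epoch_steps)  # 13653
```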
357
 
358
  I am using Ranger21 optimizer with these settings:
 
385
  * Number of test examples: 810,108
386
 
387
  #### Varied Masking Test
388
+ ##### 1st Epoch
389
 
390
  | Chunk | Avg Loss | Perplexity | MLM Accuracy |
391
+ | ------- | :------: | :--------: | :----------: |
392
+ | I-IV | 0.4547 | 1.5758 | 0.851 |
393
+ | V-VIII | 0.4224 | 1.5257 | 0.864 |
394
+ | IX-XIII | 0.3893 | 1.4759 | 0.876 |
395
+
396
+ ##### 2nd Epoch
397
+
398
+ | Chunk | Avg Loss | Perplexity | MLM Accuracy |
399
+ | ----- | :------: | :--------: | :----------: |
400
+ | I-II | 0.3659 | 1.4418 | 0.8793 |
401
+ | VII | 0.3386 | 1.4029 | 0.8883 |
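The perplexity columns appear to be the exponential of the average loss, which is the usual definition when the loss is a mean cross-entropy in natural-log units. A minimal check against the varied-masking numbers above (values copied from the tables; the tolerance only allows for rounding to four decimals):

```python
import math

# (average loss, reported perplexity) pairs from the varied masking tables
reported = {
    "epoch 1, chunks I-IV": (0.4547, 1.5758),
    "epoch 1, chunks V-VIII": (0.4224, 1.5257),
    "epoch 1, chunks IX-XIII": (0.3893, 1.4759),
    "epoch 2, chunks I-II": (0.3659, 1.4418),
    "epoch 2, chunk VII": (0.3386, 1.4029),
}

for name, (avg_loss, perplexity) in reported.items():
    computed = math.exp(avg_loss)
    # Small differences come from the loss itself being rounded.
    print(f"{name}: exp({avg_loss}) = {computed:.4f} (reported {perplexity})")
```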
402
 
403
  #### Uniform 15% Masking Test (80%:10%:10%)
404
 
405
+ ##### 1st Epoch
406
+
407
  | Chunk | Avg Loss | Perplexity | MLM Accuracy |
408
+ | ----- | :------: | :--------: | :----------: |
409
+ | XII | 0.3349 | 1.3978 | 0.8929 |
410
+
411
+ ##### 2nd Epoch
412
+
413
+ | Chunk | Avg Loss | Perplexity | MLM Accuracy |
414
+ | ----- | :------: | :--------: | :----------: |
415
+ | | 0.2834 | 1.3276 | 0.9055 |
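The "80%:10%:10%" in the heading presumably refers to the BERT-style replacement rule: of the 15% of tokens selected for prediction, 80% are replaced by the mask token, 10% by a random token, and 10% are left unchanged. The sketch below illustrates that rule under this assumption; the function name, the `[MASK]` token string, and the toy vocabulary are illustrative and are not taken from the evaluation code.

```python
import random


def bert_style_mask(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # token not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token         # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels


tokens = "[C] [N] [Branch1] [C] [C] [C] [C] [=O]".split()
toy_vocab = ["[C]", "[N]", "[O]", "[=O]", "[Branch1]", "[Ring1]"]
masked, labels = bert_style_mask(tokens, toy_vocab, seed=0)
print(masked)
print(labels)
```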
416
+
417
 
418
  ## Interpretability
419
 
420
+ The visualizations below use acetylcholine as an example, with its positively charged nitrogen (*[N+1]*) masked.
421
+
422
  ##### Attention Head Visualization
423
  (coming soon)
424
 
425
+ ##### Neuron Views
426
  (coming soon)
427
 
428
  ##### Attributions in Determining Masked Tokens
 
440
 
441
  ### Compute Infrastructure
442
 
443
+ #### Hardware
444
 
445
+ - Platform: Paperspace Gradient
446
+ - Compute: Free-P5000 (16 GB GPU, 30 GB RAM, 8 vCPU)
447
 
448
+ #### Software
449
 
450
  - Python: 3.9.13
451
  - Transformers: 4.42.4
 
518
 
519
  This project has been quite a journey for me. I’ve dedicated hours to it, and I would like to keep improving myself, this model, and future projects. However, financial and computational constraints can be challenging.
520
 
521
+ If you find my work valuable and would like to support my journey, please consider supporting me [here](https://ko-fi.com/gbyuvd). Your support will help me cover the costs of computational resources, data acquisition, and further development of this project. Any amount, big or small, is greatly appreciated and will enable me to continue learning and exploring.
522
 
523
  Thank you for checking out this model. I am more than happy to receive any feedback so that I can improve myself and the future models and projects I will be working on.
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:c5e990deed6f0300d37322ca0843e6ad21e629b70f5c08ce9396fcc0c73d0c1b
3
  size 44518452
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:eed21113483b940971724817c724cbbbbe38aa35dc336e0527b218ef412639ac
3
  size 44518452
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:e91f4170a40da86c836af0fab4215ae3242f4aa05ab07437cfa4ef33f9bdb7f1
3
  size 44557810
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7ecddd24834ca750b4bbd01452bee86627548c56c9bdeb7632f7e46ab21be2a6
3
  size 44557810