lapp0 committed on
Commit 9269478 · verified · 1 Parent(s): f2f53b4

End of training

Files changed (15)
  1. README.md +15 -14
  2. benchmarks.shelve.bak +1 -0
  3. benchmarks.shelve.dat +0 -0
  4. benchmarks.shelve.dir +1 -0
  5. logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee +3 -0
  6. logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee +3 -0
  7. logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee +3 -0
  8. logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee +3 -0
  9. logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee +3 -0
  10. logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee +3 -0
  11. logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_shuffle=True, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727735242.1c1a426a2fee +3 -0
  12. logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_shuffle=True, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee +3 -0
  13. logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee +3 -0
  14. logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee +3 -0
  15. tokenizer.json +2 -14
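The log directory names listed above encode each run's settings as comma-separated `key=value` pairs. A minimal sketch of turning such a name into a dictionary, assuming no value contains a comma followed by a space (which holds for the names in this commit). `parse_run_name` is a hypothetical helper written for illustration, not taken from the training code:

```python
def parse_run_name(name: str) -> dict[str, str]:
    """Split 'k1=v1, k2=v2' into {'k1': 'v1', 'k2': 'v2'}; values stay strings."""
    pairs = (item.split("=", 1) for item in name.split(", "))
    return {key: value for key, value in pairs}


# Directory name of the run added in this commit (copied from the file list).
run = (
    "dataset_max_seq_length=1024, dataset_sample_size=4000000, "
    "dataset_shuffle=True, dataset_subset=20231101.en, "
    "dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8"
)
config = parse_run_name(run)
print(config["dataset_uri"], config["dataset_shuffle"])
```

Values are left as strings on purpose; converting `"True"` or `"1024"` to typed values would need knowledge of each field that the directory name alone does not provide.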
README.md CHANGED
@@ -83,16 +83,17 @@ LlamaForCausalLM(
  - student 5: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8`
  - student 6: `dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8`
  - student 7: `dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, learning_rate=6e-05, per_device_train_batch_size=8`
-
- | Metric | teacher | student 0 | student 1 | student 2 | student 3 | student 4 | student 5 | student 6 | student 7 |
- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
- | tinyArc.acc_norm,none | 0.37 | 0.303 | 0.295 | 0.302 | 0.26 | 0.269 | **0.319** | 0.286 | 0.299 |
- | tinyGSM8k.exact_match,flexible-extract | 0.006 | 0.029 | **0.03** | 0.025 | 0.006 | 0.006 | 0.012 | 0.012 | 0.017 |
- | tinyGSM8k.exact_match,strict-match | 0.006 | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** |
- | tinyHellaswag.acc_norm,none | 0.452 | 0.341 | 0.281 | 0.327 | 0.3 | 0.303 | 0.301 | **0.364** | 0.356 |
- | tinyMMLU.acc_norm,none | 0.341 | 0.276 | 0.281 | 0.31 | 0.286 | 0.279 | 0.292 | 0.295 | **0.328** |
- | tinyTruthfulQA.acc,none | 0.38 | **0.463** | 0.447 | 0.423 | 0.419 | 0.421 | 0.427 | 0.44 | 0.436 |
- | tinyWinogrande.acc_norm,none | 0.509 | 0.466 | 0.436 | 0.46 | **0.492** | 0.473 | 0.417 | 0.439 | 0.482 |
+ - student 8: `dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_shuffle=True, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8`
+
+ | Metric | teacher | student 0 | student 1 | student 2 | student 3 | student 4 | student 5 | student 6 | student 7 | student 8 |
+ | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+ | tinyArc.acc_norm,none | 0.37 | 0.303 | 0.295 | 0.302 | 0.26 | 0.269 | **0.319** | 0.286 | 0.299 | 0.316 |
+ | tinyGSM8k.exact_match,flexible-extract | 0.006 | 0.029 | **0.03** | 0.025 | 0.006 | 0.006 | 0.012 | 0.012 | 0.017 | 0.006 |
+ | tinyGSM8k.exact_match,strict-match | 0.006 | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** |
+ | tinyHellaswag.acc_norm,none | 0.452 | 0.341 | 0.281 | 0.327 | 0.3 | 0.303 | 0.301 | **0.364** | 0.356 | 0.348 |
+ | tinyMMLU.acc_norm,none | 0.341 | 0.276 | 0.281 | 0.31 | 0.286 | 0.279 | 0.292 | 0.295 | **0.328** | 0.311 |
+ | tinyTruthfulQA.acc,none | 0.38 | **0.463** | 0.447 | 0.423 | 0.419 | 0.421 | 0.427 | 0.44 | 0.436 | 0.433 |
+ | tinyWinogrande.acc_norm,none | 0.509 | 0.466 | 0.436 | 0.46 | 0.492 | 0.473 | 0.417 | 0.439 | 0.482 | **0.503** |
 
  # Resource Usage
 
@@ -155,7 +156,7 @@ LlamaForCausalLM(
  <br/>
 
  # Train Dataset
- Trained on 1,857,304,596 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.
+ Trained on 1,911,742,377 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.
 
  - Num Samples: `3,992,000`
  - Subset: `20231101.en`
@@ -185,7 +186,7 @@ The following hyperparameters were used during training:
  <details>
  <summary>Expand</summary>
 
- - learning_rate: `6e-05`
+ - learning_rate: `0.0001`
  - train_batch_size: `8`
  - eval_batch_size: `4`
  - seed: `42`
@@ -205,7 +206,7 @@ The following hyperparameters were used during training:
  weight=0
  )
  )`
- - lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7d28e0fc5450>`
+ - lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7f3e58c6d840>`
  - student_model_name_or_path: `None`
  - student_config_name_or_path: `None`
  - student_model_config: `{'num_hidden_layers': 15}`
@@ -223,7 +224,7 @@ The following hyperparameters were used during training:
  - dataset_sample_size: `4000000`
  - dataset_max_seq_length: `1024`
  - dataset_test_size: `0.002`
- - dataset_shuffle: `False`
+ - dataset_shuffle: `True`
  - dataset_shuffle_seed: `42`
  - dataset_trust_remote_code: `False`
  - gradient_accumulation_steps: `1`
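Going by the listed settings, student 8 differs from student 6 only in `dataset_shuffle=True`, so those two columns of the benchmark table can be compared directly. A small sketch with the values copied from the updated table; the dictionary layout and variable names are illustrative, and the script reports raw differences only, without any claim about statistical significance:

```python
# (student 6, student 8) scores copied from the updated README table.
scores = {
    "tinyArc.acc_norm,none": (0.286, 0.316),
    "tinyGSM8k.exact_match,flexible-extract": (0.012, 0.006),
    "tinyGSM8k.exact_match,strict-match": (0.006, 0.006),
    "tinyHellaswag.acc_norm,none": (0.364, 0.348),
    "tinyMMLU.acc_norm,none": (0.295, 0.311),
    "tinyTruthfulQA.acc,none": (0.44, 0.433),
    "tinyWinogrande.acc_norm,none": (0.439, 0.503),
}

# Positive delta means the shuffled run (student 8) scored higher.
deltas = {metric: round(s8 - s6, 3) for metric, (s6, s8) in scores.items()}
improved = [metric for metric, delta in deltas.items() if delta > 0]

for metric, delta in deltas.items():
    print(f"{metric:45s} {delta:+.3f}")
```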
benchmarks.shelve.bak CHANGED
@@ -7,3 +7,4 @@
  'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8', (3072, 448)
  'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8', (3584, 448)
  'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, learning_rate=6e-05, per_device_train_batch_size=8', (4096, 448)
+ 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_shuffle=True, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8', (4608, 448)
benchmarks.shelve.dat CHANGED
Binary files a/benchmarks.shelve.dat and b/benchmarks.shelve.dat differ
 
benchmarks.shelve.dir CHANGED
@@ -7,3 +7,4 @@
  'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8', (3072, 448)
  'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8', (3584, 448)
  'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, learning_rate=6e-05, per_device_train_batch_size=8', (4096, 448)
+ 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_shuffle=True, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8', (4608, 448)
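Each entry in `benchmarks.shelve.dir` and `benchmarks.shelve.bak` is a quoted key followed by an `(offset, size)` pair. That matches the directory format written by Python's `dbm.dumb` backend, which `shelve` falls back to when no other dbm module is available. Assuming the files were produced that way, a single entry can be read with `ast.literal_eval`, as in this sketch using the line added by this commit:

```python
import ast

# The entry added to the .dir/.bak files in this commit.
line = (
    "'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, "
    "dataset_sample_size=4000000, dataset_shuffle=True, "
    "dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, "
    "per_device_train_batch_size=8', (4608, 448)"
)

# The line is a Python literal: a tuple of (key, (offset, size)).
key, (offset, size) = ast.literal_eval(line)

# The pickled value for this key would occupy bytes [offset, offset + size)
# of the companion .dat file.
print(key.rsplit("/", 1)[-1])
print(offset, size, offset % 512 == 0)
```

The offsets in this commit (3072, 3584, 4096, 4608) are all multiples of 512, which is consistent with the block alignment `dbm.dumb` applies when appending new values.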
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a60ae992017f68e492099d97c5925959bf021439cc8e10a31be25d279025ae17
+ size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f908251b449e45e23df7fbb1d827c7a28cfbdab845a54b62b1a6f9ed82de4c37
+ size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e12c6273d5f1ffd636dbfd715d763125805bcab8c34e08fe97835a6a59b905bf
+ size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5e78094cbfc3cb81e219a39bf7842fa9a12c419e2d5510ccb9f5821e69632493
+ size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6f45860a5508f603956084be9040c0d098fd04f9d9778db3e6250acdc4f6fb51
+ size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:433f05ef7e78f50a3332fa1240c56434d480f0fd4ef5b69fcca24699cd7e70cc
+ size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_shuffle=True, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727735242.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e4ecbbe898b1f47d3461713bec363a0f4651ef6b74a1b294e3116122af80b3f2
+ size 529
logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_shuffle=True, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f4914652ee4cb2a79e9d90a10b3eed17cdcc399fbcdbd082f1f9f80ad6a61a61
+ size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3c281abfddc59121368570b089bce764909315fef92d13d1e1a1b7c0b5da2880
+ size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:48e032821c110512dccc315beca1395aceffa3fde737f490d39b787c960d015b
+ size 562
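The TensorBoard event files above are stored as Git LFS pointers: three `key value` lines giving the spec version, the SHA-256 object id, and the size in bytes of the real file. A minimal sketch of reading one such pointer, using the last one in this commit; `parse_lfs_pointer` is a hypothetical helper and does no validation beyond splitting each line at its first space:

```python
def parse_lfs_pointer(text: str) -> dict[str, str]:
    """Return the 'key value' lines of a Git LFS pointer file as a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:48e032821c110512dccc315beca1395aceffa3fde737f490d39b787c960d015b
size 562
"""

info = parse_lfs_pointer(pointer)
algo, _, digest = info["oid"].partition(":")
size_bytes = int(info["size"])
print(algo, digest[:12], size_bytes)
```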
tokenizer.json CHANGED
@@ -1,19 +1,7 @@
  {
  "version": "1.0",
- "truncation": {
- "direction": "Right",
- "max_length": 1023,
- "strategy": "LongestFirst",
- "stride": 0
- },
- "padding": {
- "strategy": "BatchLongest",
- "direction": "Right",
- "pad_to_multiple_of": null,
- "pad_id": 0,
- "pad_type_id": 0,
- "pad_token": "<|endoftext|>"
- },
+ "truncation": null,
+ "padding": null,
  "added_tokens": [
  {
  "id": 0,
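The `tokenizer.json` hunk replaces the serialized truncation and padding settings with `null`, so a tokenizer loaded from this file no longer carries the 1023-token truncation or batch-longest padding. A minimal sketch of the same change applied to a parsed JSON document with the standard library only. The `clear_runtime_settings` helper and the trimmed input (with an empty `added_tokens` list) are illustrative assumptions, not how the commit was produced:

```python
import json

# Trimmed stand-in for the old tokenizer.json; only the fields from the diff.
old = json.loads("""
{
  "version": "1.0",
  "truncation": {"direction": "Right", "max_length": 1023,
                 "strategy": "LongestFirst", "stride": 0},
  "padding": {"strategy": "BatchLongest", "direction": "Right",
              "pad_to_multiple_of": null, "pad_id": 0,
              "pad_type_id": 0, "pad_token": "<|endoftext|>"},
  "added_tokens": []
}
""")


def clear_runtime_settings(doc: dict) -> dict:
    """Return a shallow copy with truncation and padding set to None (JSON null)."""
    new_doc = dict(doc)
    new_doc["truncation"] = None
    new_doc["padding"] = None
    return new_doc


new = clear_runtime_settings(old)
serialized = json.dumps(new, indent=2)
print(serialized)
```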