End of training
- README.md +15 -14
- benchmarks.shelve.bak +1 -0
- benchmarks.shelve.dat +0 -0
- benchmarks.shelve.dir +1 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_shuffle=True, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727735242.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_shuffle=True, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee +3 -0
- tokenizer.json +2 -14
README.md
CHANGED
@@ -83,16 +83,17 @@ LlamaForCausalLM(
 - student 5: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8`
 - student 6: `dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8`
 - student 7: `dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, learning_rate=6e-05, per_device_train_batch_size=8`
-
-
-
-
-
-| tinyGSM8k.exact_match,
-
-
-
-
+- student 8: `dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_shuffle=True, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8`
+
+| Metric | teacher | student 0 | student 1 | student 2 | student 3 | student 4 | student 5 | student 6 | student 7 | student 8 |
+| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+| tinyArc.acc_norm,none | 0.37 | 0.303 | 0.295 | 0.302 | 0.26 | 0.269 | **0.319** | 0.286 | 0.299 | 0.316 |
+| tinyGSM8k.exact_match,flexible-extract | 0.006 | 0.029 | **0.03** | 0.025 | 0.006 | 0.006 | 0.012 | 0.012 | 0.017 | 0.006 |
+| tinyGSM8k.exact_match,strict-match | 0.006 | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** |
+| tinyHellaswag.acc_norm,none | 0.452 | 0.341 | 0.281 | 0.327 | 0.3 | 0.303 | 0.301 | **0.364** | 0.356 | 0.348 |
+| tinyMMLU.acc_norm,none | 0.341 | 0.276 | 0.281 | 0.31 | 0.286 | 0.279 | 0.292 | 0.295 | **0.328** | 0.311 |
+| tinyTruthfulQA.acc,none | 0.38 | **0.463** | 0.447 | 0.423 | 0.419 | 0.421 | 0.427 | 0.44 | 0.436 | 0.433 |
+| tinyWinogrande.acc_norm,none | 0.509 | 0.466 | 0.436 | 0.46 | 0.492 | 0.473 | 0.417 | 0.439 | 0.482 | **0.503** |
 
 # Resource Usage
 
@@ -155,7 +156,7 @@ LlamaForCausalLM(
 <br/>
 
 # Train Dataset
-Trained on 1,
+Trained on 1,911,742,377 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.
 
 - Num Samples: `3,992,000`
 - Subset: `20231101.en`
@@ -185,7 +186,7 @@ The following hyperparameters were used during training:
 <details>
 <summary>Expand</summary>
 
-- learning_rate: `
+- learning_rate: `0.0001`
 - train_batch_size: `8`
 - eval_batch_size: `4`
 - seed: `42`
@@ -205,7 +206,7 @@ The following hyperparameters were used during training:
 weight=0
 )
 )`
-- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at
+- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7f3e58c6d840>`
 - student_model_name_or_path: `None`
 - student_config_name_or_path: `None`
 - student_model_config: `{'num_hidden_layers': 15}`
@@ -223,7 +224,7 @@ The following hyperparameters were used during training:
 - dataset_sample_size: `4000000`
 - dataset_max_seq_length: `1024`
 - dataset_test_size: `0.002`
-- dataset_shuffle: `
+- dataset_shuffle: `True`
 - dataset_shuffle_seed: `42`
 - dataset_trust_remote_code: `False`
 - gradient_accumulation_steps: `1`
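The metric table added in this commit can also be consumed programmatically. A minimal sketch (not part of the repo) that parses markdown rows like the ones above and reports the best-scoring student per metric, shown here on two rows copied from the diff:

```python
# Find the best-scoring student per metric from the README's markdown table.
# Two rows copied from the diff; parsing works the same for the full table.
table = """\
| Metric | teacher | student 0 | student 1 | student 2 | student 3 | student 4 | student 5 | student 6 | student 7 | student 8 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| tinyArc.acc_norm,none | 0.37 | 0.303 | 0.295 | 0.302 | 0.26 | 0.269 | **0.319** | 0.286 | 0.299 | 0.316 |
| tinyWinogrande.acc_norm,none | 0.509 | 0.466 | 0.436 | 0.46 | 0.492 | 0.473 | 0.417 | 0.439 | 0.482 | **0.503** |
"""

def best_students(md):
    rows = [line.strip().strip("|").split("|") for line in md.splitlines()]
    header = [c.strip() for c in rows[0]]
    best = {}
    for row in rows[2:]:  # skip the header and the :--- separator row
        cells = [c.strip().strip("*") for c in row]  # strip() then bold markers
        # compare student columns only (index 2+), skipping the teacher column
        scores = {header[i]: float(cells[i]) for i in range(2, len(cells))}
        best[cells[0]] = max(scores, key=scores.get)
    return best

print(best_students(table))
```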
benchmarks.shelve.bak
CHANGED
@@ -7,3 +7,4 @@
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8', (3072, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8', (3584, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, learning_rate=6e-05, per_device_train_batch_size=8', (4096, 448)
+'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_shuffle=True, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8', (4608, 448)
benchmarks.shelve.dat
CHANGED
Binary files a/benchmarks.shelve.dat and b/benchmarks.shelve.dat differ
benchmarks.shelve.dir
CHANGED
@@ -7,3 +7,4 @@
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8', (3072, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8', (3584, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, learning_rate=6e-05, per_device_train_batch_size=8', (4096, 448)
+'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_shuffle=True, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8', (4608, 448)
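The three benchmarks.shelve.* files are the artifact trio produced by Python's `shelve` over the `dbm.dumb` backend: `.dat` holds the pickled payloads, `.dir` is a plain-text index of `'key', (pos, len)` lines exactly like the entries in the diff above, and `.bak` is the previous index kept when the index is rewritten. A small sketch (file and key names are illustrative) reproducing the format:

```python
import dbm.dumb
import os
import shelve
import tempfile

# Create the .dat/.dir pair in a scratch directory; dbm.dumb writes the same
# "'key', (pos, len)" index lines seen in benchmarks.shelve.dir above.
workdir = tempfile.mkdtemp()
base = os.path.join(workdir, "benchmarks.shelve")

db = dbm.dumb.open(base, "c")        # creates benchmarks.shelve.dat
with shelve.Shelf(db) as s:          # shelve pickles values into the .dat file
    s["logs/example_run"] = {"loss": 1.23}
# closing the shelf commits the index, writing benchmarks.shelve.dir

with open(base + ".dir") as f:
    index = f.read()
print(index)  # "'logs/example_run', (0, NN)" where NN is the pickled size
```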
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a60ae992017f68e492099d97c5925959bf021439cc8e10a31be25d279025ae17
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f908251b449e45e23df7fbb1d827c7a28cfbdab845a54b62b1a6f9ed82de4c37
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e12c6273d5f1ffd636dbfd715d763125805bcab8c34e08fe97835a6a59b905bf
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5e78094cbfc3cb81e219a39bf7842fa9a12c419e2d5510ccb9f5821e69632493
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6f45860a5508f603956084be9040c0d098fd04f9d9778db3e6250acdc4f6fb51
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:433f05ef7e78f50a3332fa1240c56434d480f0fd4ef5b69fcca24699cd7e70cc
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_shuffle=True, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727735242.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e4ecbbe898b1f47d3461713bec363a0f4651ef6b74a1b294e3116122af80b3f2
+size 529

logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_shuffle=True, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f4914652ee4cb2a79e9d90a10b3eed17cdcc399fbcdbd082f1f9f80ad6a61a61
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3c281abfddc59121368570b089bce764909315fef92d13d1e1a1b7c0b5da2880
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727735788.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:48e032821c110512dccc315beca1395aceffa3fde737f490d39b787c960d015b
+size 562
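Each of the ADDED log files above is tracked through Git LFS, so what lands in the repo is a three-line pointer (`version`/`oid`/`size`) rather than the TensorBoard event file itself; the real payload is addressed by its sha256 and fetched separately by git-lfs. A minimal sketch parsing one of the pointers copied from the diff:

```python
# Parse a Git LFS pointer file: three "key value" lines.
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:a60ae992017f68e492099d97c5925959bf021439cc8e10a31be25d279025ae17
size 562
"""

def parse_lfs_pointer(text):
    # split each line on the first space into key/value pairs
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "algo": algo,
            "digest": digest, "size": int(fields["size"])}

info = parse_lfs_pointer(pointer)
print(info["algo"], info["size"])  # sha256 562
```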
tokenizer.json
CHANGED
@@ -1,19 +1,7 @@
 {
   "version": "1.0",
-  "truncation": {
-
-    "max_length": 1023,
-    "strategy": "LongestFirst",
-    "stride": 0
-  },
-  "padding": {
-    "strategy": "BatchLongest",
-    "direction": "Right",
-    "pad_to_multiple_of": null,
-    "pad_id": 0,
-    "pad_type_id": 0,
-    "pad_token": "<|endoftext|>"
-  },
+  "truncation": null,
+  "padding": null,
   "added_tokens": [
     {
       "id": 0,
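The tokenizer.json change above nulls out the baked-in truncation (max_length 1023) and padding, so neither is applied by default when the tokenizer is loaded. With the `tokenizers` library loaded, the equivalent runtime calls would be `Tokenizer.no_truncation()` and `Tokenizer.no_padding()`; the file-level edit itself can be sketched with the stdlib (config literal abridged, not the full tokenizer.json):

```python
import json

# Abridged stand-in for tokenizer.json as it looked before this commit.
config = {
    "version": "1.0",
    "truncation": {"max_length": 1023, "strategy": "LongestFirst", "stride": 0},
    "padding": {"strategy": "BatchLongest", "direction": "Right",
                "pad_to_multiple_of": None, "pad_id": 0, "pad_type_id": 0,
                "pad_token": "<|endoftext|>"},
    "added_tokens": [{"id": 0}],
}

# The commit's change: drop the default truncation/padding behavior.
config["truncation"] = None
config["padding"] = None

patched = json.dumps(config, indent=2)
print(patched.splitlines()[2])  # '  "truncation": null,'
```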