End of training
Files changed:

- README.md +44 -21
- benchmarks.shelve.bak +1 -0
- benchmarks.shelve.dat +0 -0 (binary)
- benchmarks.shelve.dir +1 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727459787.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee +3 -0
- tokenizer.json +2 -14
README.md
CHANGED
@@ -1,7 +1,7 @@
 ---
 base_model: HuggingFaceTB/SmolLM-135M
 datasets:
--
+- wikimedia/wikipedia
 library_name: Distily
 license: creativeml-openrail-m
 tags:
@@ -18,7 +18,7 @@ model-index:
 
 Distilled with [Distily](https://github.com/lapp0/distily) library
 using teacher model [HuggingFaceTB/SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M)
-on dataset [
+on dataset [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia).
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -81,20 +81,21 @@ LlamaForCausalLM(
 - student 3: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8`
 - student 4: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8`
 - student 5: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8`
-[previous metrics table for teacher and students 0-5; these lines are truncated in the extracted diff]
+- student 6: `dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8`
+
+| Metric | teacher | student 0 | student 1 | student 2 | student 3 | student 4 | student 5 | student 6 |
+| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+| tinyArc.acc_norm,none | 0.37 | 0.303 | 0.295 | 0.302 | 0.26 | 0.269 | **0.319** | 0.286 |
+| tinyGSM8k.exact_match,flexible-extract | 0.006 | 0.029 | **0.03** | 0.025 | 0.006 | 0.006 | 0.012 | 0.012 |
+| tinyGSM8k.exact_match,strict-match | 0.006 | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** |
+| tinyHellaswag.acc_norm,none | 0.452 | 0.341 | 0.281 | 0.327 | 0.3 | 0.303 | 0.301 | **0.364** |
+| tinyMMLU.acc_norm,none | 0.341 | 0.276 | 0.281 | **0.31** | 0.286 | 0.279 | 0.292 | 0.295 |
+| tinyTruthfulQA.acc,none | 0.38 | **0.463** | 0.447 | 0.423 | 0.419 | 0.421 | 0.427 | 0.44 |
+| tinyWinogrande.acc_norm,none | 0.509 | 0.466 | 0.436 | 0.46 | **0.492** | 0.473 | 0.417 | 0.439 |
 
 # Resource Usage
 
-- Max Train VRAM Use: 13.
+- Max Train VRAM Use: 13.1269 GB
 - Available VRAM: 23.4329 GB
 - GPUs:
 - 1x NVIDIA GeForce RTX 4090
@@ -124,6 +125,28 @@ LlamaForCausalLM(
 (self_attn): LlamaSdpaAttention(
 (q_proj): Linear(in_features=576, out_features=576, bias=False)
 (k_proj): Linear(in_features=576, out_features=192, bias=False)
+@@ -10,17 +10,16 @@
+ (o_proj): Linear(in_features=576, out_features=576, bias=False)
+ (rotary_emb): LlamaRotaryEmbedding()
+ )
+- (mlp): LlamaMLP(
++ (mlp): LigerSwiGLUMLP(
+ (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
+ (up_proj): Linear(in_features=576, out_features=1536, bias=False)
+ (down_proj): Linear(in_features=1536, out_features=576, bias=False)
+- (act_fn): SiLU()
+ )
+- (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
+- (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
++ (input_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
++ (post_attention_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
+ )
+ )
+- (norm): LlamaRMSNorm((576,), eps=1e-05)
++ (norm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
+ (rotary_emb): LlamaRotaryEmbedding()
+ )
+ (lm_head): Linear(in_features=576, out_features=49152, bias=False)
 
 ```
 
@@ -131,10 +154,10 @@ LlamaForCausalLM(
 <br/>
 
 # Train Dataset
-Trained on
+Trained on 1,857,293,914 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.
 
-- Num Samples: `
-- Subset: `
+- Num Samples: `3,992,000`
+- Subset: `20231101.en`
 - Split: `train`
 
 
@@ -161,7 +184,7 @@ The following hyperparameters were used during training:
 <details>
 <summary>Expand</summary>
 
-- learning_rate: `
+- learning_rate: `0.0001`
 - train_batch_size: `8`
 - eval_batch_size: `4`
 - seed: `42`
@@ -181,7 +204,7 @@ The following hyperparameters were used during training:
 weight=0
 )
 )`
-- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at
+- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x766de39d92d0>`
 - student_model_name_or_path: `None`
 - student_config_name_or_path: `None`
 - student_model_config: `{'num_hidden_layers': 15}`
@@ -192,11 +215,11 @@ The following hyperparameters were used during training:
 - teacher_model_name_or_path: `HuggingFaceTB/SmolLM-135M`
 - teacher_load_in_8bit: `False`
 - teacher_load_in_4bit: `False`
-- dataset_uri: `
-- dataset_subset: `
+- dataset_uri: `wikimedia/wikipedia`
+- dataset_subset: `20231101.en`
 - dataset_split: `train`
 - dataset_column_name: `text`
-- dataset_sample_size: `
+- dataset_sample_size: `4000000`
 - dataset_max_seq_length: `1024`
 - dataset_test_size: `0.002`
 - dataset_shuffle: `False`
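The module diff embedded in the updated card swaps the student's `LlamaMLP` and `LlamaRMSNorm` modules for Liger Kernel's `LigerSwiGLUMLP` and `LigerRMSNorm`. A minimal sketch of how that substitution is typically done with the `liger-kernel` package; Distily may wire this up differently, so treat it as illustrative rather than the library's actual code path:

```python
# Hypothetical sketch: patch the transformers Llama classes with Liger Kernel
# before building the student, so the MLP and norm modules print as
# LigerSwiGLUMLP / LigerRMSNorm, matching the module diff above.
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_llama

# Monkey-patches modeling_llama in place; must run before from_pretrained.
apply_liger_kernel_to_llama(rms_norm=True, swiglu=True)

student = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM-135M",  # teacher checkpoint as the starting config
    num_hidden_layers=15,         # matches student_model_config in the card
)
print(student)  # norms and MLPs now show as Liger modules
```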
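The metric table in the card reports tinyBenchmarks results (tinyArc, tinyGSM8k, and so on), and the metric names match lm-evaluation-harness output. A sketch of re-running them against a checkpoint; the harness version, task names, and batch size here are assumptions, since the commit does not record the exact evaluation command:

```python
# Hypothetical sketch: score a checkpoint on the tinyBenchmarks tasks that
# appear in the card's metric table, via lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=HuggingFaceTB/SmolLM-135M",  # or a student checkpoint
    tasks=["tinyArc", "tinyGSM8k", "tinyHellaswag",
           "tinyMMLU", "tinyTruthfulQA", "tinyWinogrande"],
    batch_size=4,  # assumed; the card lists eval_batch_size: 4
)
for task, metrics in results["results"].items():
    print(task, metrics)
```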
benchmarks.shelve.bak
CHANGED
@@ -5,3 +5,4 @@
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8', (2048, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8', (2560, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8', (3072, 448)
+'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8', (3584, 448)
benchmarks.shelve.dat
CHANGED
Binary files a/benchmarks.shelve.dat and b/benchmarks.shelve.dat differ
benchmarks.shelve.dir
CHANGED
@@ -5,3 +5,4 @@
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8', (2048, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8', (2560, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8', (3072, 448)
+'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8', (3584, 448)
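The `.bak`/`.dat`/`.dir` triple is the on-disk layout Python's `shelve` module produces when it falls back to the `dbm.dumb` backend, so the updated entries can be inspected directly. A minimal sketch (the meaning of the stored tuples such as `(3584, 448)` is not documented in this commit):

```python
# Read the benchmark shelve back; shelve.open("benchmarks.shelve") picks up
# the .dat/.dir/.bak files written by the dbm.dumb backend.
import shelve

with shelve.open("benchmarks.shelve") as db:
    for key in db:
        print(key, "->", db[key])
```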
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ecb25697ef7473b7b8b5667c93f6875e0df7d2d8d51da0be13197fd00326eb29
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c2fbbc999fdcdf1468846bc314f79a1c8adccdf843e07328638a3735b3a093cd
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0ca1b09a384e70ad76ad81e675af99c09aaa0191fd62338bf082102f390bce91
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ad1ce823ddaf3c3e06a9a6b7c1dbf7ec6d4f98433cc0de7b677cddc18e3ad5e2
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8b8e4a1d6b2bc49849d6254e186ae8f0645b868c943c576edebfd67c9efdd853
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6f626ce7e1138affbdf03f1c7c32a4f9935b13ee7d190d3c277b04761c26625b
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727459787.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9db57ae95333d43e007d253ff414f8364bd7568da1de63554ee5e1dcbd09f338
+size 529

logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a065d3f14191df20143698661ca8115aecd472b34caf6ef7848f74d8000a707f
+size 562
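Each added log file is a Git LFS pointer (`version`/`oid`/`size`) to a TensorBoard event file, so after `git lfs pull` the logged scalars can be read back. A minimal sketch using TensorBoard's `EventAccumulator`:

```python
# Load one run directory's events.out.tfevents.* file and dump its scalars.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

run_dir = ("logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, "
           "dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, "
           "per_device_train_batch_size=8")
acc = EventAccumulator(run_dir)
acc.Reload()  # parses every event file found under run_dir
for tag in acc.Tags()["scalars"]:
    print(tag, [(e.step, e.value) for e in acc.Scalars(tag)])
```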
tokenizer.json
CHANGED
@@ -1,19 +1,7 @@
 {
   "version": "1.0",
-  "truncation": {
-    "direction": "Right",
-    "max_length": 1023,
-    "strategy": "LongestFirst",
-    "stride": 0
-  },
-  "padding": {
-    "strategy": "BatchLongest",
-    "direction": "Right",
-    "pad_to_multiple_of": null,
-    "pad_id": 0,
-    "pad_type_id": 0,
-    "pad_token": "<|endoftext|>"
-  },
+  "truncation": null,
+  "padding": null,
   "added_tokens": [
     {
       "id": 0,