SrikrishnaIyer
committed on
Upload 6 files
- README.md +51 -3
- config.json +26 -0
- merges.txt +0 -0
- model.safetensors +3 -0
- tokenizer_config.json +64 -0
- vocab.json +0 -0
README.md
CHANGED
@@ -1,3 +1,51 @@
# When Babies Teach Babies: Peer Knowledge Sharing Beats Teacher-Guided Distillation in Small-Data LMs

This model uses weighted mutual learning (WML) to find and train distilled versions of a teacher model through peer-to-peer learning. It builds on the approach described in "Weighted Mutual Learning with Diversity-Driven Model Compression" (Zhang et al., 2022), with some key differences.

## Approach

### Peer Model Initialization

Unlike the original paper, which uses differential pruning of the teacher model, we use Bayesian optimization to initialize smaller peer models:

- For example, if `num_peers = 4`, the target parameter counts are N/2, N/3, N/4, and N/5 (where N is the teacher's parameter count)
- Optimize `num_layers`, `attention_heads`, and `hidden_size` to reach target parameter counts
- This ensures diversity while also reducing model size

The key difference is that pruning (as used in the original paper) only masks parameters, while our distillation approach actually shrinks the model architecture; the initialization search is sketched below.
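
To make this concrete, here is a minimal sketch of what the initialization search could look like with the `bayesian-optimization` package (the `bayesian_init_points` and `bayesian_n_iter` hyperparameters below suggest that interface). The parameter-count formula, search bounds, and teacher size are illustrative assumptions, not the exact training code.

```python
from bayes_opt import BayesianOptimization  # pip install bayesian-optimization

def param_count(num_layers, hidden_size, vocab_size=50265):
    # Rough transformer size estimate (an assumption): token embeddings
    # plus roughly 12 * hidden_size^2 weights per layer.
    return vocab_size * hidden_size + num_layers * 12 * hidden_size ** 2

def make_objective(target):
    def objective(num_layers, hidden_size, num_heads):
        num_layers, num_heads = int(round(num_layers)), int(round(num_heads))
        # Snap hidden_size so it divides evenly across attention heads.
        hidden_size = max(1, round(hidden_size / num_heads)) * num_heads
        # Reward architectures whose size lands close to the target count.
        return -abs(param_count(num_layers, hidden_size) - target)
    return objective

N = 125_000_000  # teacher size (roberta-base scale); an assumption
for i, d in enumerate(range(2, 6)):  # targets N/2 ... N/5 for num_peers = 4
    opt = BayesianOptimization(
        f=make_objective(N / d),
        pbounds={"num_layers": (2, 32), "hidden_size": (64, 768), "num_heads": (2, 32)},
        random_state=i,
    )
    opt.maximize(init_points=10, n_iter=100)  # bayesian_init_points, bayesian_n_iter
    print(f"peer {i}:", opt.max["params"])
```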

### Weighted Mutual Learning

We use the bi-level optimization method from the paper to minimize the WML loss and ensemble loss:

1. Inner loop: train the peer models with a weighted knowledge-distillation loss (cross entropy plus KL divergence)
2. Outer loop: update the peer weights with mirror gradient descent to optimize ensemble performance (the ensemble loss)

This allows the framework to dynamically adjust the importance of each peer during training; both loops are sketched below.
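
As a hedged illustration of the two loops in PyTorch (function boundaries and the temperature `T` are assumptions; `alpha` plays the role of `loss_alpha` from the table below):

```python
import torch
import torch.nn.functional as F

def wml_inner_loss(logits_i, labels, peer_logits, peer_weights, i, alpha=0.5, T=1.0):
    # Inner loop, peer i: cross entropy on the data plus a peer-weighted
    # KL term pulling peer i toward each other peer's (detached) distribution.
    ce = F.cross_entropy(logits_i, labels)
    kl = sum(
        peer_weights[j] * F.kl_div(
            F.log_softmax(logits_i / T, dim=-1),
            F.softmax(logits_j.detach() / T, dim=-1),
            reduction="batchmean",
        )
        for j, logits_j in enumerate(peer_logits)
        if j != i
    )
    return (1 - alpha) * ce + alpha * kl

def mirror_descent_step(peer_weights, ensemble_grad, lr=0.05):
    # Outer loop: an exponentiated-gradient (mirror descent) update keeps
    # the peer weights positive and summing to one.
    w = peer_weights * torch.exp(-lr * ensemble_grad)
    return w / w.sum()
```

Here `ensemble_grad` stands for the gradient of the ensemble loss with respect to the peer weights, however it is estimated in the actual training loop.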

## Hyperparameters of the champion peer model

| Hyperparameter | Value |
|----------------|-------|
| weight_decay | 0.1 |
| beta1 | 0.9 |
| beta2 | 0.95 |
| bayesian_init_points | 10 |
| bayesian_n_iter | 100 |
| grad_clip | 1.0 |
| prune_importance | 'l1' |
| layer_bound | 0.9 |
| batch_size | 3 |
| block_size | 512 |
| num_epochs | 100 |
| loss_alpha | 0.5 |
| num_batches | 60 |
| warmup_iters | 5 |
| learning_rate | 0.05 |
| lr_decay_iters | 200 |
| min_lr | 0.005 |
| enable_early_stopping | True |
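
The `warmup_iters`, `lr_decay_iters`, and `min_lr` entries suggest a linear-warmup, cosine-decay schedule of the kind common in GPT-style training loops; a minimal sketch under that assumption:

```python
import math

def get_lr(it, learning_rate=0.05, warmup_iters=5, lr_decay_iters=200, min_lr=0.005):
    # Linear warmup for the first warmup_iters steps.
    if it < warmup_iters:
        return learning_rate * (it + 1) / warmup_iters
    # Past lr_decay_iters, hold the floor value.
    if it > lr_decay_iters:
        return min_lr
    # Cosine decay from learning_rate down to min_lr in between.
    decay = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    return min_lr + 0.5 * (1.0 + math.cos(math.pi * decay)) * (learning_rate - min_lr)
```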

## References

Zhang, M., Wang, L., Campos, D., Huang, W., Guo, C., & Yang, B. (2022). Weighted Mutual Learning with Diversity-Driven Model Compression. Advances in Neural Information Processing Systems, 35.
config.json
ADDED
@@ -0,0 +1,26 @@
{
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.37.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 50265
}
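
For reference, a minimal sketch of instantiating this architecture from the config with `transformers` (loading the uploaded weights would instead use `from_pretrained` on the model repo):

```python
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig.from_json_file("config.json")
model = RobertaForMaskedLM(config)  # randomly initialized at this architecture
# Roughly 34M parameters, consistent with the 135,924,052-byte float32
# model.safetensors file below (135,924,052 / 4 bytes per weight).
print(sum(p.numel() for p in model.parameters()))
```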
merges.txt
ADDED
The diff for this file is too large to render.
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fa800ab4e97200542bfa5db649fdfd9a2fd3b3c7f157041d28ffe691abb37bd0
size 135924052
tokenizer_config.json
ADDED
@@ -0,0 +1,64 @@
{
  "add_prefix_space": false,
  "bos_token": {
    "__type": "AddedToken",
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "__type": "AddedToken",
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "__type": "AddedToken",
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "errors": "replace",
  "mask_token": {
    "__type": "AddedToken",
    "content": "<mask>",
    "lstrip": true,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "model_max_length": 512,
  "name_or_path": "roberta-base",
  "pad_token": {
    "__type": "AddedToken",
    "content": "<pad>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "__type": "AddedToken",
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "special_tokens_map_file": null,
  "tokenizer_class": "RobertaTokenizer",
  "unk_token": {
    "__type": "AddedToken",
    "content": "<unk>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
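
A minimal usage sketch, assuming `vocab.json`, `merges.txt`, and this config sit in the current directory:

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained(".")  # reads vocab.json + merges.txt
print(tokenizer("When babies teach babies <mask>.")["input_ids"])
```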
vocab.json
ADDED
The diff for this file is too large to render.