Upload 6 files

Files changed (6)

README.md +51 -3
config.json +26 -0
merges.txt +0 -0
model.safetensors +3 -0
tokenizer_config.json +64 -0
vocab.json +0 -0

README.md CHANGED

@@ -1,3 +1,51 @@
----
-license: mit
----

+# When Babies Teach Babies: Peer Knowledge Sharing Beats Teacher-Guided Distillation in Small-Data LMs
+This model is trained with weighted mutual learning (WML): smaller peer models are derived from a teacher model and then trained by learning from one another. The approach builds on "Weighted Mutual Learning with Diversity-Driven Model Compression" (Zhang et al., 2022), with the key differences described below.
+## Approach
+### Peer Model Initialization
+Unlike the original paper, which uses differential pruning of the teacher model, we use Bayesian optimization to initialize smaller peer models:
+- For example, if `num_peers = 4`, target parameter counts are N/2, N/3, N/4, N/5 (where N is the teacher model size)
+- Optimize `num_layers`, `attention_heads`, and `hidden_size` to reach target parameter counts
+- Giving each peer a different target keeps the peers diverse while also reducing model size
+The key difference is that pruning (as used in the original paper) only masks parameters, whereas our approach shrinks the model architecture itself.
+### Weighted Mutual Learning
+We use the bi-level optimization method from the paper to minimize the WML loss and ensemble loss:
+1. Inner loop: Train peer models using weighted knowledge distillation loss (cross entropy + KL divergence)
+2. Outer loop: Update peer weights using mirror gradient descent to optimize ensemble performance (ensemble loss)
+This allows the framework to dynamically adjust the importance of each peer during training.
+## Hyperparameters of the champion peer model
+| Hyperparameter | Value |
+|----------------|-------|
+| weight_decay | 0.1 |
+| beta1 | 0.9 |
+| beta2 | 0.95 |
+| bayesian_init_points | 10 |
+| bayesian_n_iter | 100 |
+| grad_clip | 1.0 |
+| prune_importance | 'l1' |
+| layer_bound | 0.9 |
+| batch_size | 3 |
+| block_size | 512 |
+| num_epochs | 100 |
+| loss_alpha | 0.5 |
+| num_batches | 60 |
+| warmup_iters | 5 |
+| learning_rate | 0.05 |
+| lr_decay_iters | 200 |
+| min_lr | 0.005 |
+| enable_early_stopping | True |
+## References
+Zhang, M., Wang, L., Campos, D., Huang, W., Guo, C., & Yang, B. (2022). Weighted Mutual Learning with Diversity-Driven Model Compression. Advances in Neural Information Processing Systems, 35.
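
The two loops described under "Weighted Mutual Learning" in the README can be illustrated with a minimal pure-Python sketch. The README does not give the loss formula, so the form below is an assumption: each peer's loss mixes cross entropy with a weighted KL term using `loss_alpha`, and the outer step is the standard exponentiated-gradient form of mirror descent on the probability simplex. The function names and `step_size` are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q) for two discrete distributions over the same support.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def wml_loss(peer_logits, target, weights, alpha=0.5):
    """Inner-loop objective (assumed form): weighted CE plus weighted peer-to-peer KL."""
    probs = [softmax(z) for z in peer_logits]
    total = 0.0
    for i, p_i in enumerate(probs):
        ce = -math.log(p_i[target])
        # Peer i learns from every other peer j, scaled by that peer's weight.
        kl = sum(weights[j] * kl_divergence(p_j, p_i)
                 for j, p_j in enumerate(probs) if j != i)
        total += weights[i] * ((1.0 - alpha) * ce + alpha * kl)
    return total

def mirror_descent_step(weights, grads, step_size=0.1):
    """Outer-loop update: exponentiated gradient keeps weights positive and summing to 1."""
    scaled = [w * math.exp(-step_size * g) for w, g in zip(weights, grads)]
    norm = sum(scaled)
    return [s / norm for s in scaled]

# Two peers, one 3-class example; the gradients of the ensemble loss are made up.
weights = [0.5, 0.5]
loss = wml_loss([[2.0, 0.5, 0.1], [1.0, 1.2, 0.3]], target=0, weights=weights)
weights = mirror_descent_step(weights, grads=[0.2, 0.9])
```

In this sketch a peer whose weight has a larger ensemble-loss gradient is down-weighted, which is the sense in which the importance of each peer is adjusted during training.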

config.json ADDED

	@@ -0,0 +1,26 @@

+{
+  "architectures": [
+    "RobertaForMaskedLM"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "classifier_dropout": null,
+  "eos_token_id": 2,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 128,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 514,
+  "model_type": "roberta",
+  "num_attention_heads": 32,
+  "num_hidden_layers": 32,
+  "pad_token_id": 1,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.37.0",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 50265
+}
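
As a rough cross-check of the architecture above, the parameter count can be estimated directly from the config fields. The sketch below assumes the usual Hugging Face RoBERTa masked-LM layout (input and output embeddings tied, no pooler); `roberta_mlm_param_count` and `peer_targets` are hypothetical helpers, and the teacher size used for the targets is a made-up number, since the commit does not state N.

```python
def roberta_mlm_param_count(cfg):
    """Estimate parameters of a RoBERTa masked-LM (tied embeddings, no pooler assumed)."""
    h = cfg["hidden_size"]
    ffn = cfg["intermediate_size"]
    layer_norm = 2 * h  # weight + bias

    embeddings = (
        cfg["vocab_size"] * h
        + cfg["max_position_embeddings"] * h
        + cfg["type_vocab_size"] * h
        + layer_norm
    )
    # Q, K, V and output projections, each with a bias, plus one LayerNorm.
    attention = 4 * (h * h + h) + layer_norm
    # Up- and down-projection with biases, plus one LayerNorm.
    feed_forward = (h * ffn + ffn) + (ffn * h + h) + layer_norm
    encoder = cfg["num_hidden_layers"] * (attention + feed_forward)
    # LM head: dense + LayerNorm + output bias (decoder matrix shared with embeddings).
    lm_head = (h * h + h) + layer_norm + cfg["vocab_size"]
    return embeddings + encoder + lm_head

def peer_targets(teacher_params, num_peers):
    """Target sizes N/2, N/3, ..., N/(num_peers + 1), as described in the README."""
    return [teacher_params // k for k in range(2, num_peers + 2)]

config = {
    "hidden_size": 128,
    "intermediate_size": 3072,
    "num_hidden_layers": 32,
    "max_position_embeddings": 514,
    "type_vocab_size": 2,
    "vocab_size": 50265,
}
n_params = roberta_mlm_param_count(config)
n_bytes = n_params * 4  # torch_dtype is float32

# Hypothetical teacher size, only to illustrate the target schedule.
targets = peer_targets(120_000_000, num_peers=4)
```

Under these assumptions the estimate comes to roughly 34 million parameters, or about 135.9 MB at float32, which is close to the `size` recorded in the `model.safetensors` pointer; the small remainder would be consistent with file header overhead. Note that `num_attention_heads` does not change the count, because the heads split the same `hidden_size` projection.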

merges.txt ADDED

The diff for this file is too large to render. See raw diff

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fa800ab4e97200542bfa5db649fdfd9a2fd3b3c7f157041d28ffe691abb37bd0
+size 135924052

tokenizer_config.json ADDED

	@@ -0,0 +1,64 @@

+{
+  "add_prefix_space": false,
+  "bos_token": {
+    "__type": "AddedToken",
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "__type": "AddedToken",
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "__type": "AddedToken",
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "errors": "replace",
+  "mask_token": {
+    "__type": "AddedToken",
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "model_max_length": 512,
+  "name_or_path": "roberta-base",
+  "pad_token": {
+    "__type": "AddedToken",
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "__type": "AddedToken",
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "special_tokens_map_file": null,
+  "tokenizer_class": "RobertaTokenizer",
+  "unk_token": {
+    "__type": "AddedToken",
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}
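
To show how these special tokens relate to the ids in `config.json`, here is a small sketch of framing a single sequence the way RoBERTa-style tokenizers do (`<s> ... </s>`), with truncation to `model_max_length` and optional padding. `frame_sequence` is a hypothetical helper, not part of any library, and the input ids are arbitrary placeholders rather than real vocabulary lookups.

```python
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # bos_token_id, pad_token_id, eos_token_id in config.json
MODEL_MAX_LENGTH = 512            # model_max_length in tokenizer_config.json

def frame_sequence(token_ids, max_length=MODEL_MAX_LENGTH, pad_to_max=False):
    """Wrap ids as <s> ... </s>, truncating the body so the total fits max_length."""
    body = list(token_ids)[: max_length - 2]  # leave room for <s> and </s>
    input_ids = [BOS_ID] + body + [EOS_ID]
    attention_mask = [1] * len(input_ids)
    if pad_to_max:
        padding = max_length - len(input_ids)
        input_ids += [PAD_ID] * padding
        attention_mask += [0] * padding  # padded positions are masked out
    return input_ids, attention_mask

ids, mask = frame_sequence([10, 11, 12], max_length=8, pad_to_max=True)
```

The two extra position slots in `max_position_embeddings` (514 versus 512) come from RoBERTa offsetting position indices past the padding index, so a framed sequence of 512 tokens still fits.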

vocab.json ADDED

The diff for this file is too large to render. See raw diff