This is a version of [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) distilled down to 16 layers out of 22.
The last 6 local attention layers were removed:

0. Global
1. Local
2. Local
3. Global
4. Local
5. Local
6. Global
7. Local
8. Local
9. Global
10. Local
11. Local
12. Global
13. Local (REMOVED)
14. Local (REMOVED)
15. Global
16. Local (REMOVED)
17. Local (REMOVED)
18. Global
19. Local (REMOVED)
20. Local (REMOVED)
21. Global
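
(For reference: this striping comes straight from the stock configuration, where a layer is global if its index is a multiple of
`global_attn_every_n_layers`, which is 3 for ModernBERT-base. The modulo rule below is my reading of the current `transformers`
ModernBERT code, so treat it as an assumption.)

```python
# Reproduce the global/local striping table above for ModernBERT-base.
# Assumption: a layer is global iff layer_id % global_attn_every_n_layers == 0,
# mirroring how the HF ModernBERT code assigns per-layer attention windows.
GLOBAL_EVERY_N = 3  # ModernBertConfig.global_attn_every_n_layers

for layer_id in range(22):
    kind = "Global" if layer_id % GLOBAL_EVERY_N == 0 else "Local"
    print(f"{layer_id}. {kind}")
```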

Unfortunately the HuggingFace modeling code for ModernBERT relies on global-local attention patterns being uniform throughout the model,
so loading this bad boy properly takes a bit of model surgery. I hope in the future that the HuggingFace team will update this
model configuration to allow custom striping of global+local layers. For now, here's how to do it (a condensed code sketch
of the whole procedure follows the steps):

1. Download the checkpoint (model.pt) from this repository.
…
5. Use the model! Yay!
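
In case it's useful, here's a condensed sketch of the whole surgery. Everything except
`model.model.load_state_dict(state_dict)` is an assumption on my part (the removed-layer indices come from the table above,
and the `transformers` calls are the standard ones), so treat it as a starting point rather than the exact recipe:

```python
import torch
from transformers import ModernBertForMaskedLM

# The 6 removed layers: the last 6 local-attention layers in the table above.
REMOVED = {13, 14, 16, 17, 19, 20}

# Build the full 22-layer model first, so every surviving layer is constructed
# with its original global/local attention pattern baked in.
model = ModernBertForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

# Drop the removed layers; the ModuleList re-indexes the survivors 0..15.
model.model.layers = torch.nn.ModuleList(
    layer for i, layer in enumerate(model.model.layers) if i not in REMOVED
)
model.config.num_hidden_layers = len(model.model.layers)

# Load the distilled weights from this repository's checkpoint (step 1).
state_dict = torch.load("model.pt", map_location="cpu")
model.model.load_state_dict(state_dict)
```

From there, fill-mask works as usual: tokenize "The capital of France is [MASK].", run the model, and take the argmax of
the logits at the `[MASK]` position.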
56
 
57
  # Training Information
58
+ This model was distilled from ModernBERT-base on the [MiniPile dataset](https://huggingface.co/datasets/JeanKaddour/minipile),
59
+ which includes English and code data. Distillation used all 1M samples in this dataset for 1 epoch, MSE loss on the logits,
60
+ batch size of 16, AdamW optimizer, and constant learning rate of 1.0e-5.
61
  The embeddings/LM head were frozen and shared between the teacher and student; only the transformer blocks were trained.
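
A minimal sketch of what one training step looked like under that setup (the dataloader and the attribute names used for
freezing are assumptions; the loss, optimizer, and learning rate are the ones listed above):

```python
import torch
import torch.nn.functional as F

# Assumed setup: `teacher` is the frozen 22-layer ModernBERT-base MLM model,
# `student` is the 16-layer model built above, and `dataloader` yields
# tokenized MiniPile batches of size 16.
for module in (student.model.embeddings, student.head, student.decoder):
    for p in module.parameters():  # embeddings + LM head stay frozen,
        p.requires_grad = False    # shared with the teacher

optimizer = torch.optim.AdamW(
    (p for p in student.parameters() if p.requires_grad), lr=1.0e-5
)

teacher.eval()
for batch in dataloader:  # one epoch over all 1M MiniPile samples
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits
    loss = F.mse_loss(student_logits, teacher_logits)  # MSE on the logits
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```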
I have not yet evaluated this model. However, after the initial model surgery, it failed to correctly complete
"The capital of France is [MASK]", and after training, it correctly says "Paris", so something good happened!