andersonbcdefg
committed
Update README.md

This is a version of [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) distilled down to 16 layers out of 22.
The last 6 local attention layers were removed:

0. Global
1. Local
...
20. Local (REMOVED)
21. Global
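
For context, the stock ModernBERT-base does not store this list anywhere: the global/local assignment comes from a single config rule (roughly, a layer is global when `layer_id % global_attn_every_n_layers == 0`, with a default of every 3rd layer). The snippet below only illustrates that rule; it paraphrases the HuggingFace modeling code and may not match every `transformers` version exactly.

```python
# Illustration of ModernBERT's default striping rule (paraphrased from the
# HuggingFace modeling code; the config field is `global_attn_every_n_layers`,
# which defaults to 3).
def is_global_layer(layer_id: int, global_attn_every_n_layers: int = 3) -> bool:
    return layer_id % global_attn_every_n_layers == 0

# Should reproduce the Global/Local pattern listed above for the original 22 layers.
print([f"{i}. {'Global' if is_global_layer(i) else 'Local'}" for i in range(22)])
```

This is why a model whose remaining layers no longer follow that regular pattern cannot be described through the config alone.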

Unfortunately, the HuggingFace modeling code for ModernBERT relies on the global/local attention pattern being uniform throughout the model,
so loading this bad boy properly takes a bit of model surgery. I hope that in the future the HuggingFace team will update the
model configuration to allow custom striping of global and local layers. For now, here's how to do it:

1. Download the checkpoint (model.pt) from this repository.
...
5. Use the model! Yay!
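
The steps above amount to instantiating a 16-layer ModernBERT, overriding each layer's attention type to match the list at the top, and then loading the checkpoint. Here is a minimal end-to-end sketch of that surgery, not the exact code from the steps: it assumes a recent `transformers` release where `ModernBertForMaskedLM` wraps the backbone as `.model` and each encoder layer exposes `attn.local_attention`, and the `is_global` pattern below is my reconstruction from "the last 6 local layers were removed" plus the default every-3rd-layer striping. Verify all of these against the layer list above and the actual steps in this README.

```python
# Sketch of the model surgery (assumptions noted in comments; not the exact
# code from this README's steps).
import torch
from transformers import ModernBertConfig, ModernBertForMaskedLM

# 16 remaining layers; everything else inherited from ModernBERT-base.
config = ModernBertConfig.from_pretrained("answerdotai/ModernBERT-base", num_hidden_layers=16)
model = ModernBertForMaskedLM(config)

# Global/local pattern of the 16 kept layers, in order. Reconstructed under the
# assumption that the original model is global every 3rd layer and that the
# last 6 local layers (13, 14, 16, 17, 19, 20) were removed -- double-check
# against the layer list above.
is_global = [True, False, False, True, False, False, True, False, False,
             True, False, False, True, True, True, True]

window = config.local_attention  # local sliding-window size (128 by default)
for layer, g in zip(model.model.layers, is_global):
    # In the HF implementation, (-1, -1) disables the sliding window (global
    # attention); otherwise the tuple is the (left, right) half-window.
    layer.attn.local_attention = (-1, -1) if g else (window // 2, window // 2)
    # Note: depending on the transformers version, layers may also carry their
    # own rotary embedding (global vs. local RoPE theta), which would need the
    # same per-layer treatment.

# Load the distilled weights from this repo's model.pt into the backbone,
# matching the `model.model.load_state_dict(state_dict)` call in the steps.
state_dict = torch.load("model.pt", map_location="cpu")
model.model.load_state_dict(state_dict)
model.eval()
```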

# Training Information
This model was distilled from ModernBERT-base on the [MiniPile dataset](https://huggingface.co/datasets/JeanKaddour/minipile),
which includes English and code data. Distillation used all 1M samples in this dataset for 1 epoch, MSE loss on the logits,
batch size of 16, AdamW optimizer, and constant learning rate of 1.0e-5.
The embeddings/LM head were frozen and shared between the teacher and student; only the transformer blocks were trained.
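
For readers curious what that recipe looks like in practice, here is a rough sketch of the distillation loop. Only the details stated above (MSE on the logits, frozen embeddings/LM head, all of MiniPile for 1 epoch, batch size 16, AdamW, constant lr 1.0e-5) come from this README; the masking rate, sequence length, data collation, and the hypothetical `build_student()` helper (the surgery sketch above) are assumptions.

```python
# Rough sketch of the described distillation recipe (not the original script).
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
teacher = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base").to(device).eval()
student = build_student().to(device)  # hypothetical helper: the 16-layer surgery above

# "Only the transformer blocks were trained": optimize the encoder layers only,
# leaving the (shared) embeddings and LM head frozen.
optimizer = torch.optim.AdamW(student.model.layers.parameters(), lr=1e-5)

dataset = load_dataset("JeanKaddour/minipile", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),  # max_length is an assumption
    batched=True, remove_columns=dataset.column_names,
)
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.3)  # masking rate is an assumption
loader = DataLoader(dataset, batch_size=16, shuffle=True, collate_fn=collator)

student.train()
for batch in loader:  # one epoch over all ~1M MiniPile samples
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    with torch.no_grad():
        teacher_logits = teacher(input_ids=input_ids, attention_mask=attention_mask).logits
    student_logits = student(input_ids=input_ids, attention_mask=attention_mask).logits
    loss = F.mse_loss(student_logits, teacher_logits)  # MSE on the logits
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```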

I have not yet evaluated this model. However, after the initial model surgery, it failed to correctly complete
"The capital of France is [MASK]", and after training, it correctly says "Paris", so something good happened!
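
A quick way to reproduce that sanity check, assuming `model` is the surgically loaded model from the sketch above and the tokenizer is the unmodified ModernBERT-base one:

```python
# Fill-mask sanity check (assumes `model` from the surgery sketch above).
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
inputs = tokenizer("The capital of France is [MASK]", return_tensors="pt")
with torch.no_grad():
    logits = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]).logits

# Locate the [MASK] position and decode the highest-scoring token.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # after training, this should come out as "Paris"
```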