andersonbcdefg
committed
Update README.md

This is a version of [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) distilled down to 16 layers out of 22.
The last 6 local attention layers were removed:

0. Global
1. Local
...
20. Local (REMOVED)
21. Global
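
For context, the stock ModernBERT-base does not store this list anywhere: the global/local assignment comes from a single config rule (roughly, a layer is global when `layer_id % global_attn_every_n_layers == 0`, with a default of every 3rd layer). The snippet below only illustrates that rule; it paraphrases the HuggingFace modeling code and may not match every `transformers` version exactly.

```python
# Illustration of ModernBERT's default striping rule (paraphrased from the
# HuggingFace modeling code; the config field is `global_attn_every_n_layers`,
# which defaults to 3).
def is_global_layer(layer_id: int, global_attn_every_n_layers: int = 3) -> bool:
    return layer_id % global_attn_every_n_layers == 0

# Should reproduce the Global/Local pattern listed above for the original 22 layers.
print([f"{i}. {'Global' if is_global_layer(i) else 'Local'}" for i in range(22)])
```

This is why a model whose remaining layers no longer follow that regular pattern cannot be described through the config alone.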

Unfortunately, the HuggingFace modeling code for ModernBERT relies on the global/local attention pattern being uniform throughout the model,
so loading this bad boy properly takes a bit of model surgery. I hope that in the future the HuggingFace team will update the
model configuration to allow custom striping of global and local layers. For now, here's how to do it:

1. Download the checkpoint (model.pt) from this repository.
...
5. Use the model! Yay!
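
The steps above amount to instantiating a 16-layer ModernBERT, overriding each layer's attention type to match the list at the top, and then loading the checkpoint. Here is a minimal end-to-end sketch of that surgery, not the exact code from the steps: it assumes a recent `transformers` release where `ModernBertForMaskedLM` wraps the backbone as `.model` and each encoder layer exposes `attn.local_attention`, and the `is_global` pattern below is my reconstruction from "the last 6 local layers were removed" plus the default every-3rd-layer striping. Verify all of these against the layer list above and the actual steps in this README.

```python
# Sketch of the model surgery (assumptions noted in comments; not the exact
# code from this README's steps).
import torch
from transformers import ModernBertConfig, ModernBertForMaskedLM

# 16 remaining layers; everything else inherited from ModernBERT-base.
config = ModernBertConfig.from_pretrained("answerdotai/ModernBERT-base", num_hidden_layers=16)
model = ModernBertForMaskedLM(config)

# Global/local pattern of the 16 kept layers, in order. Reconstructed under the
# assumption that the original model is global every 3rd layer and that the
# last 6 local layers (13, 14, 16, 17, 19, 20) were removed -- double-check
# against the layer list above.
is_global = [True, False, False, True, False, False, True, False, False,
             True, False, False, True, True, True, True]

window = config.local_attention  # local sliding-window size (128 by default)
for layer, g in zip(model.model.layers, is_global):
    # In the HF implementation, (-1, -1) disables the sliding window (global
    # attention); otherwise the tuple is the (left, right) half-window.
    layer.attn.local_attention = (-1, -1) if g else (window // 2, window // 2)
    # Note: depending on the transformers version, layers may also carry their
    # own rotary embedding (global vs. local RoPE theta), which would need the
    # same per-layer treatment.

# Load the distilled weights from this repo's model.pt into the backbone,
# matching the `model.model.load_state_dict(state_dict)` call in the steps.
state_dict = torch.load("model.pt", map_location="cpu")
model.model.load_state_dict(state_dict)
model.eval()
```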

# Training Information
This model was distilled from ModernBERT-base on the [MiniPile dataset](https://huggingface.co/datasets/JeanKaddour/minipile),
which includes English and code data. Distillation used all 1M samples in this dataset for 1 epoch, MSE loss on the logits,
batch size of 16, AdamW optimizer, and constant learning rate of 1.0e-5.
The embeddings/LM head were frozen and shared between the teacher and student; only the transformer blocks were trained.
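
For readers curious what that recipe looks like in practice, here is a rough sketch of the distillation loop. Only the details stated above (MSE on the logits, frozen embeddings/LM head, all of MiniPile for 1 epoch, batch size 16, AdamW, constant lr 1.0e-5) come from this README; the masking rate, sequence length, data collation, and the hypothetical `build_student()` helper (the surgery sketch above) are assumptions.

```python
# Rough sketch of the described distillation recipe (not the original script).
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
teacher = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base").to(device).eval()
student = build_student().to(device)  # hypothetical helper: the 16-layer surgery above

# "Only the transformer blocks were trained": optimize the encoder layers only,
# leaving the (shared) embeddings and LM head frozen.
optimizer = torch.optim.AdamW(student.model.layers.parameters(), lr=1e-5)

dataset = load_dataset("JeanKaddour/minipile", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),  # max_length is an assumption
    batched=True, remove_columns=dataset.column_names,
)
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.3)  # masking rate is an assumption
loader = DataLoader(dataset, batch_size=16, shuffle=True, collate_fn=collator)

student.train()
for batch in loader:  # one epoch over all ~1M MiniPile samples
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    with torch.no_grad():
        teacher_logits = teacher(input_ids=input_ids, attention_mask=attention_mask).logits
    student_logits = student(input_ids=input_ids, attention_mask=attention_mask).logits
    loss = F.mse_loss(student_logits, teacher_logits)  # MSE on the logits
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```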

I have not yet evaluated this model. However, after the initial model surgery, it failed to correctly complete
"The capital of France is [MASK]", and after training, it correctly says "Paris", so something good happened!
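
A quick way to reproduce that sanity check, assuming `model` is the surgically loaded model from the sketch above and the tokenizer is the unmodified ModernBERT-base one:

```python
# Fill-mask sanity check (assumes `model` from the surgery sketch above).
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
inputs = tokenizer("The capital of France is [MASK]", return_tensors="pt")
with torch.no_grad():
    logits = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]).logits

# Locate the [MASK] position and decode the highest-scoring token.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # after training, this should come out as "Paris"
```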