|
--- |
|
language: |
|
- code |
|
extra_gated_prompt: >- |
|
## Model License Agreement |
|
|
|
Please read the BigCode [OpenRAIL-M license agreement](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement) before accepting it.
|
|
|
extra_gated_fields: |
|
I accept the above license agreement, and will use the Model complying with the set of use restrictions and sharing requirements: checkbox |
|
--- |
|
# StarEncoder
|
|
|
## Table of Contents |
|
|
|
1. [Model Summary](#model-summary)

2. [Training](#training)

3. [Use](#use)

4. [Limitations](#limitations)

5. [License](#license)
|
|
|
## Model Summary |
|
|
|
StarEncoder is an encoder-only model (i.e., a bi-directionally self-attentive Transformer) trained on [The Stack](https://huggingface.co/datasets/bigcode/the-stack), a dataset of permissively licensed source code.
|
|
|
- **Project Website:** [bigcode-project.org](https://www.bigcode-project.org) |
|
- **Point of Contact:** [[email protected]](mailto:[email protected]) |
|
- **Languages:** 80+ programming languages
|
|
|
|
|
We leveraged the Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) objectives from [BERT](https://arxiv.org/abs/1810.04805):

- MLM: predict masked-out tokens in an input sentence.

- NSP: predict whether a pair of sentences occur as neighbors in a document.
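
To make the MLM half concrete, here is a minimal sketch of BERT-style dynamic masking using the `transformers` data collator. The `bigcode/starencoder` repo id and the 15% masking rate are assumptions (BERT's defaults), not a statement of the exact training setup:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Tokenizer for the released checkpoint (assumed repo id).
tokenizer = AutoTokenizer.from_pretrained("bigcode/starencoder")

# BERT-style MLM: a fraction of tokens is masked and must be predicted.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

batch = collator([tokenizer("def add(a, b):\n    return a + b")])
print(batch["input_ids"])  # some token ids replaced by the [MASK] id
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```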
|
|
|
## Training |
|
|
|
We train for 100,000 steps with a global batch size of 4,096 sequences of at most 1,024 tokens each, so that approximately 400B tokens are observed. Training takes roughly two days on 64 NVIDIA A100 GPUs.
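
As a quick sanity check on that figure (simple arithmetic, assuming every sequence is packed to the full 1,024 tokens):

```python
steps, batch_size, seq_len = 100_000, 4_096, 1_024

tokens = steps * batch_size * seq_len
print(f"{tokens / 1e9:.0f}B tokens seen")  # 419B, i.e. roughly 400B
```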
|
Details about the model architecture are reported in the table below. |
|
|
|
| Hyperparameter | Value | |
|
|--------------------------|-----------| |
|
| Hidden size | 768 | |
|
| Intermediate size | 3072 | |
|
| Max. position embeddings | 1024 | |
|
| Num. of attention heads | 12 | |
|
| Num. of hidden layers | 12 | |
|
| Attention | Multi-head |
|
| Num. of parameters | ≈125M | |
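
These hyperparameters match a BERT-base-sized model. As a minimal sketch, an equivalent `transformers` configuration could be instantiated as follows; the vocabulary size below is a hypothetical placeholder, not the checkpoint's actual value:

```python
from transformers import BertConfig, BertModel

# Architecture hyperparameters from the table above.
config = BertConfig(
    vocab_size=50_000,  # hypothetical placeholder; use the released tokenizer's size
    hidden_size=768,
    intermediate_size=3072,
    max_position_embeddings=1024,
    num_attention_heads=12,
    num_hidden_layers=12,
)

model = BertModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```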
|
|
|
|
|
## Use |
|
|
|
This model was trained on 86 programming languages of GitHub code, including GitHub issues and Git commits, and can be efficiently fine-tuned for both code- and text-related tasks.
|
We fine-tuned it on a token classification task to detect personally identifiable information (PII) and released the resulting [StarPII](https://huggingface.co/bigcode/starpii) model.
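
A minimal sketch of how such a fine-tune could be set up, assuming the checkpoint is hosted as `bigcode/starencoder` and using standard `transformers` classes (the two-label scheme is hypothetical, for illustration only):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "bigcode/starencoder"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Attach a freshly initialized token-classification head
# (hypothetical O-vs-PII label scheme).
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=2)

inputs = tokenizer("email = '[email protected]'", return_tensors="pt")
logits = model(**inputs).logits  # shape: (batch, seq_len, num_labels)
```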
|
|
|
|
|
## Limitations |
|
There are limitations to consider when using StarEncoder. As an encoder-only model, it is not well suited to open-ended code generation or completion tasks. It was trained on data containing PII, which could pose privacy concerns. Performance may vary across the 80+ supported programming languages, particularly for less common ones, and the model may struggle with text from domains outside programming.
|
|
|
## License |
|
|
|
The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement). |