---
license: apache-2.0
language:
- ind
- ace
- ban
- bjn
- bug
- gor
- jav
- min
- msa
- nia
- sun
- tet
language_bcp47:
- jv-x-bms
datasets:
- sabilmakbar/indo_wiki
- acul3/KoPI-NLLB
- uonlp/CulturaX
tags:
- bert
---
# NusaBERT Base
[NusaBERT](https://arxiv.org/abs/2403.01817) Base is a multilingual encoder-based language model built on the [BERT](https://arxiv.org/abs/1810.04805) architecture. We performed continued pre-training on the open-source corpora [sabilmakbar/indo_wiki](https://huggingface.co/datasets/sabilmakbar/indo_wiki), [acul3/KoPI-NLLB](https://huggingface.co/datasets/acul3/KoPI-NLLB), and [uonlp/CulturaX](https://huggingface.co/datasets/uonlp/CulturaX). On a held-out subset of the corpus, our model achieved:
- `eval_accuracy`: 0.6866
- `eval_loss`: 1.4876
- `perplexity`: 4.4266
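
For reference, the reported perplexity is simply the exponential of the evaluation loss (cross-entropy in nats), so the numbers above are internally consistent:

```python
import math

# Perplexity is exp(cross-entropy loss): exp(1.4876) ≈ 4.4266,
# which matches the reported `eval_loss` and `perplexity`.
print(math.exp(1.4876))  # ≈ 4.4266
```
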
This model was trained with the [🤗Transformers](https://github.com/huggingface/transformers) PyTorch framework. All training was done on an NVIDIA H100 GPU. [LazarusNLP/NusaBERT-base](https://huggingface.co/LazarusNLP/NusaBERT-base) is released under the Apache 2.0 license.
## Model Details
- **Developed by**: [LazarusNLP](https://lazarusnlp.github.io/)
- **Finetuned from**: [IndoBERT base p1](https://huggingface.co/indobenchmark/indobert-base-p1)
- **Model type**: Encoder-based BERT language model
- **Language(s)**: Indonesian, Acehnese, Balinese, Banjarese, Buginese, Gorontalo, Javanese, Banyumasan, Minangkabau, Malay, Nias, Sundanese, Tetum
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html)
- **Contact**: [LazarusNLP](https://lazarusnlp.github.io/)
## Use in 🤗Transformers
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
model_checkpoint = "LazarusNLP/NusaBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
```
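
For a quick check of the masked-language-modeling head, the checkpoint can also be used through the `fill-mask` pipeline. This is a minimal sketch, assuming the standard `[MASK]` token of the BERT-style tokenizer; the example sentence is illustrative and not part of the original card.

```python
from transformers import pipeline

# Minimal fill-mask sketch; the Indonesian input sentence is only an example.
fill_mask = pipeline("fill-mask", model="LazarusNLP/NusaBERT-base")

# "The capital of Indonesia is [MASK]."
for prediction in fill_mask("Ibu kota Indonesia adalah [MASK]."):
    print(prediction["token_str"], prediction["score"])
```
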
## Training Datasets
Around 16B tokens from the following corpora were used during pre-training.
- [Indonesian Wikipedia Data Repository](https://huggingface.co/datasets/sabilmakbar/indo_wiki)
- [KoPI-NLLB (Korpus Perayapan Indonesia)](https://huggingface.co/datasets/acul3/KoPI-NLLB)
- [Cleaned, Enormous, and Public: The Multilingual Fuel to Democratize Large Language Models for 167 Languages](https://huggingface.co/datasets/uonlp/CulturaX)
## Training Hyperparameters
The following hyperparameters were used during training (an illustrative configuration sketch follows the list):
- `learning_rate`: 0.0003
- `train_batch_size`: 256
- `eval_batch_size`: 256
- `seed`: 42
- `optimizer`: Adam with `betas=(0.9,0.999)` and `epsilon=1e-08`
- `lr_scheduler_type`: linear
- `lr_scheduler_warmup_steps`: 24000
- `training_steps`: 500000
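
As a rough illustration only (not the authors' actual training script), these settings map onto 🤗 Transformers `TrainingArguments` roughly as follows; the output directory and the assumption that the batch size is per device on a single GPU are hypothetical:

```python
from transformers import TrainingArguments

# Hypothetical sketch of the configuration above; data preparation,
# the data collator, and the Trainer itself are omitted.
training_args = TrainingArguments(
    output_dir="nusabert-base-cpt",   # hypothetical output path
    learning_rate=3e-4,
    per_device_train_batch_size=256,  # assumes a single-GPU setup
    per_device_eval_batch_size=256,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=24_000,
    max_steps=500_000,
)
```
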
### Framework versions
- Transformers 4.37.2
- PyTorch 2.2.0+cu118
- Datasets 2.17.1
- Tokenizers 0.15.1
## Credits
NusaBERT Base is developed with love by:
<div style="display: flex;">
<a href="https://github.com/anantoj">
<img src="https://github.com/anantoj.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;">
</a>
<a href="https://github.com/DavidSamuell">
<img src="https://github.com/DavidSamuell.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;">
</a>
<a href="https://github.com/stevenlimcorn">
<img src="https://github.com/stevenlimcorn.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;">
</a>
<a href="https://github.com/w11wo">
<img src="https://github.com/w11wo.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;">
</a>
</div>
## Citation
```bib
@misc{wongso2024nusabert,
  title={NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural},
  author={Wilson Wongso and David Samuel Setiawan and Steven Limcorn and Ananto Joyoadikusumo},
  year={2024},
  eprint={2403.01817},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```