|
--- |
|
license: apache-2.0 |
|
language: |
|
- ind |
|
- ace |
|
- ban |
|
- bjn |
|
- bug |
|
- gor |
|
- jav |
|
- min |
|
- msa |
|
- nia |
|
- sun |
|
- tet |
|
language_bcp47: |
|
- jv-x-bms |
|
datasets: |
|
- sabilmakbar/indo_wiki |
|
- acul3/KoPI-NLLB |
|
- uonlp/CulturaX |
|
tags: |
|
- bert |
|
--- |
|
|
|
# NusaBERT Base |
|
|
|
[NusaBERT](https://arxiv.org/abs/2403.01817) Base is a multilingual, encoder-only language model built on the [BERT](https://arxiv.org/abs/1810.04805) architecture. We conducted continued pre-training on the open-source corpora [sabilmakbar/indo_wiki](https://huggingface.co/datasets/sabilmakbar/indo_wiki), [acul3/KoPI-NLLB](https://huggingface.co/datasets/acul3/KoPI-NLLB), and [uonlp/CulturaX](https://huggingface.co/datasets/uonlp/CulturaX). On a held-out subset of the corpus, our model achieved:
|
|
|
- `eval_accuracy`: 0.6866 |
|
- `eval_loss`: 1.4876 |
|
- `perplexity`: 4.4266 |
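
The reported perplexity is simply the exponential of the evaluation loss, which doubles as a quick consistency check on the numbers above:

```python
import math

# Perplexity of a language model is exp(mean cross-entropy loss)
print(math.exp(1.4876))  # ≈ 4.4266
```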
|
|
|
This model was trained using the [🤗Transformers](https://github.com/huggingface/transformers) framework with PyTorch. All training was done on an NVIDIA H100 GPU. [LazarusNLP/NusaBERT-base](https://huggingface.co/LazarusNLP/NusaBERT-base) is released under the Apache 2.0 license.
|
|
|
## Model Details
|
|
|
- **Developed by**: [LazarusNLP](https://lazarusnlp.github.io/) |
|
- **Finetuned from**: [IndoBERT base p1](https://huggingface.co/indobenchmark/indobert-base-p1) |
|
- **Model type**: Encoder-based BERT language model |
|
- **Language(s)**: Indonesian, Acehnese, Balinese, Banjarese, Buginese, Gorontalo, Javanese, Banyumasan, Minangkabau, Malay, Nias, Sundanese, Tetum |
|
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html) |
|
- **Contact**: [LazarusNLP](https://lazarusnlp.github.io/) |
|
|
|
## Use in 🤗Transformers |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForMaskedLM |
|
|
|
model_checkpoint = "LazarusNLP/NusaBERT-base" |
|
|
|
# Load the tokenizer and the masked-LM checkpoint from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
|
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint) |
|
``` |
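
As a quick usage check, the checkpoint can also be queried through the `fill-mask` pipeline. The Indonesian prompt below is only an illustration; like other BERT-style models, NusaBERT uses `[MASK]` as its mask token:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="LazarusNLP/NusaBERT-base")

# Indonesian for "The capital city of Indonesia is [MASK]."
print(fill_mask("Ibu kota Indonesia adalah [MASK]."))
```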
|
|
|
## Training Datasets |
|
|
|
Around 16B tokens from the following corpora were used during pre-training (see the loading sketch after the list).
|
|
|
- [Indonesian Wikipedia Data Repository](https://huggingface.co/datasets/sabilmakbar/indo_wiki) |
|
- [KoPI-NLLB (Korpus Perayapan Indonesia)](https://huggingface.co/datasets/acul3/KoPI-NLLB) |
|
- [Cleaned, Enormous, and Public: The Multilingual Fuel to Democratize Large Language Models for 167 Languages](https://huggingface.co/datasets/uonlp/CulturaX) |
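
As a minimal sketch, the corpora can be pulled with 🤗 Datasets. The `"id"` (Indonesian) config of CulturaX used below is an assumption on our part; check each dataset card for the exact config names and access terms:

```python
from datasets import load_dataset

# Stream the Indonesian split of CulturaX instead of downloading it in full
culturax_id = load_dataset("uonlp/CulturaX", "id", split="train", streaming=True)
print(next(iter(culturax_id))["text"][:200])
```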
|
|
|
## Training Hyperparameters |
|
|
|
The following hyperparameters were used during training (see the `TrainingArguments` sketch after the list):
|
|
|
- `learning_rate`: 0.0003 |
|
- `train_batch_size`: 256 |
|
- `eval_batch_size`: 256 |
|
- `seed`: 42 |
|
- `optimizer`: Adam with `betas=(0.9,0.999)` and `epsilon=1e-08` |
|
- `lr_scheduler_type`: linear |
|
- `lr_scheduler_warmup_steps`: 24000 |
|
- `training_steps`: 500000 |
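
For reference, these settings map onto a 🤗 Transformers `TrainingArguments` configuration roughly as follows. This is a sketch, not our exact training script, and `output_dir` is a placeholder:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="nusabert-base",  # placeholder path
    learning_rate=3e-4,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=256,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=24_000,
    max_steps=500_000,
)
```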
|
|
|
### Framework versions |
|
|
|
- Transformers 4.37.2 |
|
- PyTorch 2.2.0+cu118
|
- Datasets 2.17.1 |
|
- Tokenizers 0.15.1 |
|
|
|
## Credits |
|
|
|
NusaBERT Base is developed with love by: |
|
|
|
<div style="display: flex;"> |
|
<a href="https://github.com/anantoj"> |
|
<img src="https://github.com/anantoj.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;"> |
|
</a> |
|
|
|
<a href="https://github.com/DavidSamuell"> |
|
<img src="https://github.com/DavidSamuell.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;"> |
|
</a> |
|
|
|
<a href="https://github.com/stevenlimcorn"> |
|
<img src="https://github.com/stevenlimcorn.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;"> |
|
</a> |
|
|
|
<a href="https://github.com/w11wo"> |
|
<img src="https://github.com/w11wo.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;"> |
|
</a> |
|
</div> |
|
|
|
## Citation |
|
|
|
```bibtex
|
@misc{wongso2024nusabert, |
|
title={NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural}, |
|
author={Wilson Wongso and David Samuel Setiawan and Steven Limcorn and Ananto Joyoadikusumo}, |
|
year={2024}, |
|
eprint={2403.01817}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL} |
|
} |
|
``` |