---
license: apache-2.0
language:
- ind
- ace
- ban
- bjn
- bug
- gor
- jav
- min
- msa
- nia
- sun
- tet
language_bcp47:
- jv-x-bms
datasets:
- sabilmakbar/indo_wiki
- acul3/KoPI-NLLB
- uonlp/CulturaX
tags:
- bert
---
# NusaBERT Base
[NusaBERT](https://arxiv.org/abs/2403.01817) Base is a multilingual encoder-based language model built on the [BERT](https://arxiv.org/abs/1810.04805) architecture. We continued pre-training on the open-source corpora [sabilmakbar/indo_wiki](https://huggingface.co/datasets/sabilmakbar/indo_wiki), [acul3/KoPI-NLLB](https://huggingface.co/datasets/acul3/KoPI-NLLB), and [uonlp/CulturaX](https://huggingface.co/datasets/uonlp/CulturaX). On a held-out subset of the corpus, our model achieved:
- `eval_accuracy`: 0.6866
- `eval_loss`: 1.4876
- `perplexity`: 4.4266
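The loss and perplexity above are related in the usual way for masked language modeling. The short sketch below assumes that perplexity was computed as the exponential of the mean cross-entropy loss (in nats), which is the common convention but is not stated explicitly in this card.

```python
import math

# Evaluation loss reported above (assumed to be mean cross-entropy in nats).
eval_loss = 1.4876

# Perplexity is the exponential of the mean cross-entropy loss.
perplexity = math.exp(eval_loss)

# The result is close to the reported 4.4266; any small difference
# comes from the loss being rounded to four decimal places.
print(f"perplexity = {perplexity:.4f}")
```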
This model was trained with the [🤗Transformers](https://github.com/huggingface/transformers) library on PyTorch. All training was done on an NVIDIA H100 GPU. [LazarusNLP/NusaBERT-base](https://huggingface.co/LazarusNLP/NusaBERT-base) is released under the Apache 2.0 license.
## Model Details
- **Developed by**: [LazarusNLP](https://lazarusnlp.github.io/)
- **Finetuned from**: [IndoBERT base p1](https://huggingface.co/indobenchmark/indobert-base-p1)
- **Model type**: Encoder-based BERT language model
- **Language(s)**: Indonesian, Acehnese, Balinese, Banjarese, Buginese, Gorontalo, Javanese, Banyumasan, Minangkabau, Malay, Nias, Sundanese, Tetum
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html)
- **Contact**: [LazarusNLP](https://lazarusnlp.github.io/)
## Use in 🤗Transformers
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
model_checkpoint = "LazarusNLP/NusaBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
```
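Since the checkpoint is loaded with a masked-language-modeling head, one way to try it out is the `fill-mask` pipeline. The helper below is a minimal sketch, not part of the official usage: the function name `top_predictions`, the `___` placeholder convention, and the default `top_k` are assumptions made for illustration.

```python
def top_predictions(text_with_blank, top_k=5, checkpoint="LazarusNLP/NusaBERT-base"):
    """Return (token, score) pairs for the blank marked with '___' in the text.

    Hypothetical helper for illustration; '___' is our own placeholder convention.
    """
    # Imported lazily so that defining this helper does not load the library.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=checkpoint)

    # Use the tokenizer's own mask token rather than hard-coding it.
    masked = text_with_blank.replace("___", fill_mask.tokenizer.mask_token, 1)

    # For a single masked position, the pipeline returns a list of dicts
    # with keys such as "token_str" and "score".
    results = fill_mask(masked, top_k=top_k)
    return [(r["token_str"], r["score"]) for r in results]
```

For example, `top_predictions("Ibu kota Indonesia adalah ___.")` would download the checkpoint on first use and return the highest-scoring candidates for the blank. The sentence is only an illustrative input.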
## Training Datasets
Around 16B tokens from the following corpora were used during pre-training.
- [Indonesian Wikipedia Data Repository](https://huggingface.co/datasets/sabilmakbar/indo_wiki)
- [KoPI-NLLB (Korpus Perayapan Indonesia)](https://huggingface.co/datasets/acul3/KoPI-NLLB)
- [Cleaned, Enormous, and Public: The Multilingual Fuel to Democratize Large Language Models for 167 Languages](https://huggingface.co/datasets/uonlp/CulturaX)
## Training Hyperparameters
The following hyperparameters were used during training:
- `learning_rate`: 0.0003
- `train_batch_size`: 256
- `eval_batch_size`: 256
- `seed`: 42
- `optimizer`: Adam with `betas=(0.9,0.999)` and `epsilon=1e-08`
- `lr_scheduler_type`: linear
- `lr_scheduler_warmup_steps`: 24000
- `training_steps`: 500000
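To make the schedule concrete, the sketch below computes the learning rate at a given step for a linear scheduler with warmup, using the values listed above. It assumes the rate rises linearly from zero during warmup and then decays linearly to zero at the final step, which is how the default linear scheduler in 🤗Transformers behaves. The function name `linear_schedule_lr` is hypothetical.

```python
def linear_schedule_lr(step, base_lr=3e-4, warmup_steps=24_000, total_steps=500_000):
    """Learning rate at `step` for linear warmup followed by linear decay to zero.

    Illustrative sketch; defaults are taken from the hyperparameters listed above.
    """
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Print the rate at the start, mid-warmup, peak, mid-decay, and end of training.
for step in (0, 12_000, 24_000, 262_000, 500_000):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the rate peaks at `0.0003` when warmup ends at step 24,000 and is back to half of that value at step 262,000, the midpoint of the decay phase.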
### Framework versions
- Transformers 4.37.2
- PyTorch 2.2.0+cu118
- Datasets 2.17.1
- Tokenizers 0.15.1
## Credits
NusaBERT Base is developed with love by:
<div style="display: flex;">
<a href="https://github.com/anantoj">
<img src="https://github.com/anantoj.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;">
</a>
<a href="https://github.com/DavidSamuell">
<img src="https://github.com/DavidSamuell.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;">
</a>
<a href="https://github.com/stevenlimcorn">
<img src="https://github.com/stevenlimcorn.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;">
</a>
<a href="https://github.com/w11wo">
<img src="https://github.com/w11wo.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;">
</a>
</div>
## Citation
```bib
@misc{wongso2024nusabert,
title={NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural},
author={Wilson Wongso and David Samuel Setiawan and Steven Limcorn and Ananto Joyoadikusumo},
year={2024},
eprint={2403.01817},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```