|
--- |
|
license: apache-2.0 |
|
language: |
|
- ind |
|
- ace |
|
- ban |
|
- bjn |
|
- bug |
|
- gor |
|
- jav |
|
- min |
|
- msa |
|
- nia |
|
- sun |
|
- tet |
|
language_bcp47: |
|
- jv-x-bms |
|
datasets: |
|
- sabilmakbar/indo_wiki |
|
- acul3/KoPI-NLLB |
|
- uonlp/CulturaX |
|
tags: |
|
- bert |
|
--- |
|
|
|
# NusaBERT Base |
|
|
|
[NusaBERT](https://arxiv.org/abs/2403.01817) Base is a multilingual, encoder-only language model built on the [BERT](https://arxiv.org/abs/1810.04805) architecture. We conducted continued pre-training on the open-source corpora [sabilmakbar/indo_wiki](https://huggingface.co/datasets/sabilmakbar/indo_wiki), [acul3/KoPI-NLLB](https://huggingface.co/datasets/acul3/KoPI-NLLB), and [uonlp/CulturaX](https://huggingface.co/datasets/uonlp/CulturaX). On a held-out subset of the corpus, our model achieved:
|
|
|
- `eval_accuracy`: 0.6866 |
|
- `eval_loss`: 1.4876 |
|
- `perplexity`: 4.4266 |
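
The reported perplexity is simply the exponential of the evaluation loss, which doubles as a quick consistency check on the numbers above:

```python
import math

# Perplexity of a language model is exp(mean cross-entropy loss)
print(math.exp(1.4876))  # ≈ 4.4266
```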
|
|
|
This model was trained using the [🤗Transformers](https://github.com/huggingface/transformers) framework with PyTorch. All training was done on an NVIDIA H100 GPU. [LazarusNLP/NusaBERT-base](https://huggingface.co/LazarusNLP/NusaBERT-base) is released under the Apache 2.0 license.
|
|
|
## Model Details
|
|
|
- **Developed by**: [LazarusNLP](https://lazarusnlp.github.io/) |
|
- **Finetuned from**: [IndoBERT base p1](https://huggingface.co/indobenchmark/indobert-base-p1) |
|
- **Model type**: Encoder-based BERT language model |
|
- **Language(s)**: Indonesian, Acehnese, Balinese, Banjarese, Buginese, Gorontalo, Javanese, Banyumasan, Minangkabau, Malay, Nias, Sundanese, Tetum |
|
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html) |
|
- **Contact**: [LazarusNLP](https://lazarusnlp.github.io/) |
|
|
|
## Use in 🤗Transformers |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForMaskedLM |
|
|
|
model_checkpoint = "LazarusNLP/NusaBERT-base" |
|
|
|
# Load the tokenizer and the masked-LM checkpoint from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
|
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint) |
|
``` |
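
As a quick usage check, the checkpoint can also be queried through the `fill-mask` pipeline. The Indonesian prompt below is only an illustration; like other BERT-style models, NusaBERT uses `[MASK]` as its mask token:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="LazarusNLP/NusaBERT-base")

# Indonesian for "The capital city of Indonesia is [MASK]."
print(fill_mask("Ibu kota Indonesia adalah [MASK]."))
```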
|
|
|
## Training Datasets |
|
|
|
Around 16B tokens from the following corpora were used during pre-training (see the loading sketch after the list).
|
|
|
- [Indonesian Wikipedia Data Repository](https://huggingface.co/datasets/sabilmakbar/indo_wiki) |
|
- [KoPI-NLLB (Korpus Perayapan Indonesia)](https://huggingface.co/datasets/acul3/KoPI-NLLB) |
|
- [Cleaned, Enormous, and Public: The Multilingual Fuel to Democratize Large Language Models for 167 Languages](https://huggingface.co/datasets/uonlp/CulturaX) |
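
As a minimal sketch, the corpora can be pulled with 🤗 Datasets. The `"id"` (Indonesian) config of CulturaX used below is an assumption on our part; check each dataset card for the exact config names and access terms:

```python
from datasets import load_dataset

# Stream the Indonesian split of CulturaX instead of downloading it in full
culturax_id = load_dataset("uonlp/CulturaX", "id", split="train", streaming=True)
print(next(iter(culturax_id))["text"][:200])
```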
|
|
|
## Training Hyperparameters |
|
|
|
The following hyperparameters were used during training (see the `TrainingArguments` sketch after the list):
|
|
|
- `learning_rate`: 0.0003 |
|
- `train_batch_size`: 256 |
|
- `eval_batch_size`: 256 |
|
- `seed`: 42 |
|
- `optimizer`: Adam with `betas=(0.9,0.999)` and `epsilon=1e-08` |
|
- `lr_scheduler_type`: linear |
|
- `lr_scheduler_warmup_steps`: 24000 |
|
- `training_steps`: 500000 |
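
For reference, these settings map onto a 🤗 Transformers `TrainingArguments` configuration roughly as follows. This is a sketch, not our exact training script, and `output_dir` is a placeholder:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="nusabert-base",  # placeholder path
    learning_rate=3e-4,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=256,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=24_000,
    max_steps=500_000,
)
```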
|
|
|
### Framework versions |
|
|
|
- Transformers 4.37.2 |
|
- PyTorch 2.2.0+cu118
|
- Datasets 2.17.1 |
|
- Tokenizers 0.15.1 |
|
|
|
## Credits |
|
|
|
NusaBERT Base is developed with love by: |
|
|
|
<div style="display: flex;"> |
|
<a href="https://github.com/anantoj"> |
|
<img src="https://github.com/anantoj.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;"> |
|
</a> |
|
|
|
<a href="https://github.com/DavidSamuell"> |
|
<img src="https://github.com/DavidSamuell.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;"> |
|
</a> |
|
|
|
<a href="https://github.com/stevenlimcorn"> |
|
<img src="https://github.com/stevenlimcorn.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;"> |
|
</a> |
|
|
|
<a href="https://github.com/w11wo"> |
|
<img src="https://github.com/w11wo.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;"> |
|
</a> |
|
</div> |
|
|
|
## Citation |
|
|
|
```bibtex
|
@misc{wongso2024nusabert, |
|
title={NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural}, |
|
author={Wilson Wongso and David Samuel Setiawan and Steven Limcorn and Ananto Joyoadikusumo}, |
|
year={2024}, |
|
eprint={2403.01817}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL} |
|
} |
|
``` |