metadata
license: apache-2.0
language:
- ind
- ace
- ban
- bjn
- bug
- gor
- jav
- min
- msa
- nia
- sun
- tet
language_bcp47:
- jv-x-bms
datasets:
- sabilmakbar/indo_wiki
- acul3/KoPI-NLLB
- uonlp/CulturaX
tags:
- bert
NusaBERT Base
NusaBERT Base is a multilingual encoder language model built on the BERT architecture. We conducted continued pre-training on the open-source corpora sabilmakbar/indo_wiki, acul3/KoPI-NLLB, and uonlp/CulturaX. On a held-out subset of the corpus, our model achieved:
- eval_accuracy: 0.6866
- eval_loss: 1.4876
- perplexity: 4.4266
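The loss and perplexity figures above are related. As a minimal sketch, assuming the reported perplexity is the exponential of the mean evaluation cross-entropy loss (the usual convention for masked language modeling evaluation), the value can be recomputed from the loss. The function name `perplexity_from_loss` is illustrative and not part of the model's code.

```python
import math


def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


# Evaluation loss reported in this model card.
eval_loss = 1.4876
perplexity = perplexity_from_loss(eval_loss)
print(f"perplexity from eval_loss: {perplexity:.4f}")
```

The recomputed value lands within about 0.001 of the reported 4.4266; the small gap is presumably due to rounding of the printed loss.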
This model was trained using the 🤗Transformers PyTorch framework. All training was done on an NVIDIA H100 GPU. LazarusNLP/NusaBERT-base is released under the Apache 2.0 license.
Model Detail
- Developed by: LazarusNLP
- Finetuned from: IndoBERT base p1
- Model type: Encoder-based BERT language model
- Language(s): Indonesian, Acehnese, Balinese, Banjarese, Buginese, Gorontalo, Javanese, Banyumasan, Minangkabau, Malay, Nias, Sundanese, Tetum
- License: Apache 2.0
- Contact: LazarusNLP
Use in 🤗Transformers
from transformers import AutoTokenizer, AutoModelForMaskedLM
model_checkpoint = "LazarusNLP/NusaBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
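Since the checkpoint is loaded as a masked language model, one quick way to try it is mask filling. The following is a hedged sketch using the 🤗Transformers `fill-mask` pipeline. The helper names `build_masked_sentence` and `predict_masked_token`, the `{mask}` placeholder, and the example sentence are assumptions made for illustration and do not come from the model authors.

```python
def build_masked_sentence(template: str, mask_token: str) -> str:
    """Replace the "{mask}" placeholder with the tokenizer's mask token."""
    return template.replace("{mask}", mask_token)


def predict_masked_token(template: str, top_k: int = 5):
    """Return (token, score) pairs for the masked position in `template`."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="LazarusNLP/NusaBERT-base")
    # Use the tokenizer's own mask token instead of hard-coding it.
    sentence = build_masked_sentence(template, fill_mask.tokenizer.mask_token)
    # Each prediction is a dict that includes "token_str" and "score".
    return [(p["token_str"], p["score"]) for p in fill_mask(sentence, top_k=top_k)]
```

For example, `predict_masked_token("Ibu kota Indonesia adalah {mask}.")` downloads the checkpoint on first use and returns the top five candidate tokens with their scores.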
Training Datasets
Around 16B tokens from the following corpora were used during pre-training.
- Indonesian Wikipedia Data Repository
- KoPI-NLLB (Korpus Perayapan Indonesia)
- CulturaX (Cleaned, Enormous, and Public: The Multilingual Fuel to Democratize Large Language Models for 167 Languages)
Training Hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0003
- train_batch_size: 256
- eval_batch_size: 256
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 24000
- training_steps: 500000
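To make the scheduler settings concrete, here is a minimal sketch of the learning rate at a given step. It assumes the `linear` scheduler type means linear warmup from zero to the peak rate followed by linear decay to zero at the final step, which is how the linear schedule in 🤗Transformers behaves. The function name `linear_schedule_lr` and the constant names are illustrative.

```python
# Values taken from the hyperparameter list above.
PEAK_LR = 3e-4
WARMUP_STEPS = 24_000
TOTAL_STEPS = 500_000


def linear_schedule_lr(
    step: int,
    peak_lr: float = PEAK_LR,
    warmup_steps: int = WARMUP_STEPS,
    total_steps: int = TOTAL_STEPS,
) -> float:
    """Learning rate at `step` under linear warmup then linear decay."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to the peak rate.
        return peak_lr * step / warmup_steps
    # Decay phase: ramp linearly from the peak rate down to 0.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 12_000, 24_000, 262_000, 500_000):
    print(step, linear_schedule_lr(step))
```

Under this assumption the rate peaks at step 24000 and falls back to half of the peak at step 262000, the midpoint of the decay phase.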
Framework versions
- Transformers 4.37.2
- Pytorch 2.2.0+cu118
- Datasets 2.17.1
- Tokenizers 0.15.1
Credits
NusaBERT Base is developed with love by: