Commit 275447a (verified), committed by w11wo · 1 parent: 5a3a538

Update README.md

Files changed (1)
  1. README.md +79 -33
README.md CHANGED
@@ -1,54 +1,78 @@
1
  ---
2
- license: mit
3
- base_model: indobenchmark/indobert-base-p1
4
  tags:
5
- - generated_from_trainer
6
- metrics:
7
- - accuracy
8
- model-index:
9
- - name: nusabert-base
10
- results: []
11
  ---
12
 
13
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
14
- should probably proofread and complete it, then remove this comment. -->
15
 
16
- # nusabert-base
17
 
18
- This model is a fine-tuned version of [indobenchmark/indobert-base-p1](https://huggingface.co/indobenchmark/indobert-base-p1) on the None dataset.
19
- It achieves the following results on the evaluation set:
20
- - Loss: 1.4876
21
- - Accuracy: 0.6866
22
 
23
- ## Model description
24
 
25
- More information needed
26
 
27
- ## Intended uses & limitations
 
 
 
 
 
28
 
29
- More information needed
30
 
31
- ## Training and evaluation data
 
32
 
33
- More information needed
34
 
35
- ## Training procedure
 
 
36
 
37
- ### Training hyperparameters
38
 
39
- The following hyperparameters were used during training:
40
- - learning_rate: 0.0003
41
- - train_batch_size: 256
42
- - eval_batch_size: 256
43
- - seed: 42
44
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
45
- - lr_scheduler_type: linear
46
- - lr_scheduler_warmup_steps: 24000
47
- - training_steps: 500000
48
 
49
- ### Training results
50
 
 
51
 
52
 
53
  ### Framework versions
54
 
@@ -56,3 +80,25 @@ The following hyperparameters were used during training:
56
  - Pytorch 2.2.0+cu118
57
  - Datasets 2.17.1
58
  - Tokenizers 0.15.1
1
  ---
2
+ license: apache-2.0
3
+ language:
4
+ - ind
5
+ - ace
6
+ - ban
7
+ - bjn
8
+ - bug
9
+ - gor
10
+ - jav
11
+ - min
12
+ - msa
13
+ - nia
14
+ - sun
15
+ - tet
16
+ language_bcp47:
17
+ - jv-x-bms
18
+ datasets:
19
+ - sabilmakbar/indo_wiki
20
+ - acul3/KoPI-NLLB
21
+ - uonlp/CulturaX
22
  tags:
23
+ - bert
24
  ---
25
 
26
+ # NusaBERT Base
 
27
 
28
+ NusaBERT Base is a multilingual encoder-based language model based on the [BERT](https://arxiv.org/abs/1810.04805) architecture. We conducted continued pre-training on open-source corpora of [sabilmakbar/indo_wiki](https://huggingface.co/datasets/sabilmakbar/indo_wiki), [acul3/KoPI-NLLB](https://huggingface.co/datasets/acul3/KoPI-NLLB), and [uonlp/CulturaX](https://huggingface.co/datasets/uonlp/CulturaX). On a held-out subset of the corpus, our model achieved:
29
 
30
+ - `eval_accuracy`: 0.6866
31
+ - `eval_loss`: 1.4876
32
+ - `perplexity`: 4.4266
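
Review note: the three metrics added above can be cross-checked against each other. The sketch below assumes, as is conventional for masked-language-model evaluation, that perplexity is the exponential of the mean cross-entropy loss in nats. The README itself does not state how the value was computed, so treat this as a plausibility check rather than the authors' procedure.

```python
import math

# Values copied from the updated README.
eval_loss = 1.4876
reported_perplexity = 4.4266

# Assumption: perplexity = exp(mean cross-entropy loss).
perplexity = math.exp(eval_loss)

# The loss is rounded to four decimals, so allow a small tolerance.
assert math.isclose(perplexity, reported_perplexity, abs_tol=1e-3)
print(f"exp(eval_loss) = {perplexity:.4f}")
```

Under that assumption the reported figures agree to within rounding of the loss.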
 
33
 
34
+ This model was trained using the [🤗Transformers](https://github.com/huggingface/transformers) PyTorch framework. All training was done on an NVIDIA H100 GPU. [LazarusNLP/NusaBERT-base](https://huggingface.co/LazarusNLP/NusaBERT-base) is released under the Apache 2.0 license.
35
 
36
+ ## Model Detail
37
 
38
+ - **Developed by**: [LazarusNLP](https://lazarusnlp.github.io/)
39
+ - **Finetuned from**: [IndoBERT base p1](https://huggingface.co/indobenchmark/indobert-base-p1)
40
+ - **Model type**: Encoder-based BERT language model
41
+ - **Language(s)**: Indonesian, Acehnese, Balinese, Banjarese, Buginese, Gorontalo, Javanese, Banyumasan, Minangkabau, Malay, Nias, Sundanese, Tetum
42
+ - **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html)
43
+ - **Contact**: [LazarusNLP](https://lazarusnlp.github.io/)
44
 
45
+ ## Use in 🤗Transformers
46
 
47
+ ```python
48
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
49
 
50
+ model_checkpoint = "LazarusNLP/NusaBERT-base"
51
 
52
+ tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
53
+ model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
54
+ ```
55
 
56
+ ## Training Datasets
57
 
58
+ Around 16B tokens from the following corpora were used during pre-training.
59
+
60
+ - [Indonesian Wikipedia Data Repository](https://huggingface.co/datasets/sabilmakbar/indo_wiki)
61
+ - [KoPI-NLLB (Korpus Perayapan Indonesia)](https://huggingface.co/datasets/acul3/KoPI-NLLB)
62
+ - [Cleaned, Enormous, and Public: The Multilingual Fuel to Democratize Large Language Models for 167 Languages](https://huggingface.co/datasets/uonlp/CulturaX)
63
 
64
+ ## Training Hyperparameters
65
 
66
+ The following hyperparameters were used during training:
67
 
68
+ - `learning_rate`: 0.0003
69
+ - `train_batch_size`: 256
70
+ - `eval_batch_size`: 256
71
+ - `seed`: 42
72
+ - `optimizer`: Adam with `betas=(0.9,0.999)` and `epsilon=1e-08`
73
+ - `lr_scheduler_type`: linear
74
+ - `lr_scheduler_warmup_steps`: 24000
75
+ - `training_steps`: 500000
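
Review note: the hyperparameters above imply a learning-rate curve that may be easier to read as code. The sketch below assumes that `lr_scheduler_type: linear` means a linear warmup from zero to the peak rate followed by a linear decay to zero at the final step, which is how the linear schedule in 🤗Transformers usually behaves. The function name `linear_schedule_lr` is hypothetical and not part of any library.

```python
def linear_schedule_lr(
    step: int,
    peak_lr: float = 3e-4,        # learning_rate from the README
    warmup_steps: int = 24_000,   # lr_scheduler_warmup_steps
    total_steps: int = 500_000,   # training_steps
) -> float:
    """Learning rate at `step` under an assumed linear warmup and linear decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay: fall linearly from peak_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Print the rate at the start, mid-warmup, peak, mid-decay, and end.
for s in (0, 12_000, 24_000, 262_000, 500_000):
    print(s, linear_schedule_lr(s))
```

With these settings the warmup covers 4.8% of training (24,000 of 500,000 steps), and the rate peaks at step 24,000.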
76
 
77
  ### Framework versions
78
 
 
80
  - Pytorch 2.2.0+cu118
81
  - Datasets 2.17.1
82
  - Tokenizers 0.15.1
83
+
84
+ ## Credits
85
+
86
+ NusaBERT Base is developed with love by:
87
+
88
+ <div style="display: flex;">
89
+ <a href="https://github.com/anantoj">
90
+ <img src="https://github.com/anantoj.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;">
91
+ </a>
92
+
93
+ <a href="https://github.com/DavidSamuell">
94
+ <img src="https://github.com/DavidSamuell.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;">
95
+ </a>
96
+
97
+ <a href="https://github.com/stevenlimcorn">
98
+ <img src="https://github.com/stevenlimcorn.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;">
99
+ </a>
100
+
101
+ <a href="https://github.com/w11wo">
102
+ <img src="https://github.com/w11wo.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;">
103
+ </a>
104
+ </div>