Commit 8341167 by dkawahara · 1 Parent(s): 31161df

Updated README.md.

Files changed (1)
  1. README.md +32 -4
README.md CHANGED
@@ -15,20 +15,48 @@ widget:
 
 ## Model description
 
-This is a Japanese RoBERTa model pretrained on Japanese Wikipedia and the Japanese portion of CC-100.
+This is a Japanese RoBERTa base model pretrained on Japanese Wikipedia and the Japanese portion of CC-100.
 
 ## How to use
 
+You can use this model for masked language modeling as follows:
 ```python
-from transformers import AutoTokenizer,AutoModelForMaskedLM
-tokenizer=AutoTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese")
-model=AutoModelForMaskedLM.from_pretrained("nlp-waseda/roberta-base-japanese")
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese")
+model = AutoModelForMaskedLM.from_pretrained("nlp-waseda/roberta-base-japanese")
+
+sentence = '早稲田 大学 で 自然 言語 処理 を [MASK] する 。' # input should be segmented into words by Juman++ in advance
+encoding = tokenizer(sentence, return_tensors='pt')
+...
 ```
 
+You can use this model for fine-tuning on downstream tasks.
+
 ## Tokenization
 
 The input text should be segmented into words by [Juman++](https://github.com/ku-nlp/jumanpp) in advance. Each word is tokenized into subwords by [sentencepiece](https://github.com/google/sentencepiece).
 
 ## Vocabulary
 
+The vocabulary consists of 32000 subwords induced by the unigram language model of [sentencepiece](https://github.com/google/sentencepiece).
+
 ## Training procedure
+
+This model was trained on Japanese Wikipedia and the Japanese portion of CC-100. It took a week using eight NVIDIA A100 GPUs.
+
+The following hyperparameters were used during pretraining:
+- learning_rate: 1e-4
+- per_device_train_batch_size: 256
+- distributed_type: multi-GPU
+- num_devices: 8
+- gradient_accumulation_steps: 2
+- total_train_batch_size: 4096
+- max_seq_length: 128
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: linear
+- training_steps: 700000
+- mixed_precision_training: Native AMP
+
+## Performance on JGLUE
+
+coming soon
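
As a sanity check on the hyperparameters this commit adds, the reported `total_train_batch_size` can be reproduced from the per-device batch size, the device count, and the gradient accumulation steps. The sketch below assumes the usual convention that the effective batch size is the product of those three values. The variable names mirror the README list and are illustrative only, not taken from the actual training scripts. The token and sequence totals are upper bounds derived here, assuming every sequence fills `max_seq_length`; the README does not state them.

```python
# Values copied from the hyperparameter list in the README diff above.
per_device_train_batch_size = 256
num_devices = 8
gradient_accumulation_steps = 2
max_seq_length = 128
training_steps = 700_000

# Assumed convention: effective batch = per-device batch x devices x accumulation steps.
total_train_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)

# Upper bounds, assuming every sequence is padded or packed to max_seq_length.
max_tokens_per_step = total_train_batch_size * max_seq_length
max_sequences_seen = total_train_batch_size * training_steps

print(total_train_batch_size)  # 4096, matching the reported total_train_batch_size
print(max_tokens_per_step)     # 524288
print(max_sequences_seen)      # 2867200000
```

The product matches the listed value of 4096, so the three batch-related entries in the list are mutually consistent.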