Model Card for Japanese BART large

Model description

This is a Japanese BART large model pre-trained on Japanese Wikipedia.

How to use

You can use this model as follows:

from transformers import AutoTokenizer, MBartForConditionalGeneration
tokenizer = AutoTokenizer.from_pretrained('ku-nlp/bart-large-japanese')
model = MBartForConditionalGeneration.from_pretrained('ku-nlp/bart-large-japanese')
sentence = '京都 大学 で 自然 言語 処理 を 専攻 する 。'  # input should be segmented into words by Juman++ in advance
encoding = tokenizer(sentence, return_tensors='pt')
...

You can fine-tune this model on downstream tasks.

Tokenization

The input text must be segmented into words by Juman++ before it is passed to the tokenizer. Juman++ 2.0.0-rc3 was used for pre-training. Each word is then tokenized into subwords by SentencePiece.
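
The pre-segmentation step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the pyknp package and a `jumanpp` binary are installed, and the helper names `join_morphemes` and `segment_with_jumanpp` are hypothetical.

```python
from typing import Iterable


def join_morphemes(surfaces: Iterable[str]) -> str:
    """Join morpheme surface forms with single spaces, skipping empty or whitespace-only items."""
    return " ".join(s for s in surfaces if s and not s.isspace())


def segment_with_jumanpp(text: str) -> str:
    """Segment raw Japanese text into space-separated words with Juman++.

    Assumption: pyknp is installed and the `jumanpp` command is on PATH.
    The import is done lazily so that join_morphemes can be used without pyknp.
    """
    from pyknp import Juman

    jumanpp = Juman()  # pyknp runs the `jumanpp` command by default
    result = jumanpp.analysis(text)
    # `midasi` is the surface form of each morpheme.
    return join_morphemes(m.midasi for m in result.mrph_list())
```

The string returned by `segment_with_jumanpp` has the same space-separated shape as the `sentence` in the usage example and can be passed to the tokenizer directly. The exact segmentation may depend on the Juman++ version installed.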

Training data

We used the following corpora for pre-training:

  • Japanese Wikipedia (18M sentences)

Training procedure

We first segmented texts in the corpora into words using Juman++. Then, we built a sentencepiece model with 32000 tokens including words (JumanDIC) and subwords induced by the unigram language model of sentencepiece.
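
As a rough illustration of the vocabulary-building step, the sketch below trains a unigram SentencePiece model with 32000 tokens on an already segmented corpus. It is an assumption-laden outline rather than the authors' script: the function names and file paths are hypothetical, and the way JumanDIC words were included in the vocabulary is not described in this card, so it is not reproduced here.

```python
from typing import Any, Dict


def spm_trainer_args(corpus_path: str, model_prefix: str) -> Dict[str, Any]:
    """Build keyword arguments for SentencePieceTrainer.train.

    vocab_size and model_type follow the description above; every other
    training option is left at the library default (an assumption).
    """
    return {
        "input": corpus_path,          # text file with one Juman++-segmented sentence per line
        "model_prefix": model_prefix,  # writes <model_prefix>.model and <model_prefix>.vocab
        "vocab_size": 32000,
        "model_type": "unigram",
    }


def train_sentencepiece(corpus_path: str, model_prefix: str) -> None:
    """Train the model; requires the sentencepiece package (imported lazily)."""
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**spm_trainer_args(corpus_path, model_prefix))
```

Calling `train_sentencepiece("segmented_wiki.txt", "ja_bart_spm")` (both names are placeholders) would write the model and vocabulary files next to the working directory.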

We tokenized the segmented corpora into subwords using the sentencepiece model and trained the Japanese BART model using the fairseq library. Training took about one month on four Tesla V100 GPUs.

The following hyperparameters were used during pre-training:

  • distributed_type: multi-GPU
  • num_devices: 4
  • batch_size: 512
  • training_steps: 250,000
  • encoder layers: 12
  • decoder layers: 12
  • hidden size: 1024
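
To give a sense of scale, the hyperparameters above can be collected in a small structure and combined arithmetically. This is only a sketch: the class name `PretrainingConfig` is hypothetical, and it assumes that `batch_size` is the global batch size counted in sequences, which the card does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainingConfig:
    """Hyperparameters copied from the list above."""
    num_devices: int = 4
    batch_size: int = 512          # assumed: global batch, counted in sequences
    training_steps: int = 250_000
    encoder_layers: int = 12
    decoder_layers: int = 12
    hidden_size: int = 1024

    def sequences_seen(self) -> int:
        """Total sequences processed during pre-training, under the assumption above."""
        return self.batch_size * self.training_steps

    def per_device_batch(self) -> int:
        """Sequences per GPU per step if the global batch is split evenly."""
        return self.batch_size // self.num_devices


config = PretrainingConfig()
print(config.sequences_seen())    # 128000000
print(config.per_device_batch())  # 128
```

Under that assumption, pre-training processed 512 × 250,000 = 128,000,000 sequences, with 128 sequences per GPU at each step.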