|
--- |
|
language: ja |
|
license: cc-by-sa-4.0 |
|
library_name: transformers |
|
datasets: |
|
- cc100 |
|
- mc4 |
|
- oscar |
|
- wikipedia |
|
- izumi-lab/cc100-ja |
|
- izumi-lab/mc4-ja-filter-ja-normal |
|
- izumi-lab/oscar2301-ja-filter-ja-normal |
|
- izumi-lab/wikipedia-ja-20230720 |
|
- izumi-lab/wikinews-ja-20230728 |
|
--- |
|
|
|
# DeBERTa V2 base Japanese |
|
|
|
This is a [DeBERTaV2](https://github.com/microsoft/DeBERTa) model pretrained on Japanese texts. |
|
The code used for pretraining is available at [retarfi/language-pretraining](https://github.com/retarfi/language-pretraining/releases/tag/v2.2.1).
|
|
|
|
|
## How to use |
|
|
|
You can use this model for masked language modeling as follows: |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForMaskedLM |
|
tokenizer = AutoTokenizer.from_pretrained("izumi-lab/deberta-v2-base-japanese", use_fast=False) |
|
model = AutoModelForMaskedLM.from_pretrained("izumi-lab/deberta-v2-base-japanese") |
|
... |
|
``` |
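
The following is a minimal, illustrative sketch of a fill-mask prediction with this model. The example sentence is arbitrary, and decoding the top candidates with `convert_ids_to_tokens` is only one possible approach.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("izumi-lab/deberta-v2-base-japanese", use_fast=False)
model = AutoModelForMaskedLM.from_pretrained("izumi-lab/deberta-v2-base-japanese")

# Illustrative input: mask one token in a Japanese sentence
text = f"東京大学で自然言語処理を{tokenizer.mask_token}する。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and inspect the top-5 candidate subwords
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```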
|
|
|
|
|
## Tokenization |
|
|
|
The model uses a SentencePiece-based tokenizer; the vocabulary was trained on Japanese Wikipedia with [sentencepiece](https://github.com/google/sentencepiece).
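
As a quick sanity check, the tokenizer can be applied directly to raw text; the sentence below is illustrative, and the exact subword split depends on the trained SentencePiece vocabulary.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("izumi-lab/deberta-v2-base-japanese", use_fast=False)

# SentencePiece subword segmentation applied directly to raw text,
# without a separate morphological analyzer
print(tokenizer.tokenize("自然言語処理の研究を行う。"))
```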
|
|
|
|
|
## Training Data |
|
|
|
We used the following corpora for pre-training: |
|
|
|
- [Japanese portion of CC-100](https://huggingface.co/datasets/izumi-lab/cc100-ja) |
|
- [Japanese portion of mC4](https://huggingface.co/datasets/izumi-lab/mc4-ja-filter-ja-normal) |
|
- [Japanese portion of OSCAR2301](https://huggingface.co/datasets/izumi-lab/oscar2301-ja-filter-ja-normal) |
|
- [Japanese Wikipedia as of July 20, 2023](https://huggingface.co/datasets/izumi-lab/wikipedia-ja-20230720) |
|
- [Japanese Wikinews as of July 28, 2023](https://huggingface.co/datasets/izumi-lab/wikinews-ja-20230728) |
|
|
|
|
|
## Training Parameters |
|
|
|
The value in parentheses is the learning rate used for additional pre-training on the financial corpus; an illustrative mapping of these hyperparameters to `TrainingArguments` is sketched after the list.
|
- learning_rate: 2.4e-4 (6e-5) |
|
- total_train_batch_size: 2,016 |
|
- max_seq_length: 512 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-06 |
|
- lr_scheduler_type: linear schedule with warmup |
|
- training_steps: 1,000,000 |
|
- warmup_steps: 100,000 |
|
- precision: FP16 |
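
For readers who want to approximate this setup with the Hugging Face `Trainer`, the sketch below maps the reported hyperparameters onto `TrainingArguments`. This is an assumption-laden illustration, not the actual training configuration: pretraining was done with [retarfi/language-pretraining](https://github.com/retarfi/language-pretraining/releases/tag/v2.2.1), and the per-device batch size / gradient accumulation split shown here is hypothetical (only the total batch size of 2,016 is reported).

```python
from transformers import TrainingArguments

# Illustrative sketch only; the batch-size decomposition (63 x 32 = 2,016) is hypothetical,
# and max_seq_length=512 is enforced at the tokenization/data-collation stage, not here.
training_args = TrainingArguments(
    output_dir="deberta-v2-base-japanese-pretraining",  # hypothetical output path
    learning_rate=2.4e-4,            # 6e-5 for additional pre-training on the financial corpus
    per_device_train_batch_size=63,  # hypothetical split; total train batch size is 2,016
    gradient_accumulation_steps=32,
    max_steps=1_000_000,
    warmup_steps=100_000,
    lr_scheduler_type="linear",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-6,
    fp16=True,
)
```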
|
|
|
|
|
## Fine-tuning on General NLU tasks |
|
|
|
We evaluated the model on JGLUE and report the average over five random seeds.
Scores for the other models are taken from the [JGLUE repository](https://github.com/yahoojapan/JGLUE).
|
|
|
|
|
| Model | JSTS (Pearson/Spearman) | JNLI (acc) | JCommonsenseQA (acc) |
|---------------------|-------------------------|-----------|-----------|
| **DeBERTaV2 base**  | **0.919/0.882**         | **0.912** | **0.859** |
| Waseda RoBERTa base | 0.913/0.873             | 0.895     | 0.840     |
| Tohoku BERT base    | 0.909/0.868             | 0.899     | 0.808     |
|
|
|
|
|
## Citation |
|
|
|
The citation information may be updated; please check this page again before citing.
|
|
|
``` |
|
@article{Suzuki-etal-2023-ipm, |
|
title = {Constructing and analyzing domain-specific language model for financial text mining}, |
|
author = {Masahiro Suzuki and Hiroki Sakaji and Masanori Hirano and Kiyoshi Izumi}, |
|
journal = {Information Processing \& Management}, |
|
volume = {60}, |
|
number = {2}, |
|
pages = {103194}, |
|
year = {2023}, |
|
doi = {10.1016/j.ipm.2022.103194} |
|
} |
|
@article{Suzuki-2024-findebertav2, |
|
jtitle = {{FinDeBERTaV2: 単語分割フリーな金融事前学習言語モデル}}, |
|
title = {{FinDeBERTaV2: Word-Segmentation-Free Pre-trained Language Model for Finance}}, |
|
jauthor = {鈴木, 雅弘 and 坂地, 泰紀 and 平野, 正徳 and 和泉, 潔}, |
|
author = {Masahiro Suzuki and Hiroki Sakaji and Masanori Hirano and Kiyoshi Izumi}, |
|
jjournal = {人工知能学会論文誌}, |
|
journal = {Transactions of the Japanese Society for Artificial Intelligence}, |
|
volume = {39}, |
|
number = {4}, |
|
pages = {FIN23-G_1-14},
|
year = {2024}, |
|
doi = {10.1527/tjsai.39-4_FIN23-G}, |
|
} |
|
``` |
|
|
|
|
|
## Licenses |
|
|
|
The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 4.0 license (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/).
|
|
|
|
|
## Acknowledgments |
|
|
|
This work was supported in part by JSPS KAKENHI Grant Number JP21K12010 and JST-Mirai Program Grant Number JPMJMI20B1, Japan.
|
|