|
--- |
|
language: ja |
|
license: cc-by-sa-4.0 |
|
library_name: transformers |
|
datasets: |
|
- cc100 |
|
- mc4 |
|
- oscar |
|
- wikipedia |
|
- izumi-lab/cc100-ja |
|
- izumi-lab/mc4-ja-filter-ja-normal |
|
- izumi-lab/oscar2301-ja-filter-ja-normal |
|
- izumi-lab/wikipedia-ja-20230720 |
|
- izumi-lab/wikinews-ja-20230728 |
|
--- |
|
|
|
# DeBERTa V2 base Japanese |
|
|
|
This is a [DeBERTaV2](https://github.com/microsoft/DeBERTa) model pretrained on Japanese texts. |
|
The code used for pretraining is available at [retarfi/language-pretraining](https://github.com/retarfi/language-pretraining/releases/tag/v2.2.1).
|
|
|
|
|
## How to use |
|
|
|
You can use this model for masked language modeling as follows: |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForMaskedLM |
|
tokenizer = AutoTokenizer.from_pretrained("izumi-lab/deberta-v2-base-japanese", use_fast=False) |
|
model = AutoModelForMaskedLM.from_pretrained("izumi-lab/deberta-v2-base-japanese") |
|
... |
|
``` |
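
The following is a minimal, illustrative sketch of a fill-mask prediction with this model. The example sentence is arbitrary, and decoding the top candidates with `convert_ids_to_tokens` is only one possible approach.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("izumi-lab/deberta-v2-base-japanese", use_fast=False)
model = AutoModelForMaskedLM.from_pretrained("izumi-lab/deberta-v2-base-japanese")

# Illustrative input: mask one token in a Japanese sentence
text = f"東京大学で自然言語処理を{tokenizer.mask_token}する。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and inspect the top-5 candidate subwords
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```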
|
|
|
|
|
## Tokenization |
|
|
|
The model uses a SentencePiece-based tokenizer; the vocabulary was trained on Japanese Wikipedia with [sentencepiece](https://github.com/google/sentencepiece).
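
As a quick sanity check, the tokenizer can be applied directly to raw text; the sentence below is illustrative, and the exact subword split depends on the trained SentencePiece vocabulary.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("izumi-lab/deberta-v2-base-japanese", use_fast=False)

# SentencePiece subword segmentation applied directly to raw text,
# without a separate morphological analyzer
print(tokenizer.tokenize("自然言語処理の研究を行う。"))
```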
|
|
|
|
|
## Training Data |
|
|
|
We used the following corpora for pre-training: |
|
|
|
- [Japanese portion of CC-100](https://huggingface.co/datasets/izumi-lab/cc100-ja) |
|
- [Japanese portion of mC4](https://huggingface.co/datasets/izumi-lab/mc4-ja-filter-ja-normal) |
|
- [Japanese portion of OSCAR2301](https://huggingface.co/datasets/izumi-lab/oscar2301-ja-filter-ja-normal) |
|
- [Japanese Wikipedia as of July 20, 2023](https://huggingface.co/datasets/izumi-lab/wikipedia-ja-20230720) |
|
- [Japanese Wikinews as of July 28, 2023](https://huggingface.co/datasets/izumi-lab/wikinews-ja-20230728) |
|
|
|
|
|
## Training Parameters |
|
|
|
The value in parentheses is the learning rate used for additional pre-training on the financial corpus; an illustrative mapping of these hyperparameters to `TrainingArguments` is sketched after the list.
|
- learning_rate: 2.4e-4 (6e-5) |
|
- total_train_batch_size: 2,016 |
|
- max_seq_length: 512 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-06 |
|
- lr_scheduler_type: linear schedule with warmup |
|
- training_steps: 1,000,000 |
|
- warmup_steps: 100,000 |
|
- precision: FP16 |
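
For readers who want to approximate this setup with the Hugging Face `Trainer`, the sketch below maps the reported hyperparameters onto `TrainingArguments`. This is an assumption-laden illustration, not the actual training configuration: pretraining was done with [retarfi/language-pretraining](https://github.com/retarfi/language-pretraining/releases/tag/v2.2.1), and the per-device batch size / gradient accumulation split shown here is hypothetical (only the total batch size of 2,016 is reported).

```python
from transformers import TrainingArguments

# Illustrative sketch only; the batch-size decomposition (63 x 32 = 2,016) is hypothetical,
# and max_seq_length=512 is enforced at the tokenization/data-collation stage, not here.
training_args = TrainingArguments(
    output_dir="deberta-v2-base-japanese-pretraining",  # hypothetical output path
    learning_rate=2.4e-4,            # 6e-5 for additional pre-training on the financial corpus
    per_device_train_batch_size=63,  # hypothetical split; total train batch size is 2,016
    gradient_accumulation_steps=32,
    max_steps=1_000_000,
    warmup_steps=100_000,
    lr_scheduler_type="linear",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-6,
    fp16=True,
)
```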
|
|
|
|
|
## Fine-tuning on General NLU tasks |
|
|
|
We evaluated the model on JGLUE and report the average over five random seeds.
Scores for the other models are taken from the [JGLUE repository](https://github.com/yahoojapan/JGLUE).
|
|
|
|
|
| Model | JSTS (Pearson/Spearman) | JNLI (acc) | JCommonsenseQA (acc) |
|---------------------|-------------------------|-----------|-----------|
| **DeBERTaV2 base**  | **0.919/0.882**         | **0.912** | **0.859** |
| Waseda RoBERTa base | 0.913/0.873             | 0.895     | 0.840     |
| Tohoku BERT base    | 0.909/0.868             | 0.899     | 0.808     |
|
|
|
|
|
## Citation |
|
|
|
The citation information may be updated; please check this page again before citing.
|
|
|
``` |
|
@article{Suzuki-etal-2023-ipm, |
|
title = {Constructing and analyzing domain-specific language model for financial text mining}, |
|
author = {Masahiro Suzuki and Hiroki Sakaji and Masanori Hirano and Kiyoshi Izumi}, |
|
journal = {Information Processing \& Management}, |
|
volume = {60}, |
|
number = {2}, |
|
pages = {103194}, |
|
year = {2023}, |
|
doi = {10.1016/j.ipm.2022.103194} |
|
} |
|
@article{Suzuki-2024-findebertav2, |
|
jtitle = {{FinDeBERTaV2: 単語分割フリーな金融事前学習言語モデル}}, |
|
title = {{FinDeBERTaV2: Word-Segmentation-Free Pre-trained Language Model for Finance}}, |
|
jauthor = {鈴木, 雅弘 and 坂地, 泰紀 and 平野, 正徳 and 和泉, 潔}, |
|
author = {Masahiro Suzuki and Hiroki Sakaji and Masanori Hirano and Kiyoshi Izumi}, |
|
jjournal = {人工知能学会論文誌}, |
|
journal = {Transactions of the Japanese Society for Artificial Intelligence}, |
|
volume = {39}, |
|
number = {4}, |
|
pages = {FIN23-G_1-14},
|
year = {2024}, |
|
doi = {10.1527/tjsai.39-4_FIN23-G}, |
|
} |
|
``` |
|
|
|
|
|
## Licenses |
|
|
|
The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 4.0 license (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/).
|
|
|
|
|
## Acknowledgments |
|
|
|
This work was supported in part by JSPS KAKENHI Grant Number JP21K12010 and JST-Mirai Program Grant Number JPMJMI20B1, Japan.
|
|