|
--- |
|
license: cc-by-sa-4.0 |
|
datasets: |
|
- wikipedia |
|
- cc100 |
|
language: |
|
- ja |
|
pipeline_tag: text-generation |
|
tags: |
|
- gpt |
|
- japanese |
|
- language model |
|
widget: |
|
- text: 今日はいい天気なので、 |
|
--- |
|
# japanese-gpt2-medium-unidic |
|
This is a medium-sized Japanese GPT-2 model that uses a BERT-like tokenizer.
|
|
|
A reversed version of this model is published [here](https://huggingface.co/okazaki-lab/japanese-reversed-gpt2-medium-unidic/).
|
|
|
# How to use |
|
The model depends on [PyTorch](https://pytorch.org/), [fugashi](https://github.com/polm/fugashi) with [unidic-lite](https://github.com/polm/unidic-lite), and [Hugging Face Transformers](https://github.com/huggingface/transformers). |
|
|
|
```sh |
|
pip install torch torchvision torchaudio |
|
pip install fugashi[unidic-lite] |
|
pip install transformers |
|
``` |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
import torch |
|
tokenizer = AutoTokenizer.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic') |
|
model = AutoModelForCausalLM.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic') |
|
|
|
text = '今日はいい天気なので、' |
|
|
|
bos = tokenizer.convert_tokens_to_ids(['[BOS]']) # [32768] |
|
input_ids = bos + tokenizer.encode(text)[1:-1]  # drop the [CLS] and [SEP] tokens added by the BERT tokenizer
|
input_ids = torch.tensor(input_ids).unsqueeze(0) |
|
output = model.generate( |
|
input_ids, |
|
do_sample=True, |
|
max_new_tokens=30, |
|
top_k=50, |
|
top_p=0.95, |
|
repetition_penalty=1.0, |
|
num_return_sequences=1, |
|
pad_token_id=0, |
|
eos_token_id=32769, |
|
)[0] |
|
|
|
print(tokenizer.decode(output)) |
|
``` |
|
|
|
# Model architecture |
|
Transformer-based Language Model |
|
- Layers: 24 |
|
- Heads: 16 |
|
- Dimensions of hidden states: 1024 |
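These values can be read directly from the checkpoint's configuration; a minimal sketch, assuming the checkpoint uses the standard GPT-2 configuration field names in Transformers:

```python
from transformers import AutoConfig

# Load the configuration shipped with the checkpoint and print the
# architecture hyperparameters listed above.
config = AutoConfig.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic')

print(config.n_layer)  # expected: 24 layers
print(config.n_head)   # expected: 16 attention heads
print(config.n_embd)   # expected: 1024-dimensional hidden states
```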
|
|
|
# Training |
|
We used a [codebase](https://github.com/rinnakk/japanese-pretrained-models) provided by rinna Co., Ltd. for training. |
|
|
|
The model was trained on Japanese CC-100 and Japanese Wikipedia (as of 2022/01/31).
|
We employed 8 A100 GPUs for 17 days. |
|
The perplexity on the validation set is 9.80. |
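For reference, the sketch below shows one way to compute perplexity for a single text with this checkpoint; it is not the validation setup behind the figure above, and the sample sentence is only illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic')
model = AutoModelForCausalLM.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic')
model.eval()

text = '今日はいい天気なので、散歩に出かけた。'  # "The weather is nice today, so I went for a walk."

# Drop the [CLS]/[SEP] tokens added by the BERT-style tokenizer, as in the usage example above.
input_ids = torch.tensor(tokenizer.encode(text)[1:-1]).unsqueeze(0)

with torch.no_grad():
    # With labels=input_ids, the model returns the mean cross-entropy of next-token prediction.
    loss = model(input_ids, labels=input_ids).loss

print(torch.exp(loss).item())  # perplexity of this single text
```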
|
|
|
# Tokenization |
|
Our tokenizer is based on [the one](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) provided by Tohoku NLP Group. |
|
Texts are first segmented into words by MeCab and then split into subwords by WordPiece.
|
|
|
The vocabulary size is 32771 (32768 original tokens + 2 special tokens + 1 unused token). |
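A minimal sketch of how the tokenizer can be inspected; the expected values in the comments come from the description above and from the generation example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic')

# MeCab word segmentation followed by WordPiece subword splitting.
print(tokenizer.tokenize('今日はいい天気なので、'))

print(len(tokenizer))                              # expected: 32771 (vocabulary size)
print(tokenizer.convert_tokens_to_ids(['[BOS]']))  # expected: [32768], as used in the generation example
```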
|
|
|
# License |
|
[Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/) |
|
|
|
Copyright (c) 2021, Tohoku University |
|
|
|
Copyright (c) 2023, Tokyo Institute of Technology |