|
--- |
|
license: cc-by-sa-4.0 |
|
datasets: |
|
- wikipedia |
|
- cc100 |
|
language: |
|
- ja |
|
pipeline_tag: text-generation |
|
tags: |
|
- gpt |
|
- japanese |
|
- language model |
|
widget: |
|
- text: 今日はいい天気なので、 |
|
--- |
|
# japanese-gpt2-medium-unidic |
|
This is a medium-sized Japanese GPT-2 model that uses a BERT-like tokenizer.
|
|
|
A reversed version of this model is published [here](https://huggingface.co/okazaki-lab/japanese-reversed-gpt2-medium-unidic/).
|
|
|
# How to use |
|
The model depends on [PyTorch](https://pytorch.org/), [fugashi](https://github.com/polm/fugashi) with [unidic-lite](https://github.com/polm/unidic-lite), and [Hugging Face Transformers](https://github.com/huggingface/transformers). |
|
|
|
```sh |
|
pip install torch torchvision torchaudio |
|
pip install fugashi[unidic-lite] |
|
pip install transformers |
|
``` |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
import torch |
|
tokenizer = AutoTokenizer.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic') |
|
model = AutoModelForCausalLM.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic') |
|
|
|
text = '今日はいい天気なので、' |
|
|
|
bos = tokenizer.convert_tokens_to_ids(['[BOS]']) # [32768] |
|
input_ids = bos + tokenizer.encode(text)[1:-1]  # drop the [CLS] and [SEP] tokens added by the BERT tokenizer
|
input_ids = torch.tensor(input_ids).unsqueeze(0) |
|
output = model.generate( |
|
input_ids, |
|
do_sample=True, |
|
max_new_tokens=30, |
|
top_k=50, |
|
top_p=0.95, |
|
repetition_penalty=1.0, |
|
num_return_sequences=1, |
|
pad_token_id=0, |
|
eos_token_id=32769, |
|
)[0] |
|
|
|
print(tokenizer.decode(output)) |
|
``` |
|
|
|
# Model architecture |
|
Transformer-based Language Model |
|
- Layers: 24 |
|
- Heads: 16 |
|
- Dimensions of hidden states: 1024 |
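These values can be read directly from the checkpoint's configuration; a minimal sketch, assuming the checkpoint uses the standard GPT-2 configuration field names in Transformers:

```python
from transformers import AutoConfig

# Load the configuration shipped with the checkpoint and print the
# architecture hyperparameters listed above.
config = AutoConfig.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic')

print(config.n_layer)  # expected: 24 layers
print(config.n_head)   # expected: 16 attention heads
print(config.n_embd)   # expected: 1024-dimensional hidden states
```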
|
|
|
# Training |
|
We used a [codebase](https://github.com/rinnakk/japanese-pretrained-models) provided by rinna Co., Ltd. for training. |
|
|
|
The model was trained on Japanese CC-100 and Japanese Wikipedia (as of 2022/01/31).
|
We employed 8 A100 GPUs for 17 days. |
|
The perplexity on the validation set is 9.80. |
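For reference, the sketch below shows one way to compute perplexity for a single text with this checkpoint; it is not the validation setup behind the figure above, and the sample sentence is only illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic')
model = AutoModelForCausalLM.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic')
model.eval()

text = '今日はいい天気なので、散歩に出かけた。'  # "The weather is nice today, so I went for a walk."

# Drop the [CLS]/[SEP] tokens added by the BERT-style tokenizer, as in the usage example above.
input_ids = torch.tensor(tokenizer.encode(text)[1:-1]).unsqueeze(0)

with torch.no_grad():
    # With labels=input_ids, the model returns the mean cross-entropy of next-token prediction.
    loss = model(input_ids, labels=input_ids).loss

print(torch.exp(loss).item())  # perplexity of this single text
```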
|
|
|
# Tokenization |
|
Our tokenizer is based on [the one](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) provided by Tohoku NLP Group. |
|
Texts are first segmented into words by MeCab and then split into subwords by WordPiece.
|
|
|
The vocabulary size is 32771 (32768 original tokens + 2 special tokens + 1 unused token). |
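A minimal sketch of how the tokenizer can be inspected; the expected values in the comments come from the description above and from the generation example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic')

# MeCab word segmentation followed by WordPiece subword splitting.
print(tokenizer.tokenize('今日はいい天気なので、'))

print(len(tokenizer))                              # expected: 32771 (vocabulary size)
print(tokenizer.convert_tokens_to_ids(['[BOS]']))  # expected: [32768], as used in the generation example
```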
|
|
|
# License |
|
[Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/) |
|
|
|
Copyright (c) 2021, Tohoku University |
|
|
|
Copyright (c) 2023, Tokyo Institute of Technology |