---
license: cc-by-sa-4.0
datasets:
- wikipedia
- cc100
language:
- ja
pipeline_tag: text-generation
tags:
- gpt
- japanese
- language model
widget:
- text: 今日はいい天気なので、
---
# japanese-gpt2-medium-unidic
This is a medium-sized Japanese GPT-2 model that uses a BERT-like tokenizer.
A reversed version is published [here](https://huggingface.co/okazaki-lab/japanese-reversed-gpt2-medium-unidic/).
# How to use
The model depends on [PyTorch](https://pytorch.org/), [fugashi](https://github.com/polm/fugashi) with [unidic-lite](https://github.com/polm/unidic-lite), and [Hugging Face Transformers](https://github.com/huggingface/transformers).
```sh
pip install torch torchvision torchaudio
pip install 'fugashi[unidic-lite]'
pip install transformers
```
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic')
model = AutoModelForCausalLM.from_pretrained('okazaki-lab/japanese-gpt2-medium-unidic')

text = '今日はいい天気なので、'

# Build the prompt: prepend [BOS] and drop the [CLS] and [SEP] tokens added by the BERT tokenizer
bos = tokenizer.convert_tokens_to_ids(['[BOS]'])  # [32768]
input_ids = bos + tokenizer.encode(text)[1:-1]
input_ids = torch.tensor(input_ids).unsqueeze(0)  # add a batch dimension

# Sample one continuation of up to 30 new tokens
output = model.generate(
    input_ids,
    do_sample=True,
    max_new_tokens=30,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.0,
    num_return_sequences=1,
    pad_token_id=0,
    eos_token_id=32769,
)[0]

print(tokenizer.decode(output))
```
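The prompt construction in the snippet above can be isolated as a small helper. The following is a minimal sketch that works on plain lists of ids, so it runs without downloading the model; `build_prompt_ids` and the example ids are hypothetical names and values introduced here for illustration, and the `[BOS]` id 32768 is taken from the comment in the snippet above.

```python
BOS_ID = 32768  # id of '[BOS]', as noted in the usage snippet


def build_prompt_ids(encoded, bos_id=BOS_ID):
    """Drop the leading [CLS] and trailing [SEP], then prepend [BOS].

    `encoded` is the list returned by a BERT-style tokenizer.encode(text),
    which wraps the text ids as [CLS] ... [SEP].
    """
    if len(encoded) < 2:
        raise ValueError("expected at least [CLS] and [SEP] in the encoded ids")
    return [bos_id] + list(encoded[1:-1])


# Made-up ids for illustration: 2 and 3 stand in for [CLS] and [SEP].
example = [2, 11, 12, 13, 3]
prompt_ids = build_prompt_ids(example)
print(prompt_ids)
```

Standard Transformers tokenizers also accept `add_special_tokens=False` in `encode`, which should make the `[1:-1]` slice unnecessary; we have not verified this against this particular tokenizer.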
# Model architecture
Transformer-based Language Model
- Layers: 24
- Heads: 16
- Dimensions of hidden states: 1024
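As a rough illustration of what these numbers imply, the sketch below estimates the per-head dimension and the parameter count. It assumes the standard GPT-2 block layout (fused QKV projection, 4x MLP expansion, two LayerNorms per block, tied input and output embeddings) and a context length of 1024 positions; the context length is an assumption and is not stated in this card. The vocabulary size is the one given in the Tokenization section.

```python
def gpt2_param_count(n_layer, d_model, vocab_size, n_positions):
    """Estimate the parameter count of a standard GPT-2 style model."""
    # Per block: attention (3d^2 + 3d for QKV, d^2 + d for the output projection),
    # MLP (4d^2 + 4d and 4d^2 + d), and two LayerNorms (2 * 2d).
    per_block = 12 * d_model**2 + 13 * d_model
    embeddings = vocab_size * d_model + n_positions * d_model
    final_layer_norm = 2 * d_model
    return n_layer * per_block + embeddings + final_layer_norm


n_layer, n_head, d_model = 24, 16, 1024
vocab_size = 32771   # from the Tokenization section
n_positions = 1024   # assumption: not stated in this card

head_dim = d_model // n_head
total = gpt2_param_count(n_layer, d_model, vocab_size, n_positions)
print(f"head dimension: {head_dim}")
print(f"estimated parameters: {total / 1e6:.1f}M")
```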
# Training
We used a [codebase](https://github.com/rinnakk/japanese-pretrained-models) provided by rinna Co., Ltd. for training.
The model was trained on Japanese CC-100 and Japanese Wikipedia (2022/01/31).
Training took 17 days on 8 A100 GPUs.
The perplexity on the validation set is 9.80.
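To put the figures above in other units, the sketch below converts the reported perplexity into an average cross-entropy loss and the training setup into GPU-days. It assumes the usual definition of perplexity as the exponential of the mean per-token cross-entropy in nats, measured over the model's own subword tokens; this card does not spell out the definition.

```python
import math

perplexity = 9.80   # validation perplexity reported above
gpus, days = 8, 17  # training setup reported above

# Assumed definition: perplexity = exp(mean cross-entropy in nats per token)
loss_nats = math.log(perplexity)
loss_bits = math.log2(perplexity)
gpu_days = gpus * days

print(f"{loss_nats:.3f} nats/token, {loss_bits:.3f} bits/token, {gpu_days} GPU-days")
```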
# Tokenization
Our tokenizer is based on [the one](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) provided by Tohoku NLP Group.
Text is first segmented into words by MeCab, and each word is then split into subwords by WordPiece.
The vocabulary size is 32771 (32768 original tokens + 2 special tokens + 1 unused token).
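The second stage can be illustrated with a toy greedy longest-match WordPiece splitter. This is a sketch of the general technique, not the tokenizer's actual implementation: the `wordpiece` function and the three-entry vocabulary are made up for illustration, and the input is assumed to be a word already segmented by MeCab.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one pre-segmented word into subwords by greedy longest match.

    Non-initial pieces carry the '##' prefix, as in BERT-style vocabularies.
    If any part of the word cannot be matched, the whole word becomes `unk`.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"今日", "天", "##気"}  # hypothetical vocabulary, not the real one
for word in ["今日", "天気", "雨"]:
    print(word, wordpiece(word, toy_vocab))
```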
# License
[Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
Copyright (c) 2021, Tohoku University
Copyright (c) 2023, Tokyo Institute of Technology