---
license: cc-by-sa-4.0
---
# japanese-reversed-gpt2-medium-unidic
This is a medium-sized Japanese **reversed** GPT-2 model that uses a BERT-like tokenizer.
Unlike most language models, this model generates text from right to left.
The non-reversed version is published [here](https://huggingface.co/okazaki-lab/japanese-gpt2-medium-unidic/).
# How to use
The model depends on [PyTorch](https://pytorch.org/), [fugashi](https://github.com/polm/fugashi) with [unidic-lite](https://github.com/polm/unidic-lite), and [Hugging Face Transformers](https://github.com/huggingface/transformers).
```sh
pip install torch torchvision torchaudio
pip install fugashi[unidic-lite]
pip install transformers
```
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained('okazaki-lab/japanese-reversed-gpt2-medium-unidic')
model = AutoModelForCausalLM.from_pretrained('okazaki-lab/japanese-reversed-gpt2-medium-unidic')

text = 'ので、散歩に行きました。'

bos = tokenizer.convert_tokens_to_ids(['[BOS]'])  # [32768]
# Remove the [CLS] and [SEP] tokens added by the BERT tokenizer, then reverse the sequence
input_ids = bos + tokenizer.encode(text)[1:-1][::-1]
input_ids = torch.tensor(input_ids).unsqueeze(0)

output = model.generate(
    input_ids,
    do_sample=True,
    max_new_tokens=30,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.0,
    num_return_sequences=1,
    pad_token_id=0,
    eos_token_id=32769,
)[0].flip(0)  # flip back to left-to-right order before decoding

print(tokenizer.decode(output))
```
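The reversal bookkeeping above can be wrapped in a small helper. The following sketch is illustrative rather than an official API: the function name `generate_prefix` and its defaults are assumptions, and it reuses the `tokenizer`, `model`, and `torch` objects from the snippet above.
```python
def generate_prefix(suffix, tokenizer, model, max_new_tokens=30):
    """Sample text that precedes `suffix` with the reversed LM (illustrative helper).

    Strips [CLS]/[SEP], reverses the suffix tokens, generates a continuation,
    and flips the result back so the decoded string reads left to right.
    """
    bos = tokenizer.convert_tokens_to_ids(['[BOS]'])       # [32768]
    ids = bos + tokenizer.encode(suffix)[1:-1][::-1]
    input_ids = torch.tensor(ids).unsqueeze(0)
    output = model.generate(
        input_ids,
        do_sample=True,
        max_new_tokens=max_new_tokens,
        top_k=50,
        top_p=0.95,
        pad_token_id=0,
        eos_token_id=32769,                                 # [EOS] in this vocabulary
    )[0].flip(0)
    return tokenizer.decode(output)

print(generate_prefix('ので、散歩に行きました。', tokenizer, model))
```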
# Model architecture
Transformer-based Language Model
- Layers: 24
- Heads: 16
- Dimensions of hidden states: 1024
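These values can be checked from the model configuration. A minimal sketch, assuming the standard GPT-2 config attributes (`n_layer`, `n_head`, `n_embd`) exposed by Transformers:
```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained('okazaki-lab/japanese-reversed-gpt2-medium-unidic')
print(config.n_layer, config.n_head, config.n_embd)  # expected: 24 16 1024
```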
# Training
We used a [codebase](https://github.com/rinnakk/japanese-pretrained-models) provided by rinna Co., Ltd. for training.
The model was trained on Japanese CC-100 and Japanese Wikipedia (2022/01/31).
We employed 8 A100 GPUs for 17 days.
The perplexity on the validation set is 9.79.
# Tokenization
Our tokenizer is based on [the one](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) provided by Tohoku NLP Group.
The texts are first segmented into words by MeCab with the UniDic dictionary and then split into subwords by WordPiece.
The vocabulary size is 32771 (32768 original tokens + 2 special tokens + 1 unused token).
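As a quick check, the tokenizer can be inspected directly. A minimal sketch; the example sentence is arbitrary, and the exact subword split depends on the UniDic dictionary:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('okazaki-lab/japanese-reversed-gpt2-medium-unidic')
print(len(tokenizer))                            # vocabulary size (32771, as noted above)
print(tokenizer.tokenize('散歩に行きました。'))  # MeCab word segmentation followed by WordPiece subwords
```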
# License
[Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
Copyright (c) 2021, Tohoku University
Copyright (c) 2023, Tokyo Institute of Technology