---
license: cc-by-sa-4.0
---
# japanese-reversed-gpt2-medium-unidic
This is a medium-sized Japanese **reversed** GPT-2 model that uses a BERT-like tokenizer.
Unlike most language models, this model generates sentences from right to left.

The non-reversed version is published [here](https://huggingface.co/okazaki-lab/japanese-gpt2-medium-unidic/).

# How to use
The model depends on [PyTorch](https://pytorch.org/), [fugashi](https://github.com/polm/fugashi) with [unidic-lite](https://github.com/polm/unidic-lite), and [Hugging Face Transformers](https://github.com/huggingface/transformers).

```sh
pip install torch torchvision torchaudio
pip install 'fugashi[unidic-lite]'  # quoted so that shells such as zsh do not expand the brackets
pip install transformers
```

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained('okazaki-lab/japanese-reversed-gpt2-medium-unidic')
model = AutoModelForCausalLM.from_pretrained('okazaki-lab/japanese-reversed-gpt2-medium-unidic')

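# The right-hand end of a sentence ("..., so I went for a walk."); the model generates what precedes it.
text = 'ので、散歩に行きました。'

# Prepend [BOS], drop the [CLS] and [SEP] tokens added by the BERT-style tokenizer, and reverse the remaining ids.
bos = tokenizer.convert_tokens_to_ids(['[BOS]'])  # [32768]
input_ids = bos + tokenizer.encode(text)[1:-1][::-1]
input_ids = torch.tensor(input_ids).unsqueeze(0)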
output = model.generate(
    input_ids,
    do_sample=True,
    max_new_tokens=30,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.0,
    num_return_sequences=1,
    pad_token_id=0,
    eos_token_id=32769,
)[0].flip(0)  # flip the generated ids back into left-to-right reading order

print(tokenizer.decode(output))
```
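
The id manipulation in the example above can be checked without downloading the model. The following is a minimal sketch on toy ids: the helper names `to_reversed_input` and `to_reading_order` and the toy id values are hypothetical, and only the `[BOS]` id 32768 is taken from the example.

```python
BOS_ID = 32768  # id of [BOS], as printed in the example above


def to_reversed_input(encoded_ids):
    """Drop the leading [CLS] and trailing [SEP], reverse, and prepend [BOS]."""
    return [BOS_ID] + encoded_ids[1:-1][::-1]


def to_reading_order(generated_ids):
    """Flip right-to-left model output back into normal reading order."""
    return generated_ids[::-1]


# Toy ids (assumption): 2 and 3 stand in for [CLS] and [SEP]; 10, 11, 12 stand in for the text.
encoded = [2, 10, 11, 12, 3]
model_input = to_reversed_input(encoded)
print(model_input)  # [32768, 12, 11, 10]

# Pretend the model appended ids 20 and 21, i.e. the tokens that precede the prompt.
generated = model_input + [20, 21]
print(to_reading_order(generated))  # [21, 20, 10, 11, 12, 32768]
```

After the flip, the newly generated tokens appear to the left of the prompt and `[BOS]` sits at the right edge of the sequence.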

# Model architecture
Transformer-based Language Model
- Layers: 24
- Heads: 16
- Dimensions of hidden states: 1024
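
As a rough sanity check, the figures above can be turned into a per-head dimension and an approximate parameter count. This sketch assumes the standard GPT-2 block layout (about 12 × d² weights per layer) and ignores biases, layer norms, and position embeddings, so the total is only an estimate and not a number reported by the authors.

```python
n_layers = 24
n_heads = 16
d_model = 1024
vocab_size = 32771  # from the Tokenization section

# Each attention head works on an equal slice of the hidden state.
head_dim = d_model // n_heads
print(head_dim)  # 64

# Assumption: standard GPT-2 block, i.e. 4*d^2 attention weights + 8*d^2 MLP weights.
block_params = 12 * d_model ** 2 * n_layers
embedding_params = vocab_size * d_model
approx_params = block_params + embedding_params
print(f"about {approx_params / 1e6:.0f}M parameters")
```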

# Training
We used a [codebase](https://github.com/rinnakk/japanese-pretrained-models) provided by rinna Co., Ltd. for training.

The model was trained on Japanese CC-100 and Japanese Wikipedia (2022/01/31).
Training took 17 days on 8 A100 GPUs.
The perplexity on the validation set is 9.79.
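
To put these numbers in context, perplexity can be converted to an average cross-entropy loss, and the hardware figures to GPU-days. The sketch below assumes the usual definition, perplexity = exp(mean per-token negative log-likelihood in nats); the card itself does not state how the value was computed.

```python
import math

perplexity = 9.79  # validation perplexity reported above

# Assumption: perplexity = exp(mean per-token loss in nats).
loss_nats = math.log(perplexity)
print(round(loss_nats, 3))  # 2.281

# Total compute budget from the figures above.
gpu_days = 8 * 17
print(gpu_days)  # 136
```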

# Tokenization
Our tokenizer is based on [the one](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) provided by the Tohoku NLP Group.
Texts are first segmented into words by MeCab and then split into subwords by WordPiece.

The vocabulary size is 32771 (32768 original tokens + 2 special tokens + 1 unused token).
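
The breakdown can be checked against the ids used in the usage example. This sketch assumes that the two special tokens are `[BOS]` (id 32768) and the end-of-sequence token (id 32769, the `eos_token_id` passed to `generate`), and that the original tokens occupy ids 0 to 32767; the card does not spell this out.

```python
ORIGINAL_TOKENS = 32768
SPECIAL_TOKENS = 2
UNUSED_TOKENS = 1

vocab_size = ORIGINAL_TOKENS + SPECIAL_TOKENS + UNUSED_TOKENS
print(vocab_size)  # 32771

# Ids taken from the usage example; their roles are an assumption (see above).
BOS_ID = 32768
EOS_ID = 32769

# Both special ids fall right after the original vocabulary and inside the full range.
print(ORIGINAL_TOKENS <= BOS_ID < EOS_ID < vocab_size)  # True
```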

# License
[Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/)

Copyright (c) 2021, Tohoku University

Copyright (c) 2023, Tokyo Institute of Technology