---
license: openrail
---


# gpt2-azerbaijani-smallv0 model for text generation 


## Introduction

gpt2-azerbaijani-smallv0 is a language model for Azerbaijani based on the GPT-2 small architecture.

It was trained on Azerbaijani Wikipedia using transfer learning and fine-tuning, in about 29 hours on a single NVIDIA Tesla K80 GPU.
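
As a rough illustration of what that fine-tuning stage looks like, here is a minimal sketch using the Hugging Face `Trainer` API. It is not the exact training pipeline used for this model: the data files (`az_wiki_train.txt`, `az_wiki_valid.txt`), batch size and sequence length are placeholders, the learning rate is only inferred from the checkpoint file name, and the base English GPT-2 tokenizer is used here for simplicity even though the published model ships its own tokenizer.

```python
# Minimal fine-tuning sketch (illustrative only; the original pipeline may differ).
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical plain-text dumps of Azerbaijani Wikipedia articles.
raw = load_dataset("text", data_files={"train": "az_wiki_train.txt",
                                       "validation": "az_wiki_valid.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM objective

args = TrainingArguments(
    output_dir="gpt2-azerbaijani-small",
    num_train_epochs=3,             # matches the reported 3 epochs
    per_device_train_batch_size=4,  # placeholder
    learning_rate=2e-3,             # assumption, inferred from the checkpoint file name
    fp16=True,                      # requires a CUDA GPU
)

Trainer(model=model, args=args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["validation"],
        data_collator=collator).train()
```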

## Model

| Model                    | #params | Model file (pt) | Arch.       | Training / Validation data (text)                    |
|--------------------------|---------|-----------------|-------------|------------------------------------------------------|
| gpt2-azerbaijani-smallv0 | 124M    | 652             | GPT-2 small | Azerbaijani Wikipedia (110k articles / 19k articles) |


Training results after 3 epochs: loss 5.17, accuracy 23.99%, perplexity 95.88.


## How to use gpt2-azerbaijani-smallv0 with HuggingFace (PyTorch)

The following code uses PyTorch.

```python
import torch
from transformers import GPT2LMHeadModel, AutoTokenizer

# Load the Azerbaijani tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("nijatzeynalov/gpt2-azerbaijani-small")
tokenizer.model_max_length = 1024

# Load the fine-tuned weights into the GPT-2 small architecture
model_state_dict = torch.load('GPT2_pt_3epoch_lr2e-3.pth', map_location=torch.device('cpu'))
model = GPT2LMHeadModel.from_pretrained('gpt2', state_dict=model_state_dict)

model.eval()

text = "Your prompt here"
inputs = tokenizer(text, return_tensors="pt")

# Sample a continuation of the prompt
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True,
                                max_length=20,
                                top_k=10,
                                num_return_sequences=1)

# Print the generated sequence(s)
for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(i + 1, tokenizer.decode(sample_output.tolist())))
```
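
If the fine-tuned weights are also hosted in the `nijatzeynalov/gpt2-azerbaijani-small` Hub repository (the snippet above only pulls the tokenizer from there, so this is an assumption), the `text-generation` pipeline is a shorter way to get the same kind of output:

```python
from transformers import pipeline

# Assumes the fine-tuned weights live in the same Hub repo as the tokenizer;
# otherwise fall back to the state-dict loading shown above.
generator = pipeline("text-generation", model="nijatzeynalov/gpt2-azerbaijani-small")

result = generator("Your prompt here", max_length=20, do_sample=True, top_k=10,
                   num_return_sequences=1)
print(result[0]["generated_text"])
```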

## Bias

The training data used for this model comes from Azerbaijani Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral. As the OpenAI team themselves point out in their model card:

> Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases that require the generated text to be true. Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race, and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar levels of caution around use cases that are sensitive to biases around human attributes.

## Limitations

__This model was developed for research purposes, to explore the application of the GPT-2 model to the Azerbaijani language. Due to resource limitations the results it produces are of low quality, and the current version is not recommended for use in commercial projects.__

Since my current resources are limited, I plan to return to this model later and improve the results:

* Add more training data in the Azerbaijani language; I plan to find and add 500k+ articles from various resources, not just Wikipedia.
* Clean the training dataset more thoroughly; currently, due to lack of resources, only minimal cleaning was done.
* Run different experiments using a more powerful GPU; only the 1cycle policy was tested for fine-tuning (see the sketch after this list).
* Increase the number of epochs; with the current GPU (1 x NVIDIA Tesla K80), one epoch takes about 9 hours ($0.90/hr). Considering the goal of the project and other resources, I found it acceptable to stop at 3 epochs.
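
For reference, this is roughly what the 1cycle policy looks like in plain PyTorch with `torch.optim.lr_scheduler.OneCycleLR`. The tiny model and data below are dummies, and the peak learning rate is only taken from the checkpoint file name, so treat this as an illustration of the schedule rather than the actual training loop:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy model and data, just so the schedule can run end-to-end.
model = nn.Linear(10, 2)
loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))),
                    batch_size=8)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=2e-3,                    # peak LR, assumed from the checkpoint file name
    steps_per_epoch=len(loader),
    epochs=3,                       # matches the reported 3 epochs
)

loss_fn = nn.CrossEntropyLoss()
for epoch in range(3):
    for x, y in loader:
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()            # LR warms up, then anneals, once over the whole run
        optimizer.zero_grad()
```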

## Author

Azerbaijani GPT-2 small was trained and evaluated by [Nijat Zeynalov](https://www.linkedin.com/in/nijat-zeynalov-064163142/).