---
license: openrail
---
# gpt2-azerbaijani-smallv0 model for text generation

## Introduction

gpt2-azerbaijani-smallv0 is a language model for Azerbaijani based on the GPT-2 small model.

It was trained on Azerbaijani Wikipedia using transfer learning and fine-tuning techniques in about 29 hours on a single GPU (1 x NVIDIA Tesla K80).

## Model

| Model                    | #params | Model file (pt) | Arch.       | Training / validation data (text)                    |
|--------------------------|---------|-----------------|-------------|------------------------------------------------------|
| gpt2-azerbaijani-smallv0 | 124M    | 652             | GPT-2 small | Azerbaijani Wikipedia (110k articles / 19k articles)  |

After 3 epochs: loss 5.17, accuracy 23.99%, perplexity 95.88.
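
For context, the reported perplexity is the exponential of the mean cross-entropy loss. The snippet below is a minimal sketch of how such a value can be measured with the `transformers` API; the evaluation text is a placeholder and this is not the author's evaluation script, so it will not reproduce the exact numbers above.

```python
import math

import torch
from transformers import GPT2LMHeadModel, AutoTokenizer

# Placeholder evaluation text; the metrics above come from the author's own evaluation setup.
text = "Your validation text here"

tokenizer = AutoTokenizer.from_pretrained("nijatzeynalov/gpt2-azerbaijani-small")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # stand-in; load the fine-tuned state dict as in the usage section below
model.eval()

enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss over the sequence.
    loss = model(**enc, labels=enc["input_ids"]).loss

print(f"loss = {loss.item():.2f}, perplexity = {math.exp(loss.item()):.2f}")
```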
## How to use gpt2-azerbaijani-smallv0 with HuggingFace (PyTorch)

The following code uses PyTorch.
```python
import torch
from transformers import GPT2LMHeadModel, AutoTokenizer

# Load the tokenizer published with this model and match GPT-2's context size.
tokenizer = AutoTokenizer.from_pretrained("nijatzeynalov/gpt2-azerbaijani-small")
tokenizer.model_max_length = 1024

# Load the fine-tuned weights into the GPT-2 small architecture.
model_state_dict = torch.load('GPT2_pt_3epoch_lr2e-3.pth', map_location=torch.device('cpu'))
model = GPT2LMHeadModel.from_pretrained('gpt2', state_dict=model_state_dict)

model.eval()

text = "Your prompt here"
inputs = tokenizer(text, return_tensors="pt")

# Sample one continuation of up to 20 tokens with top-k sampling.
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True,
                                max_length=20,
                                top_k=10,
                                num_return_sequences=1)

# Print the generated sequence(s).
for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(i + 1, tokenizer.decode(sample_output.tolist())))
```
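Optionally (this is not part of the original instructions above), once the state dict has been loaded, the model and tokenizer can be saved in the standard `transformers` format so that later scripts can load them with a single `from_pretrained` call; the output directory name below is only an example.

```python
import torch
from transformers import GPT2LMHeadModel, AutoTokenizer

# Rebuild the model exactly as in the snippet above.
tokenizer = AutoTokenizer.from_pretrained("nijatzeynalov/gpt2-azerbaijani-small")
state_dict = torch.load("GPT2_pt_3epoch_lr2e-3.pth", map_location=torch.device("cpu"))
model = GPT2LMHeadModel.from_pretrained("gpt2", state_dict=state_dict)

# Save both pieces in the standard Hugging Face layout (directory name is illustrative).
model.save_pretrained("gpt2-azerbaijani-smallv0")
tokenizer.save_pretrained("gpt2-azerbaijani-smallv0")

# Later: model = GPT2LMHeadModel.from_pretrained("gpt2-azerbaijani-smallv0")
```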
## Bias

The training data used for this model comes from Azerbaijani Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral. As the OpenAI team themselves point out in their model card:

> Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases that require the generated text to be true. Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race, and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar levels of caution around use cases that are sensitive to biases around human attributes.
## Limitations

This model was developed as research into applying the GPT-2 model to the Azerbaijani language. Due to resource limitations, the results it produces are of very low quality, and the current version is not recommended for use in commercial projects.

Since my current resources are limited, I will return to this model later. I plan to improve the results as follows:

* Add more Azerbaijani training data; I plan to find and add 500k+ articles from various resources, not just Wikipedia.
* Clean the training dataset more thoroughly; currently, due to lack of resources, cleaning is hardly done.
* Run different experiments using a more powerful GPU; only the 1cycle policy fine-tuning technique has been tested (a sketch of this schedule follows the list).
* Increase the number of epochs; with the current GPU (1 x NVIDIA Tesla K80), one epoch takes about 9 hours ($0.90/hr). Considering the goal of the project and other resources, I found it acceptable to stop at 3 epochs.
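
For reference, the 1cycle policy mentioned above is available in PyTorch as `torch.optim.lr_scheduler.OneCycleLR`. The sketch below uses illustrative hyperparameters (the peak learning rate of 2e-3 follows the checkpoint filename above; the step counts are placeholders), not the exact training configuration of this model.

```python
import torch
from transformers import GPT2LMHeadModel

# Illustrative 1cycle setup; learning rate and step counts are placeholders.
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=2e-3,           # peak learning rate of the cycle
    epochs=3,              # matches the 3 epochs reported above
    steps_per_epoch=1000,  # placeholder: batches per epoch in the real training set
)

# In the training loop, step the scheduler once per batch after optimizer.step():
#   optimizer.step()
#   scheduler.step()
```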
## Author

Azerbaijani GPT-2 small was trained and evaluated by [Nijat Zeynalov](https://www.linkedin.com/in/nijat-zeynalov-064163142/).