---
language: 
  - vi
thumbnail: "url to a thumbnail used in social sharing"
tags:
- News
- Language model
- GPT2
datasets:
- Private Vietnamese News dataset
metrics:
- rouge
- wer

---


# GPT-2 Fine-tuning With Vietnamese News
## Model description
A fine-tuned Vietnamese GPT-2 model that generates Vietnamese news articles from a given context (category + headline). It was fine-tuned from the Vietnamese Wiki GPT-2 pretrained model (https://huggingface.co/danghuy1999/gpt2-viwiki).

## Github
- https://github.com/Tuan-Lee-23/Vietnamese-News-Generative-Model

## Purpose
This model was made only for fun and experimental study; however, it gives impressive results.
Most of the generated articles are fake news with unconfirmed information. Honestly, I had a lot of fun with this project =))

## Dataset
The dataset consists of about 30k Vietnamese news articles crawled from thanhnien.vn.

## Result
- Train loss: 2.3
- Validation loss: 2.5
- ROUGE F1: 0.556
- Word error rate (WER): 1.08
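Note that WER can exceed 1.0 when the hypothesis contains more words than the reference (insertions count as errors but the denominator is the reference length), so a score of 1.08 is valid. As an illustration (a generic pure-Python sketch, not the evaluation script that produced the numbers above), WER is the word-level edit distance divided by the reference length:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = Levenshtein distance over words / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for edit distance between word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, a one-word reference against a three-word hypothesis gives two insertions and a WER of 2.0, which is how scores above 1.0 arise.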

## Deployment
- You can run the model deployment from this Colab [notebook](https://colab.research.google.com/drive/1ITnYPnngd_aqkFB2A5IhzSsX4jQSPOR1?usp=sharing)
- Then go to this link: https://gptvn.loca.lt
- Choose any category, type some text for the headline, and generate. There we go!
- P/s: I tried deploying the model on Streamlit's cloud, but it kept crashing due to running out of memory.


## Example usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


"""
Category includes: ['thời sự ', 'thế giới', 'tài chính kinh doanh', 'đời sống', 'văn hoá', 'giải trí', 'giới trẻ', 'giáo dục','công nghệ', 'sức khoẻ']
"""

category = "thời sự"
headline = "Nam thanh niên"  # A full headline or only some text

text = f"<|startoftext|> {category} <|headline|> {headline}"

tokenizer = AutoTokenizer.from_pretrained("tuanle/VN-News-GPT2")
model = AutoModelForCausalLM.from_pretrained("tuanle/VN-News-GPT2").to(device)

# Generation hyperparameters (example values; tune to taste)
max_len = 256
min_len = 60
top_k = 50
top_p = 0.95
num_beams = 5
num_return_sequences = 3

input_ids = tokenizer.encode(text, return_tensors='pt').to(device)
sample_outputs = model.generate(input_ids,
                                do_sample=True,
                                max_length=max_len,
                                min_length=min_len,
                                # temperature=.8,
                                top_k=top_k,
                                top_p=top_p,
                                num_beams=num_beams,
                                early_stopping=True,
                                no_repeat_ngram_size=2,
                                num_return_sequences=num_return_sequences)

for i, sample_output in enumerate(sample_outputs):
    temp = tokenizer.decode(sample_output.tolist())
    print(f">> Generated text {i+1}\n\n{temp}")
    print('\n---')
```
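The decoded output still begins with the prompt, i.e. the category and the `<|startoftext|>` / `<|headline|>` markers. A small hypothetical helper (my own addition, not part of the repo, and assuming the special-token strings survive decoding) to recover just the generated headline-plus-body text:

```python
def extract_body(generated: str) -> str:
    """Strip the category prefix and special-token markers from a decoded
    sequence, keeping only the text after <|headline|>."""
    marker = "<|headline|>"
    if marker in generated:
        generated = generated.split(marker, 1)[1]
    return generated.replace("<|startoftext|>", "").strip()

# Example: drop the "thời sự" category prefix from a decoded sample
print(extract_body("<|startoftext|> thời sự <|headline|> Nam thanh niên ra đường"))
```

If decoding is done with `skip_special_tokens=True` and these markers were registered as special tokens, they may already be removed, in which case this step is unnecessary.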