---
language:
- vi
thumbnail: "url to a thumbnail used in social sharing"
tags:
- News
- Language model
- GPT2
datasets:
- Private Vietnamese News dataset
metrics:
- rouge
- wer
---

# GPT-2 Fine-tuning With Vietnamese News

## Model description

A fine-tuned Vietnamese GPT-2 model that generates Vietnamese news articles from a given context (category + headline). It is built on the pretrained Vietnamese Wiki GPT-2 model (https://huggingface.co/danghuy1999/gpt2-viwiki).

## GitHub

- https://github.com/Tuan-Lee-23/Vietnamese-News-Generative-Model

## Purpose

This model was made only for fun and experimental study. However, it gives impressive results.

Most of the generated news is fake and contains unconfirmed information. Honestly, I had a lot of fun with this project =))

## Dataset

The dataset consists of about 30k Vietnamese news articles from the website thanhnien.vn.

## Result

- Train loss: 2.3
- Val loss: 2.5
- Rouge F1: 0.556
- Word error rate: 1.08

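
A word error rate above 1 can look like a mistake, but it is possible: WER is the word-level edit distance divided by the number of words in the reference, so it exceeds 1 whenever the generated text needs more edits than the reference has words. The sketch below shows the standard definition only. How the 1.08 reported here was actually computed is not stated in this card, and `word_error_rate` is an illustrative name rather than a function from the project.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word is missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For instance, a two-word reference compared with a four-word hypothesis that shares no words with it needs two substitutions and two insertions, giving 4 / 2 = 2.0.
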
## Deployment

- You can run the model deployment from this [Colab notebook](https://colab.research.google.com/drive/1ITnYPnngd_aqkFB2A5IhzSsX4jQSPOR1?usp=sharing)
- Then go to this link: https://gptvn.loca.lt
- Choose any category, enter some text for the headline, then generate. There we go
- P/s: I tried deploying the model on Streamlit Cloud, but it kept crashing because it ran out of memory

## Example usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Available categories:
# ['thời sự', 'thế giới', 'tài chính kinh doanh', 'đời sống', 'văn hoá',
#  'giải trí', 'giới trẻ', 'giáo dục', 'công nghệ', 'sức khoẻ']
category = "thời sự"
headline = "Nam thanh niên"  # A full headline or only some text

# Prompt format: start token, category, headline separator, headline
text = f"<|startoftext|> {category} <|headline|> {headline}"

tokenizer = AutoTokenizer.from_pretrained("tuanle/VN-News-GPT2")
model = AutoModelForCausalLM.from_pretrained("tuanle/VN-News-GPT2").to(device)

# Generation settings (illustrative values, not the author's tuned ones;
# adjust them as needed)
max_len = 256
min_len = 60
top_k = 40
top_p = 0.95
num_beams = 5
num_return_sequences = 3  # must not exceed num_beams

input_ids = tokenizer.encode(text, return_tensors='pt').to(device)
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=max_len,
    min_length=min_len,
    # temperature=.8,
    top_k=top_k,
    top_p=top_p,
    num_beams=num_beams,
    early_stopping=True,
    no_repeat_ngram_size=2,
    num_return_sequences=num_return_sequences,
)

# Print each generated article, including the prompt it continues
for i, sample_output in enumerate(sample_outputs):
    temp = tokenizer.decode(sample_output.tolist())
    print(f">> Generated text {i+1}\n\n{temp}")
    print('\n---')
```
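
The prompt in the example follows a fixed pattern: `<|startoftext|>`, the category, `<|headline|>`, then the headline text. A small helper can build that string and reject categories outside the listed set. This is only a sketch: `build_prompt` and `CATEGORIES` are hypothetical names that do not come from the model repository, and the category list is copied from the example. Normalizing to NFC is an assumption, added so that differently encoded Vietnamese diacritics still compare equal.

```python
import unicodedata

# Category list copied from the example (hypothetical constant name).
CATEGORIES = tuple(
    unicodedata.normalize("NFC", name)
    for name in (
        "thời sự", "thế giới", "tài chính kinh doanh", "đời sống", "văn hoá",
        "giải trí", "giới trẻ", "giáo dục", "công nghệ", "sức khoẻ",
    )
)


def build_prompt(category: str, headline: str) -> str:
    """Return the generation prompt for a category and a (partial) headline."""
    # Trim stray whitespace and unify the Unicode form before comparing.
    category = unicodedata.normalize("NFC", category.strip())
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return f"<|startoftext|> {category} <|headline|> {headline.strip()}"
```

The returned string can be passed to `tokenizer.encode` in place of the hand-written `text` variable.
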