tuanle/VN-News-GPT2

	---
	language:
	- vi
	thumbnail: "url to a thumbnail used in social sharing"
	tags:
	- News
	- Language model
	- GPT2
	datasets:
	- Private Vietnamese News dataset
	metrics:
	- rouge
	- wer

	---


	# GPT-2 Fine-tuning With Vietnamese News
	## Model description
	A fine-tuned Vietnamese GPT-2 model that generates Vietnamese news articles from a context (category + headline). It is based on the pretrained Vietnamese Wiki GPT-2 model (https://huggingface.co/danghuy1999/gpt2-viwiki).

	## GitHub
	- https://github.com/Tuan-Lee-23/Vietnamese-News-Generative-Model

	## Purpose
	This model was made only for fun and experimental study. However, it gives impressive results.
	Most of the generated news is fake and contains unconfirmed information. Honestly, I had fun with this project =))

	## Dataset
	The dataset consists of about 30k Vietnamese news articles from the website thanhnien.vn.

	## Result
	- Train loss: 2.3
	- Validation loss: 2.5
	- ROUGE F1: 0.556
	- Word error rate: 1.08
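A word error rate above 1 can look like a mistake, so here is a minimal sketch of how the metric is usually defined: the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference. This assumes the figure above follows that standard definition; the function name `word_error_rate` is hypothetical and is not taken from the project's code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# A hypothesis longer than the reference, with no matching words,
# needs more edits than the reference has words.
print(word_error_rate("a b", "x y z"))  # 1.5
```

Because insertions are counted but the denominator is the reference length only, generated text that is longer than, or very different from, the reference can score above 1.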

	## Deployment
	- You can deploy the model by running this Colab [notebook](https://colab.research.google.com/drive/1ITnYPnngd_aqkFB2A5IhzSsX4jQSPOR1?usp=sharing)
	- Then open this link: https://gptvn.loca.lt
	- Choose a category, enter some text for the headline, then generate. There we go
	- P.S.: I tried to deploy the model on Streamlit's cloud, but it kept crashing because it ran out of memory


	## Example usage
	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForCausalLM

	device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

	# Supported categories:
	# ['thời sự', 'thế giới', 'tài chính kinh doanh', 'đời sống', 'văn hoá',
	#  'giải trí', 'giới trẻ', 'giáo dục', 'công nghệ', 'sức khoẻ']
	category = "thời sự"
	headline = "Nam thanh niên"  # A full headline or only some text

	# Prompt format: start token, category, headline token, headline
	text = f"<|startoftext|> {category} <|headline|> {headline}"

	tokenizer = AutoTokenizer.from_pretrained("tuanle/VN-News-GPT2")
	model = AutoModelForCausalLM.from_pretrained("tuanle/VN-News-GPT2").to(device)

	# Generation settings. The original snippet left these names undefined;
	# the values below are illustrative assumptions, not the author's settings.
	max_len = 256
	min_len = 60
	top_k = 40
	top_p = 0.95
	num_beams = 5
	num_return_sequences = 3  # must not exceed num_beams

	input_ids = tokenizer.encode(text, return_tensors='pt').to(device)
	sample_outputs = model.generate(
	    input_ids,
	    do_sample=True,
	    max_length=max_len,
	    min_length=min_len,
	    # temperature=0.8,
	    top_k=top_k,
	    top_p=top_p,
	    num_beams=num_beams,
	    early_stopping=True,
	    no_repeat_ngram_size=2,
	    num_return_sequences=num_return_sequences,
	)

	# Decode and print each generated article
	for i, sample_output in enumerate(sample_outputs):
	    temp = tokenizer.decode(sample_output.tolist())
	    print(f">> Generated text {i+1}\n\n{temp}")
	    print('\n---')
	```
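The prompt in the example is assembled by hand, so a mistyped category silently produces a context the model never saw during fine-tuning. Below is a minimal sketch of a helper that validates the category before building the prompt. The names `CATEGORIES` and `build_prompt` are hypothetical, and the sketch assumes the special tokens are `<|startoftext|>` and `<|headline|>` (treating the backslashes in the example's template as Markdown escaping).

```python
import unicodedata

# Categories listed in the example above, normalized to NFC so that
# composed and decomposed Vietnamese diacritics compare equal.
CATEGORIES = tuple(
    unicodedata.normalize("NFC", name)
    for name in (
        "thời sự", "thế giới", "tài chính kinh doanh", "đời sống", "văn hoá",
        "giải trí", "giới trẻ", "giáo dục", "công nghệ", "sức khoẻ",
    )
)


def build_prompt(category: str, headline: str) -> str:
    """Return the generation prompt for a known category and a headline."""
    normalized = unicodedata.normalize("NFC", category.strip().lower())
    if normalized not in CATEGORIES:
        raise ValueError(
            f"unknown category {category!r}; expected one of {CATEGORIES}"
        )
    # Assumed token layout: start token, category, headline token, headline.
    return f"<|startoftext|> {normalized} <|headline|> {headline.strip()}"


print(build_prompt(" Thời sự ", "Nam thanh niên"))
```

The returned string can be passed to `tokenizer.encode` in place of the hand-written `text` variable.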