---
language: ta
datasets:
- oscar
- IndicNLP
- Wiki-Tamil novels scraped data

widget:
- text: 'ஆதித்த கரிகாலர் தஞ்சைக்குச் செல்ல உடனடியாக ஒப்புக்கொண்டார்.'

- text: 'நந்தினி பெரிய பழுவேட்டரையரை உண்மையாக நேசித்தால் '

- text: 'மதுராந்தகருக்கு இராஜ்யமாளும் விருப்பம் இருப்பதாக இல்லை '

---

# GPT2-Kalki
## Model description
GPT2-Kalki is a GPT-2 transformer model fine-tuned on a corpus of Tamil-language data from Wikipedia. It has been fine-tuned specifically on the works of [Kalki Krishnamurthy](https://en.wikipedia.org/wiki/Kalki_Krishnamurthy), a 20th-century Tamil writer.
This model is an experiment in "what if" scenarios using the characters of his novels. The recently released film [Ponniyin Selvan - I](https://en.wikipedia.org/wiki/Ponniyin_Selvan:_I) is based on a novel by the same author.
Training started from [GPT2-Tamil](https://huggingface.co/abinayam/gpt-2-tamil), a model that had already been trained on a Tamil dataset.

## Dataset Used:
The GPT-2 model is trained on the [oscar dataset - ta](https://huggingface.co/datasets/oscar), the [IndicNLP dataset - ta](https://indicnlp.ai4bharat.org/corpora/), and a manually scraped Wikipedia dataset consisting specifically of stories and novels.
The scraped dataset will be released soon.

## Usage
You can use this model for Tamil text generation:
```python
# Assumes the checkpoint provides PyTorch weights. AutoModelForCausalLM
# replaces the deprecated AutoModelWithLMHead.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('tsaditya/GPT-Kalki')
model = AutoModelForCausalLM.from_pretrained('tsaditya/GPT-Kalki')

text = "ஆதித்த கரிகாலர் தஞ்சைக்குச் செல்ல உடனடியாக ஒப்புக்கொண்டார். "
# Encode the prompt as PyTorch tensors to match the PyTorch model class.
encoded_text = tokenizer.encode(text, return_tensors='pt')

# Sample one continuation (top-k / nucleus sampling, not beam search).
sample_output = model.generate(
    encoded_text,
    do_sample=True,
    max_length=512,
    top_k=50,
    top_p=0.95,
    num_return_sequences=1,
    no_repeat_ngram_size=3,
    temperature=0.7,
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
```
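The same sampling settings can also be passed through the `pipeline` helper, which handles tokenization and decoding internally. The following is a minimal sketch, not the author's own code: the names `DEFAULT_SAMPLING`, `build_generation_kwargs`, and `generate_stories` are hypothetical helpers introduced here for illustration, and it assumes the `tsaditya/GPT-Kalki` checkpoint loads with the `text-generation` pipeline.

```python
# Sampling settings copied from the usage example in this model card.
DEFAULT_SAMPLING = {
    "do_sample": True,
    "max_length": 512,
    "top_k": 50,
    "top_p": 0.95,
    "num_return_sequences": 1,
    "no_repeat_ngram_size": 3,
    "temperature": 0.7,
}


def build_generation_kwargs(**overrides):
    """Return the default sampling settings with any overrides applied."""
    kwargs = dict(DEFAULT_SAMPLING)  # copy so the defaults stay untouched
    kwargs.update(overrides)
    return kwargs


def generate_stories(prompts, model_id="tsaditya/GPT-Kalki", **overrides):
    """Generate one continuation per prompt (hypothetical helper).

    Assumes the checkpoint works with the text-generation pipeline.
    The import is deferred so the settings helper above can be used
    without transformers installed.
    """
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_id)
    kwargs = build_generation_kwargs(**overrides)
    # With num_return_sequences=1 the pipeline returns a one-element list
    # of dicts, each holding the text under "generated_text".
    return [generator(prompt, **kwargs)[0]["generated_text"] for prompt in prompts]
```

Calling `generate_stories(["நந்தினி பெரிய பழுவேட்டரையரை உண்மையாக நேசித்தால் "], max_length=128)` would download the model on first use and return a list with one generated string. Lowering `max_length` this way keeps quick experiments short.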
---