Introduction

We scraped all the collected works of Mohandas Karamchand Gandhi (aka Mahatma Gandhi) from here and cleaned the text so that it contains only Gandhi's own writings, without footnotes, titles, and other text.

After cleaning, we found that the corpus contains 755,468 sentences written by Gandhi.

We first fine-tuned GPT-2 for 1 epoch on the English corpus (after cleaning*) of AI4Bharat.

Since the above dataset contains news about the Indian subcontinent, we expected this fine-tuning to familiarize the model with India-specific terms.

We then further fine-tuned this model on sentences written by Gandhi for 3 epochs.
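
Below is a minimal sketch of this two-stage fine-tuning with Hugging Face `transformers`; the file names (`ai4bharat_en.txt`, `gandhi_sentences.txt`), batch size, and sequence length are placeholders, not the exact settings used here.

```python
# Two-stage causal-LM fine-tuning sketch (placeholder file names and hyperparameters).
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

def finetune(text_file, output_dir, epochs):
    # Load a plain-text file where each line is one sentence.
    ds = load_dataset("text", data_files=text_file)["train"]
    ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
    args = TrainingArguments(output_dir=output_dir,
                             num_train_epochs=epochs,
                             per_device_train_batch_size=8,
                             save_strategy="epoch")
    Trainer(model=model, args=args, train_dataset=ds,
            data_collator=collator).train()

# Stage 1: 1 epoch on the cleaned AI4Bharat English corpus.
finetune("ai4bharat_en.txt", "gpt2-ai4bharat", epochs=1)
# Stage 2: 3 epochs on Gandhi's sentences.
finetune("gandhi_sentences.txt", "gpt2-gandhi", epochs=3)
```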

Here is the Colab link with a working example.
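
For quick inference, a generation call along these lines should work; the repository id below is a placeholder for this model's actual Hugging Face path.

```python
# Text-generation example; replace the placeholder with this model's repo id.
from transformers import pipeline

generator = pipeline("text-generation", model="<this-repo-id>")
print(generator("Non-violence is", max_new_tokens=50, do_sample=True)[0]["generated_text"])
```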

*Before cleaning, the corpus had 54M sentences; after cleaning, 42M. We simply kept those English sentences which end with a full stop.
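
A sketch of that filter, assuming one sentence per line; the input and output file names are placeholders.

```python
# Keep only sentences that end with a full stop (placeholder file names).
with open("ai4bharat_en_raw.txt", encoding="utf-8") as src, \
     open("ai4bharat_en.txt", "w", encoding="utf-8") as dst:
    for line in src:
        sentence = line.strip()
        if sentence.endswith("."):
            dst.write(sentence + "\n")
```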
