Introduction

We scraped all the collected works of Mohandas Karamchand Gandhi (aka Mahatma Gandhi) from here and cleaned the text so that it contains only Gandhi's own writings, without footnotes, titles, and other text.

After cleaning, we found that the corpus contains 755,468 sentences written by Gandhi.

We first fine-tuned GPT-2 for 1 epoch on the English corpus (after cleaning*) of AI4Bharat.

Since the above dataset contains news about the Indian subcontinent, we expected this fine-tuning to familiarize the model with India-specific terms.

We then further fine-tuned this model on sentences written by Gandhi for 3 epochs.
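
Below is a minimal sketch of this two-stage fine-tuning with Hugging Face `transformers`; the file names (`ai4bharat_en.txt`, `gandhi_sentences.txt`), batch size, and sequence length are placeholders, not the exact settings used here.

```python
# Two-stage causal-LM fine-tuning sketch (placeholder file names and hyperparameters).
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

def finetune(text_file, output_dir, epochs):
    # Load a plain-text file where each line is one sentence.
    ds = load_dataset("text", data_files=text_file)["train"]
    ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
    args = TrainingArguments(output_dir=output_dir,
                             num_train_epochs=epochs,
                             per_device_train_batch_size=8,
                             save_strategy="epoch")
    Trainer(model=model, args=args, train_dataset=ds,
            data_collator=collator).train()

# Stage 1: 1 epoch on the cleaned AI4Bharat English corpus.
finetune("ai4bharat_en.txt", "gpt2-ai4bharat", epochs=1)
# Stage 2: 3 epochs on Gandhi's sentences.
finetune("gandhi_sentences.txt", "gpt2-gandhi", epochs=3)
```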

Here is the Colab link with a working example.
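
For quick inference, a generation call along these lines should work; the repository id below is a placeholder for this model's actual Hugging Face path.

```python
# Text-generation example; replace the placeholder with this model's repo id.
from transformers import pipeline

generator = pipeline("text-generation", model="<this-repo-id>")
print(generator("Non-violence is", max_new_tokens=50, do_sample=True)[0]["generated_text"])
```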

*Before cleaning, the corpus had 54M sentences; after cleaning, 42M. We simply kept those English sentences which end with a full stop.
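
A sketch of that filter, assuming one sentence per line; the input and output file names are placeholders.

```python
# Keep only sentences that end with a full stop (placeholder file names).
with open("ai4bharat_en_raw.txt", encoding="utf-8") as src, \
     open("ai4bharat_en.txt", "w", encoding="utf-8") as dst:
    for line in src:
        sentence = line.strip()
        if sentence.endswith("."):
            dst.write(sentence + "\n")
```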
