metadata
license: apache-2.0
language:
  - en
  - hi
library_name: transformers
pipeline_tag: translation
tags:
  - translation
  - Bilingual
datasets:
  - Aarif1430/english-to-hindi
  - Sampuran01/english-hindi-translation
metrics:
  - bleu

Model Description

This model is a merged LoRA fine-tune of the base model sarvamai/OpenHathi-7B-Hi-v0.1-Base, trained using Unsloth. It can translate from English to Hindi and from Hindi to English.

Steps to try the model:

Load the model

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("damerajee/openhathi-h2e-e2h")
model = AutoModelForCausalLM.from_pretrained("damerajee/openhathi-h2e-e2h")

Inference

For English to Hindi (e2h)

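# Tokenize the prompt; the trailing "hindi output:" cue asks the model for the Hindi translation
inputs = tokenizer(["[INST]translate this from english to hindi: Be a free thinker and don't accept everything you hear as truth. Be critical and evaluate what you believe in. [/INST]<s> hindi output:"], return_tensors="pt")

# Generate up to 18 new tokens; raise max_new_tokens for longer translations
outputs = model.generate(**inputs, max_new_tokens=18, use_cache=True)
print(tokenizer.batch_decode(outputs))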

For Hindi to English (h2e)

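# Tokenize the prompt; the trailing "english output:" cue asks the model for the English translation
inputs = tokenizer(["[INST]translate this from hindi to english: अगर तुम सूरज की तरह चमकना चाहते हो, तो सूरज की तरह जलना सीखो।[/INST]<s> english output:"], return_tensors="pt")

# Generate up to 18 new tokens; raise max_new_tokens for longer translations
outputs = model.generate(**inputs, max_new_tokens=18, use_cache=True)
print(tokenizer.batch_decode(outputs))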
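
Because this is a causal language model, the decoded string contains the prompt followed by the generated text, including special tokens. A minimal sketch of pulling out just the translation is shown below. The helper name `extract_translation` and the sample string are hypothetical and not part of the original model card; the sample is hand-written, not real model output.

```python
def extract_translation(decoded: str, marker: str) -> str:
    """Return the text after the last occurrence of `marker`, without <s>/</s>."""
    # Everything up to and including the marker is the echoed prompt.
    # If the marker is missing, rpartition leaves the whole string in `tail`.
    _, _, tail = decoded.rpartition(marker)
    # Drop sentence-boundary tokens that may remain in the decoded string.
    for token in ("<s>", "</s>"):
        tail = tail.replace(token, "")
    return tail.strip()


# Hand-written stand-in for one element of tokenizer.batch_decode(outputs)
sample = "<s> [INST]translate this from hindi to english: ... [/INST]<s> english output: PLACEHOLDER TRANSLATION</s>"
print(extract_translation(sample, "english output:"))
```

With the snippets above, this could be called as `extract_translation(tokenizer.batch_decode(outputs)[0], "hindi output:")` for e2h, or with `"english output:"` for h2e.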

Dataset

  • The training data is a combination of two datasets, giving a total of 1,786,788 rows
  • The rows were then preprocessed to look like this:
[INST]translate this from english to hindi: When it is said to him: 'Fear Allah' egotism takes him in his sin. Gehenna (Hell) shall be enough for him. How evil a cradling! [/INST] hindi output: और जब उससे कहा जाता है,
"अल्लाह से डर", तो अहंकार उसे और गुनाह पर जमा देता है। अतः उसके लिए तो जहन्नम ही काफ़ी है, और वह बहुत-ही बुरी शय्या है!
  • This was done for both English to Hindi and Hindi to English, hence the names e2h and h2e
  • Formatting every pair in both directions gives a total of more than 3 million rows
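
The preprocessing described above can be sketched roughly as follows. This is an illustration only: the function name `build_examples` is hypothetical, the e2h template is copied from the example row, and the h2e template is assumed by symmetry with the inference prompt rather than confirmed by the author.

```python
def build_examples(english: str, hindi: str) -> list[str]:
    """Return the e2h and h2e training strings for one sentence pair."""
    # e2h template, following the example row shown in this section
    e2h = (
        f"[INST]translate this from english to hindi: {english} [/INST]"
        f" hindi output: {hindi}"
    )
    # h2e template (assumed to mirror the e2h one)
    h2e = (
        f"[INST]translate this from hindi to english: {hindi} [/INST]"
        f" english output: {english}"
    )
    return [e2h, h2e]


# Each sentence pair yields two rows, one per translation direction.
pairs = [("How are you?", "आप कैसे हैं?")]
rows = [text for en, hi in pairs for text in build_examples(en, hi)]
print(len(rows))  # 2
```

Emitting both directions from every pair is what doubles the row count relative to the combined source datasets.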

Training details

  • The model was loaded in 4-bit precision
  • The LoRA target modules include "q_proj", "k_proj", "v_proj", "o_proj"
  • The fine-tuning was done on a free Google Colab instance with a single T4 GPU (huge thanks to Unsloth for this)
  • Even though the full dataset was almost 3 million rows, the LoRA model was fine-tuned on only 1 million rows for each language

Limitations

The model was not trained on the full dataset, and little evaluation has been done, so any contributions will be helpful.

As of right now this is a smaller model. A better model trained on a better dataset will be released.