---
license: apache-2.0
language:
- en
- hi
library_name: transformers
pipeline_tag: translation
tags:
- translation
- Bilingual
datasets:
- Aarif1430/english-to-hindi
- Sampuran01/english-hindi-translation
metrics:
- bleu
---

# Model Description

This model is a merged LoRA fine-tune of the base model [sarvamai/OpenHathi-7B-Hi-v0.1-Base](https://huggingface.co/sarvamai/OpenHathi-7B-Hi-v0.1-Base), trained with [Unsloth](https://github.com/unslothai/unsloth).

It translates from English to Hindi and from Hindi to English.

# Steps to try the model :

## Load the model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Download the tokenizer and the merged model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("damerajee/openhathi-h2e-e2h")
model = AutoModelForCausalLM.from_pretrained("damerajee/openhathi-h2e-e2h")
```

## Inference

### For english to hindi(e2h)

```python
inputs = tokenizer(["[INST]translate this from english to hindi: Be a free thinker and don't accept everything you hear as truth. Be critical and evaluate what you believe in. [/INST] hindi output:"]*1, return_tensors = "pt")

# max_new_tokens caps the length of the generated translation; raise it for longer sentences
outputs = model.generate(**inputs, max_new_tokens = 18, use_cache = True)
print(tokenizer.batch_decode(outputs))
```

### For hindi to english(h2e)

```python
inputs = tokenizer(["[INST]translate this from hindi to english: अगर तुम सूरज की तरह चमकना चाहते हो, तो सूरज की तरह जलना सीखो।[/INST] english output:"]*1, return_tensors = "pt")

# max_new_tokens caps the length of the generated translation; raise it for longer sentences
outputs = model.generate(**inputs, max_new_tokens = 18, use_cache = True)
print(tokenizer.batch_decode(outputs))
```

# Dataset

* The training data combines two datasets, for a total of 1,786,788 rows
* The rows were then pre-processed to look like this:

```text
[INST]translate this from english to hindi: When it is said to him: \'Fear Allah\' egotism takes him in his sin. Gehenna (Hell) shall be enough for him. How evil a cradling! 
[/INST] hindi output: और जब उससे कहा जाता है, "अल्लाह से डर", तो अहंकार उसे और गुनाह पर जमा देता है। अतः उसके लिए तो जहन्नम ही काफ़ी है, और वह बहुत-ही बुरी शय्या है! '
```

* This was done for both English to Hindi and Hindi to English, hence the names h2e and e2h
* Formatting every pair in both directions yields more than 3 million rows

# Training details

* The model was loaded in 4-bit
* The LoRA target modules are "q_proj", "k_proj", "v_proj", "o_proj"
* The fine-tuning was done on a free Google Colab instance with a single T4 GPU (huge thanks to Unsloth for this)
* Even though the full dataset had more than 3 million rows, the LoRA model was fine-tuned on only 1 million rows for each language

# Limitations

The model was not trained on the full dataset, and little evaluation has been done, so any contributions will be helpful. As of right now this is a smaller model; a better model trained on a better dataset will be released.
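
# Prompt format sketch

The pre-processing described in the Dataset section can be sketched roughly as below. This is a minimal illustration, not the script that was actually used: the helper names `build_prompt` and `build_training_rows` are hypothetical, and the exact whitespace around `[/INST]` is an assumption, since the examples in this card are not fully consistent on it.

```python
from typing import List


def build_prompt(text: str, source: str, target: str) -> str:
    """Build an inference prompt in the style shown in this card.

    Hypothetical helper; the spacing around [/INST] is an assumption.
    """
    return f"[INST]translate this from {source} to {target}: {text} [/INST] {target} output:"


def build_training_rows(english: str, hindi: str) -> List[str]:
    """Turn one sentence pair into two rows, one per direction (e2h and h2e)."""
    e2h = build_prompt(english, "english", "hindi") + " " + hindi
    h2e = build_prompt(hindi, "hindi", "english") + " " + english
    return [e2h, h2e]


# One pair yields two training rows
rows = build_training_rows("Hello", "नमस्ते")
for row in rows:
    print(row)

# Row count stated in the Dataset section, doubled by formatting both directions
PAIRS = 1_786_788
total_rows = 2 * PAIRS
print(total_rows)  # 3573576
```

The same `build_prompt` helper can be reused at inference time, so that the prompt passed to `tokenizer` follows the template the model saw during fine-tuning.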