Update README.md
README.md CHANGED

datasets:
- Aarif1430/english-to-hindi
- Sampuran01/english-hindi-translation
---

# Model Description
The base model [sarvamai/OpenHathi-7B-Hi-v0.1-Base](https://huggingface.co/sarvamai/OpenHathi-7B-Hi-v0.1-Base) was fine-tuned using [Unsloth](https://github.com/unslothai/unsloth).

The model and tokenizer can be loaded with the 🤗 Transformers library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("damerajee/openhathi-h2e-e2h")
model = AutoModelForCausalLM.from_pretrained("damerajee/openhathi-h2e-e2h")
```

## Inference

### For English to Hindi (e2h)
```python
# Tokenize the English-to-Hindi prompt
inputs = tokenizer(
    ["[INST]translate this from english to hindi: Be a free thinker and don't accept everything you hear as truth. Be critical and evaluate what you believe in. [/INST]<s> hindi output:"],
    return_tensors="pt",
)

# Generate the Hindi translation and decode it (the prompt is included in the output)
outputs = model.generate(**inputs, max_new_tokens=18, use_cache=True)
tokenizer.batch_decode(outputs)
```

### For Hindi to English (h2e)
```python
# Tokenize the Hindi-to-English prompt
inputs = tokenizer(
    ["[INST]translate this from hindi to english: अगर तुम सूरज की तरह चमकना चाहते हो, तो सूरज की तरह जलना सीखो।[/INST]<s> english output:"],
    return_tensors="pt",
)

# Generate the English translation and decode it (the prompt is included in the output)
outputs = model.generate(**inputs, max_new_tokens=18, use_cache=True)
tokenizer.batch_decode(outputs)
```
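
`tokenizer.batch_decode(outputs)` returns the prompt together with the generated text. A minimal sketch (continuing from either example above; not part of the original card) for decoding only the newly generated tokens:

```python
# Slice off the prompt tokens so only the model's translation is decoded
prompt_length = inputs["input_ids"].shape[1]
generated_only = outputs[:, prompt_length:]
print(tokenizer.batch_decode(generated_only, skip_special_tokens=True)[0])
```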

# Dataset
* The dataset used was a combination of two datasets, which gave a total of 1,786,788 rows
* The rows were then pre-processed to look something like this:

```python
"अल्लाह से डर", तो अहंकार उसे और गुनाह पर जमा देता है। अतः उसके लिए तो जहन्नम ही काफ़ी है, और वह बहुत-ही बुरी शय्या है! '
```
* This was done for both English to Hindi and Hindi to English, hence the names h2e and e2h
* Preparing both directions in this way gives a total of over 3 million rows (a sketch of this preprocessing is shown below)
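
A minimal sketch of that preprocessing with the 🤗 `datasets` library, assuming both source datasets expose `english_sentence` and `hindi_sentence` columns (hypothetical names; check the dataset cards) and reusing the prompt format from the inference examples:

```python
from datasets import load_dataset, concatenate_datasets

# Load the two source datasets (column names below are assumptions)
ds1 = load_dataset("Aarif1430/english-to-hindi", split="train")
ds2 = load_dataset("Sampuran01/english-hindi-translation", split="train")
pairs = concatenate_datasets([ds1, ds2])  # assumes matching schemas

def to_e2h(row):
    # English -> Hindi training prompt
    return {"text": f"[INST]translate this from english to hindi: {row['english_sentence']} [/INST]<s> hindi output: {row['hindi_sentence']}"}

def to_h2e(row):
    # Hindi -> English training prompt
    return {"text": f"[INST]translate this from hindi to english: {row['hindi_sentence']} [/INST]<s> english output: {row['english_sentence']}"}

# One prompt per direction per pair, i.e. roughly twice 1,786,788 rows in total
train_data = concatenate_datasets([pairs.map(to_e2h), pairs.map(to_h2e)])
```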

# Training details
* The model was loaded in 4-bit
* The target modules include "q_proj", "k_proj", "v_proj", "o_proj"
* Training took approximately 2 hours
* The fine-tuning was done on a free Google Colab instance with a single T4 GPU (huge thanks to Unsloth for this)
* Even though the full dataset was almost 3 million rows, the LoRA model was fine-tuned on only 1 million rows per direction (a sketch of the setup is shown below)
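
A minimal sketch of such a setup with Unsloth (the rank, alpha, and sequence length below are illustrative assumptions, not the values actually used for this model):

```python
from unsloth import FastLanguageModel

# Load the base model in 4-bit quantized form
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="sarvamai/OpenHathi-7B-Hi-v0.1-Base",
    max_seq_length=2048,   # assumed; not stated on the card
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention projections listed above
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # assumed LoRA rank
    lora_alpha=16,         # assumed LoRA alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# `model` can then be passed to a standard trainer (e.g. trl's SFTTrainer) on the prompts built above
```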

# Limitations
The model was not trained on the full dataset and has not been evaluated much, so any contributions would be helpful.