damerajee committed on
Commit a9e21a9 · 1 Parent(s): 97ddd2f

Update README.md

Files changed (1):
  1. README.md +19 -12
README.md CHANGED
@@ -12,6 +12,7 @@ datasets:
  - Aarif1430/english-to-hindi
  - Sampuran01/english-hindi-translation
  ---
  # Model Description
  The base model [sarvamai/OpenHathi-7B-Hi-v0.1-Base](https://huggingface.co/sarvamai/OpenHathi-7B-Hi-v0.1-Base) was finetuned using [Unsloth](https://github.com/unslothai/unsloth)

@@ -27,30 +28,25 @@ tokenizer = AutoTokenizer.from_pretrained("damerajee/openhathi-h2e-e2h")
  model = AutoModelForCausalLM.from_pretrained("damerajee/openhathi-h2e-e2h")
  ```
  ## Inference
  ```python
- inputs = tokenizer(["[INST]translate this from english to hindi: Be a free thinker and don't accept everything you hear as truth. Be critical and evaluate what you believe in. [/INST]<s> hindi output:"]*1, return_tensors = "pt").to("cuda")

  outputs = model.generate(**inputs, max_new_tokens = 18, use_cache = True)
  tokenizer.batch_decode(outputs)

  ```
  ```python
- inputs = tokenizer(["[INST]translate this from english to hindi: Be a free thinker and don't accept everything you hear as truth. Be critical and evaluate what you believe in. [/INST]<s> hindi output:"]*1, return_tensors = "pt").to("cuda")

  outputs = model.generate(**inputs, max_new_tokens = 18, use_cache = True)
  tokenizer.batch_decode(outputs)
-
  ```

- # Training details
- * The model was loaded in 4-bit
- * The target modules include "q_proj", "k_proj", "v_proj", "o_proj"
- * The training took approximately 2 hours
- * The fine-tuning was done on a free Google Colab instance with a single T4 GPU (huge thanks to Unsloth for this)
-
-
- # Dataset
- * The dataset used was a combination of two datasets, which gave a total of 1_786_788 rows of Hindi text
  * The rows were then pre-processed to look something like this:

  ```python
@@ -58,3 +54,14 @@ tokenizer.batch_decode(outputs)
  "अल्लाह से डर", तो अहंकार उसे और गुनाह पर जमा देता है। अतः उसके लिए तो जहन्नम ही काफ़ी है, और वह बहुत-ही बुरी शय्या है! '
  ```
  * This was done for both English-to-Hindi and Hindi-to-English, hence the names h2e and e2h

  - Aarif1430/english-to-hindi
  - Sampuran01/english-hindi-translation
  ---
+
  # Model Description
  The base model [sarvamai/OpenHathi-7B-Hi-v0.1-Base](https://huggingface.co/sarvamai/OpenHathi-7B-Hi-v0.1-Base) was finetuned using [Unsloth](https://github.com/unslothai/unsloth)

  model = AutoModelForCausalLM.from_pretrained("damerajee/openhathi-h2e-e2h")
  ```
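For context, the diff only shows the tail of the loading snippet; the full pattern implied by the visible `AutoTokenizer`/`AutoModelForCausalLM` calls is the standard transformers one:

```python
# Standard Hugging Face loading, matching the calls visible in the diff context above.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("damerajee/openhathi-h2e-e2h")
model = AutoModelForCausalLM.from_pretrained("damerajee/openhathi-h2e-e2h")
```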
  ## Inference
+
+ ### For English to Hindi (e2h)
  ```python
+ inputs = tokenizer(["[INST]translate this from english to hindi: Be a free thinker and don't accept everything you hear as truth. Be critical and evaluate what you believe in. [/INST]<s> hindi output:"]*1, return_tensors = "pt")

  outputs = model.generate(**inputs, max_new_tokens = 18, use_cache = True)
  tokenizer.batch_decode(outputs)

  ```
+ ### For Hindi to English (h2e)
  ```python
+ inputs = tokenizer(["[INST]translate this from hindi to english: अगर तुम सूरज की तरह चमकना चाहते हो, तो सूरज की तरह जलना सीखो।[/INST]<s> english output:"]*1, return_tensors = "pt")

  outputs = model.generate(**inputs, max_new_tokens = 18, use_cache = True)
  tokenizer.batch_decode(outputs)
  ```
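Not part of this commit, but a small usage sketch: `model.generate` returns the prompt tokens followed by the continuation, so the translation alone can be recovered by slicing off the prompt before decoding (this assumes the `inputs`/`outputs` variables from the examples above):

```python
# Sketch only: decode just the newly generated tokens from the examples above.
prompt_len = inputs["input_ids"].shape[1]      # number of prompt tokens
generated = outputs[:, prompt_len:]            # keep only the continuation
translation = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print(translation)
```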

+ # Dataset
+ * The dataset used was a combination of two datasets, which gave a total of 1_786_788 rows
  * The rows were then pre-processed to look something like this:

  ```python

  "अल्लाह से डर", तो अहंकार उसे और गुनाह पर जमा देता है। अतः उसके लिए तो जहन्नम ही काफ़ी है, और वह बहुत-ही बुरी शय्या है! '
  ```
  * This was done for both English-to-Hindi and Hindi-to-English, hence the names h2e and e2h
+ * Doing the above for both directions gives a total of 3 million plus rows (a preprocessing sketch follows below)
+
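The exact preprocessing script is not part of this card, but a minimal sketch of the kind of formatting described above could look like the following; the column names `english_sentence`/`hindi_sentence` are assumptions about the dataset schema, not something stated here:

```python
# Sketch only: turn parallel English/Hindi rows into "[INST] ... [/INST]" prompts
# in both directions. Column names are assumed, not taken from this model card.
from datasets import load_dataset

raw = load_dataset("Aarif1430/english-to-hindi", split="train")

def to_prompts(example):
    en = example["english_sentence"]   # assumed column name
    hi = example["hindi_sentence"]     # assumed column name
    return {
        "e2h": f"[INST]translate this from english to hindi: {en} [/INST]<s> hindi output: {hi}",
        "h2e": f"[INST]translate this from hindi to english: {hi} [/INST]<s> english output: {en}",
    }

# Sampuran01/english-hindi-translation would be mapped the same way and concatenated.
prompts = raw.map(to_prompts, remove_columns=raw.column_names)
print(prompts[0]["e2h"])
```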
+ # Training details
+ * The model was loaded in 4-bit
+ * The target modules include "q_proj", "k_proj", "v_proj", "o_proj" (see the sketch after this list)
+ * The training took approximately 2 hours
+ * The fine-tuning was done on a free Google Colab instance with a single T4 GPU (huge thanks to Unsloth for this)
+ * Even though the full dataset was almost 3 million rows, the LoRA model was fine-tuned on only 1 million rows for each language
+
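A rough sketch (not the actual training notebook) of what loading the base model in 4-bit and attaching LoRA adapters on those target modules looks like with Unsloth; the rank, alpha, and sequence length below are placeholders rather than the values used for this model:

```python
# Sketch only: 4-bit base model + LoRA adapters on the q/k/v/o projections via Unsloth.
# r, lora_alpha and max_seq_length are placeholders, not the card's actual settings.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="sarvamai/OpenHathi-7B-Hi-v0.1-Base",
    max_seq_length=2048,   # placeholder
    load_in_4bit=True,     # "The model was loaded in 4-bit"
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # placeholder LoRA rank
    lora_alpha=16,         # placeholder
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
)
```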
+ # Limitations
+ The model was not trained on the full dataset and has not been evaluated much, so any contributions (especially evaluation, as sketched below) would be helpful
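For anyone who wants to contribute an evaluation, one simple starting point is corpus BLEU via `sacrebleu`; the sentence pair below is a placeholder, not an official test set:

```python
# Sketch only: corpus-level BLEU on decoded model outputs vs. reference translations.
# The single pair below is a placeholder; a real evaluation needs a held-out set.
import sacrebleu

hypotheses = ["If you want to shine like the sun, learn to burn like the sun."]       # model outputs
references = [["If you want to shine like the sun, first learn to burn like the sun."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```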