Update README.md
README.md CHANGED

datasets:
- Aarif1430/english-to-hindi
- Sampuran01/english-hindi-translation
---

# Model Description
The base model [sarvamai/OpenHathi-7B-Hi-v0.1-Base](https://huggingface.co/sarvamai/OpenHathi-7B-Hi-v0.1-Base) was fine-tuned using [Unsloth](https://github.com/unslothai/unsloth).

The model and tokenizer can be loaded with the 🤗 Transformers library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("damerajee/openhathi-h2e-e2h")
model = AutoModelForCausalLM.from_pretrained("damerajee/openhathi-h2e-e2h")
```

## Inference

### For English to Hindi (e2h)
```python
# Tokenize the English-to-Hindi prompt
inputs = tokenizer(
    ["[INST]translate this from english to hindi: Be a free thinker and don't accept everything you hear as truth. Be critical and evaluate what you believe in. [/INST]<s> hindi output:"],
    return_tensors="pt",
)

# Generate the Hindi translation and decode it (the prompt is included in the output)
outputs = model.generate(**inputs, max_new_tokens=18, use_cache=True)
tokenizer.batch_decode(outputs)
```

### For Hindi to English (h2e)
```python
# Tokenize the Hindi-to-English prompt
inputs = tokenizer(
    ["[INST]translate this from hindi to english: अगर तुम सूरज की तरह चमकना चाहते हो, तो सूरज की तरह जलना सीखो।[/INST]<s> english output:"],
    return_tensors="pt",
)

# Generate the English translation and decode it (the prompt is included in the output)
outputs = model.generate(**inputs, max_new_tokens=18, use_cache=True)
tokenizer.batch_decode(outputs)
```
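
`tokenizer.batch_decode(outputs)` returns the prompt together with the generated text. A minimal sketch (continuing from either example above; not part of the original card) for decoding only the newly generated tokens:

```python
# Slice off the prompt tokens so only the model's translation is decoded
prompt_length = inputs["input_ids"].shape[1]
generated_only = outputs[:, prompt_length:]
print(tokenizer.batch_decode(generated_only, skip_special_tokens=True)[0])
```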

# Dataset
* The dataset used was a combination of two datasets, which gave a total of 1,786,788 rows
* The rows were then pre-processed to look something like this:

```python
"अल्लाह से डर", तो अहंकार उसे और गुनाह पर जमा देता है। अतः उसके लिए तो जहन्नम ही काफ़ी है, और वह बहुत-ही बुरी शय्या है! '
```
* This was done for both English to Hindi and Hindi to English, hence the names h2e and e2h
* Preparing both directions in this way gives a total of over 3 million rows (a sketch of this preprocessing is shown below)
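
A minimal sketch of that preprocessing with the 🤗 `datasets` library, assuming both source datasets expose `english_sentence` and `hindi_sentence` columns (hypothetical names; check the dataset cards) and reusing the prompt format from the inference examples:

```python
from datasets import load_dataset, concatenate_datasets

# Load the two source datasets (column names below are assumptions)
ds1 = load_dataset("Aarif1430/english-to-hindi", split="train")
ds2 = load_dataset("Sampuran01/english-hindi-translation", split="train")
pairs = concatenate_datasets([ds1, ds2])  # assumes matching schemas

def to_e2h(row):
    # English -> Hindi training prompt
    return {"text": f"[INST]translate this from english to hindi: {row['english_sentence']} [/INST]<s> hindi output: {row['hindi_sentence']}"}

def to_h2e(row):
    # Hindi -> English training prompt
    return {"text": f"[INST]translate this from hindi to english: {row['hindi_sentence']} [/INST]<s> english output: {row['english_sentence']}"}

# One prompt per direction per pair, i.e. roughly twice 1,786,788 rows in total
train_data = concatenate_datasets([pairs.map(to_e2h), pairs.map(to_h2e)])
```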

# Training details
* The model was loaded in 4-bit
* The target modules include "q_proj", "k_proj", "v_proj", "o_proj"
* Training took approximately 2 hours
* The fine-tuning was done on a free Google Colab instance with a single T4 GPU (huge thanks to Unsloth for this)
* Even though the full dataset was almost 3 million rows, the LoRA model was fine-tuned on only 1 million rows per direction (a sketch of the setup is shown below)
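
A minimal sketch of such a setup with Unsloth (the rank, alpha, and sequence length below are illustrative assumptions, not the values actually used for this model):

```python
from unsloth import FastLanguageModel

# Load the base model in 4-bit quantized form
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="sarvamai/OpenHathi-7B-Hi-v0.1-Base",
    max_seq_length=2048,   # assumed; not stated on the card
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention projections listed above
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # assumed LoRA rank
    lora_alpha=16,         # assumed LoRA alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# `model` can then be passed to a standard trainer (e.g. trl's SFTTrainer) on the prompts built above
```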

# Limitations
The model was not trained on the full dataset and has not been evaluated much, so any contributions would be helpful.