---
license: apache-2.0
language:
- en
- hi
library_name: transformers
pipeline_tag: translation
tags:
- translation
- Bilingual
datasets:
- Aarif1430/english-to-hindi
- Sampuran01/english-hindi-translation
metrics:
- bleu
---
# Model Description
This model is a merged LoRA adapter fine-tuned from the base model [sarvamai/OpenHathi-7B-Hi-v0.1-Base](https://huggingface.co/sarvamai/OpenHathi-7B-Hi-v0.1-Base) using [Unsloth](https://github.com/unslothai/unsloth).
It can translate from English to Hindi and from Hindi to English.
<img src="https://cdn-uploads.huggingface.co/production/uploads/6487239cca30096ea9f52115/Rsixw_aSB-ytZT7VEQ06c.jpeg" width="500" height="500" alt="Image">
# Steps to try the model
## Load the model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("damerajee/openhathi-h2e-e2h")
model = AutoModelForCausalLM.from_pretrained("damerajee/openhathi-h2e-e2h")
```
## Inference
### For english to hindi(e2h)
```python
# The [INST] ... [/INST] template matches the prompt format used during fine-tuning
inputs = tokenizer(["[INST]translate this from english to hindi: Be a free thinker and don't accept everything you hear as truth. Be critical and evaluate what you believe in. [/INST]<s> hindi output:"], return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=18, use_cache=True)
print(tokenizer.batch_decode(outputs))
```
### For hindi to english(h2e)
```python
# Same template, with the translation direction reversed
inputs = tokenizer(["[INST]translate this from hindi to english: अगर तुम सूरज की तरह चमकना चाहते हो, तो सूरज की तरह जलना सीखो।[/INST]<s> english output:"], return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=18, use_cache=True)
print(tokenizer.batch_decode(outputs))
```
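The decoded output contains the prompt followed by the generated translation. A small helper can strip everything up to the output marker; this is a minimal sketch, and `extract_translation` is a hypothetical name, not part of the model card's code:

```python
def extract_translation(decoded: str, marker: str = "hindi output:") -> str:
    """Return only the generated text after the output marker.

    Use marker="english output:" when translating hindi to english.
    """
    # partition() returns everything after the first occurrence of marker
    _, _, translation = decoded.partition(marker)
    # drop the end-of-sequence token and surrounding whitespace
    return translation.replace("</s>", "").strip()

decoded = "<s> [INST]translate this from english to hindi: Hello [/INST]<s> hindi output: नमस्ते</s>"
print(extract_translation(decoded))
```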
# Dataset
* The dataset used was a combination of the two datasets above, giving a total of 1,786,788 rows
* Each row was then pre-processed to look something like this:
```
[INST]translate this from english to hindi: When it is said to him: 'Fear Allah' egotism takes him in his sin. Gehenna (Hell) shall be enough for him. How evil a cradling! [/INST] hindi output: और जब उससे कहा जाता है, "अल्लाह से डर", तो अहंकार उसे और गुनाह पर जमा देता है। अतः उसके लिए तो जहन्नम ही काफ़ी है, और वह बहुत-ही बुरी शय्या है!
```
* This was done in both directions, English to Hindi and Hindi to English, hence the names e2h and h2e
* Combining both directions gives a total of over 3 million rows
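The pre-processing described above can be sketched as a small helper that wraps each sentence pair in the instruction template. This is a minimal sketch inferred from the example row; `build_example` is a hypothetical name, not the actual pre-processing code:

```python
def build_example(src_text: str, tgt_text: str, direction: str = "e2h") -> str:
    """Wrap a sentence pair in the [INST] ... [/INST] template.

    direction: "e2h" for English->Hindi, "h2e" for Hindi->English.
    """
    if direction == "e2h":
        task, label = "english to hindi", "hindi output"
    else:
        task, label = "hindi to english", "english output"
    return f"[INST]translate this from {task}: {src_text} [/INST] {label}: {tgt_text}"

# One row in each direction from a single sentence pair
print(build_example("How are you?", "आप कैसे हैं?", "e2h"))
print(build_example("आप कैसे हैं?", "How are you?", "h2e"))
```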
# Training details
* The model was loaded in 4-bit
* The LoRA target modules were "q_proj", "k_proj", "v_proj", "o_proj"
* Fine-tuning was done on a free Google Colab with a single T4 GPU (huge thanks to Unsloth for this)
* Even though the full dataset was almost 3 million rows, the LoRA model was fine-tuned on only 1 million rows for each direction
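The adapter setup above can be sketched with Hugging Face `peft`. Note the card only names the target modules; the rank, alpha, and dropout values below are illustrative assumptions, and the original training used Unsloth's wrappers rather than this exact code:

```python
from peft import LoraConfig

# LoRA configuration matching the target modules listed above.
# r, lora_alpha and lora_dropout are assumed values -- the card
# does not state the exact hyperparameters used.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```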
# Limitations
The model was not trained on the full dataset and has not been thoroughly evaluated, so any contributions will be helpful.
This is currently a smaller model; a better model trained on a better dataset will be released.