license: cc-by-nc-4.0
base_model: atlasia/Terjman-Supreme
metrics:
  - bleu
model-index:
  - name: Terjman-Supreme-v2
    results: []
language:
  - ar
  - en

Terjman-Supreme (3.3B)

Terjman-Supreme-v2 is a Transformer-based model for translating English into Moroccan Darija. It is a fine-tuned version of atlasia/Terjman-Supreme, trained on a larger and "better" dataset than the one used for the base model.

It achieves the following results on the evaluation set:

  • Loss: 0.6278
  • Bleu: 30.3749
  • Gen Len: 17.0684

Usage

You can integrate the model into your projects or workflows via the Hugging Face Transformers library. Here is a basic example of how to use it in Python:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("BounharAbdelaziz/Terjman-Supreme-v2")
model = AutoModelForSeq2SeqLM.from_pretrained("BounharAbdelaziz/Terjman-Supreme-v2")

# Define the English text to translate into Moroccan Darija
input_text = "Your English text goes here."

# Tokenize the input text
input_tokens = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)

# Perform translation
output_tokens = model.generate(**input_tokens)

# Decode the output tokens
output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print("Translation:", output_text)
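
To translate several sentences at once, the same calls can be wrapped in a small helper. The following is a minimal sketch, not part of the original model card: the function names `load_model` and `translate_batch`, and the defaults `batch_size=8` and `max_new_tokens=128`, are illustrative assumptions.

```python
from typing import List

# Repository id taken from the usage snippet in this README.
MODEL_ID = "BounharAbdelaziz/Terjman-Supreme-v2"


def load_model(model_id: str = MODEL_ID):
    """Load the tokenizer and the seq2seq model (downloads weights on first call)."""
    # Imported lazily so the helper below can be used without loading the library.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


def translate_batch(
    texts: List[str],
    tokenizer,
    model,
    batch_size: int = 8,        # hypothetical default, tune to your memory budget
    max_new_tokens: int = 128,  # hypothetical cap on the generated length
) -> List[str]:
    """Translate a list of English sentences, batch_size sentences at a time."""
    translations: List[str] = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # Pad to the longest sentence in the chunk so it forms one tensor.
        inputs = tokenizer(chunk, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations
```

A typical call would be `tokenizer, model = load_model()` followed by `translate_batch(["Hello!", "How are you?"], tokenizer, model)`. Smaller batches reduce memory use at the cost of throughput.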

Example

Here is an example of translating English input to Moroccan Darija:

Input: "Hi my friend, can you tell me a joke in moroccan darija? I'd be happy to hear that from you!"

Output: "أهلا صاحبي، واش تقدر تقول لي نكتة بالدارجة المغربية؟ غادي نكون فرحان باش نسمعها منك!"

Limitations

This version has some limitations, mainly due to the tokenizer. It sometimes handles masculine and feminine forms poorly when the gender is not clear from the context (see the example above). We are currently collecting more data with the aim of continuous improvement.

Feedback

We are working to improve the model's performance and usability, and we will be improving it incrementally. If you have any feedback or suggestions, or encounter any issues, please don't hesitate to reach out to us.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-04
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 4
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.03
  • num_epochs: 5
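
The relationship between these values can be checked with a few lines of plain Python. This is a sketch under stated assumptions: training ran on a single device, the schedule horizon equals the final step reported in the training results (273185), and the schedule follows the usual linear warmup then linear decay shape. The function names are illustrative, not taken from the training script.

```python
import math

# Values copied from the hyperparameter list.
LEARNING_RATE = 5e-4
TRAIN_BATCH_SIZE = 1
GRADIENT_ACCUMULATION_STEPS = 4
WARMUP_RATIO = 0.03

# Assumptions (not stated in the list): one device, and a schedule horizon
# equal to the last step shown in the training results.
NUM_DEVICES = 1
TOTAL_STEPS = 273185


def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of examples contributing to each optimizer update."""
    return per_device * accumulation_steps * num_devices


def linear_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_batch = effective_batch_size(TRAIN_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS, NUM_DEVICES)
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)

print("total_train_batch_size:", total_batch)
print("warmup steps:", warmup_steps)
print("lr at end of warmup:", linear_lr(warmup_steps, LEARNING_RATE, warmup_steps, TOTAL_STEPS))
```

Under these assumptions the effective batch size comes out to 4, matching total_train_batch_size, and the learning rate peaks at 5e-04 once warmup ends.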

Training results

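| Training Loss | Epoch  | Step   | Validation Loss | Bleu    | Gen Len |
|---------------|--------|--------|-----------------|---------|---------|
| 0.6753        | 1.0000 | 54637  | 0.6586          | 25.9586 | 17.3736 |
| 0.4205        | 2.0000 | 109274 | 0.5992          | 29.0316 | 17.0512 |
| 0.3246        | 3.0000 | 163911 | 0.6129          | 29.8734 | 16.9746 |
| 0.2611        | 4.0000 | 218549 | 0.6250          | 30.4110 | 17.0806 |
| 0.2609        | 5.0000 | 273185 | 0.6278          | 30.3749 | 17.0684 |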
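
The per-epoch numbers can be compared programmatically, for example to see which epoch scores best on each metric. The snippet below is only an illustrative sketch: the `EpochResult` type is a hypothetical helper, and the values are copied from the training results.

```python
from typing import NamedTuple


class EpochResult(NamedTuple):
    epoch: int
    step: int
    train_loss: float
    val_loss: float
    bleu: float
    gen_len: float


# Values copied from the training results in this README.
RESULTS = [
    EpochResult(1, 54637, 0.6753, 0.6586, 25.9586, 17.3736),
    EpochResult(2, 109274, 0.4205, 0.5992, 29.0316, 17.0512),
    EpochResult(3, 163911, 0.3246, 0.6129, 29.8734, 16.9746),
    EpochResult(4, 218549, 0.2611, 0.6250, 30.4110, 17.0806),
    EpochResult(5, 273185, 0.2609, 0.6278, 30.3749, 17.0684),
]

# Pick the epoch with the highest BLEU and the one with the lowest validation loss.
best_bleu = max(RESULTS, key=lambda r: r.bleu)
lowest_val_loss = min(RESULTS, key=lambda r: r.val_loss)

print(f"Highest BLEU: epoch {best_bleu.epoch} ({best_bleu.bleu})")
print(f"Lowest validation loss: epoch {lowest_val_loss.epoch} ({lowest_val_loss.val_loss})")
```

The two criteria point at different epochs: validation loss is lowest after epoch 2, while BLEU keeps rising until epoch 4 and is nearly flat afterwards. The evaluation numbers reported at the top of this README correspond to the final epoch.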

Framework versions

  • Transformers 4.40.2
  • Pytorch 2.2.1+cu121
  • Datasets 2.19.1
  • Tokenizers 0.19.1