Model Card: English–Faroese Translation Adapter

Model Details

Model Description

Developed by: Barbara Scalvini
Model type: Language model adapter for English → Faroese translation
Language(s): English, Faroese
License: This adapter inherits the license from the original Llama 3.1 8B model.
Finetuned from model: meta-llama/Meta-Llama-3.1-8B
Library used: PEFT 0.13.0

Model Sources

Paper: [COMING SOON]

Uses

Direct Use

This adapter translates English text into Faroese. It was obtained by parameter-efficient fine-tuning (PEFT) of the base model.

Downstream Use

The adapter can be integrated into broader multilingual or localization workflows.

Out-of-Scope Use

Any uses that rely on languages other than English or Faroese will likely yield suboptimal results.
Other tasks (e.g., summarization, classification) may be unsupported or require further fine-tuning.

Bias, Risks, and Limitations

Biases: The model could reflect biases present in the training data, such as historical or societal biases in English or Faroese texts.
Recommendation: Users should critically evaluate outputs, especially in sensitive or high-stakes applications.

How to Get Started with the Model

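from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Load the adapter and tokenizer from the Hugging Face Hub
# (loading an adapter repository this way relies on the peft package; 8-bit loading relies on bitsandbytes)
checkpoint_dir = "barbaroo/llama3.1_translate_8B"  # Hub repository ID of the adapter
model = AutoModelForCausalLM.from_pretrained(
    checkpoint_dir,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights to reduce GPU memory
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # padding=True below needs a pad token
MAX_SEQ_LENGTH = 512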
sentences = ["What's your name?"]

# Define the prompt template (same as in training)
alpaca_prompt = """
### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Inference loop
for sentence in sentences:
    inputs = tokenizer(
        [
            alpaca_prompt.format(
                "Translate this sentence from English to Faroese:",  # Instruction
                sentence,  # The input sentence to translate
                "",  # Leave blank for generation
            )
        ],
        return_tensors="pt",
        padding=True,
        truncation=True,  # Make sure the input is not too long
        max_length=MAX_SEQ_LENGTH  # Enforce the max length if necessary
    ).to(model.device)  # Follow the device chosen by device_map="auto"

    # Generate the translation
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,  # Limit the number of new tokens generated
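        eos_token_id=tokenizer.eos_token_id,  # Stop at the EOS token
        pad_token_id=tokenizer.pad_token_id,  # Pad finished sequences with the pad token
        do_sample=True,  # Needed for temperature and top_p to take effect; omit all three for greedy decoding
        temperature=0.1,  # Low temperature keeps the output close to greedy decoding
        top_p=1.0,  # No nucleus truncation
        use_cache=True  # Reuse the key/value cache for efficiency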
    )

    # Decode only the newly generated tokens, skipping the prompt
    prompt_length = inputs["input_ids"].shape[1]
    output_string = tokenizer.batch_decode(outputs[:, prompt_length:], skip_special_tokens=True)[0].strip()
    print(f"Input: {sentence}")
    print(f"Generated Translation: {output_string}")
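
For more than a handful of sentences, it may be convenient to translate in batches. The following is a minimal sketch, not part of the original card: `build_prompt`, `chunks`, and `translate_all` are hypothetical helper names, the batch size of 8 is an arbitrary choice, and left padding is set because decoder-only models are generally expected to be left-padded for batched generation. It assumes `model` and `tokenizer` were loaded as in the snippet above and that the tokenizer has a pad token.

```python
PROMPT = """
### Instruction:
{}

### Input:
{}

### Response:
{}"""

INSTRUCTION = "Translate this sentence from English to Faroese:"


def build_prompt(sentence):
    """Fill the prompt template, leaving the response slot empty."""
    return PROMPT.format(INSTRUCTION, sentence, "")


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_all(model, tokenizer, sentences, batch_size=8, max_new_tokens=512):
    """Translate sentences batch by batch (hypothetical helper)."""
    tokenizer.padding_side = "left"  # assumption: left padding for batched generation
    translations = []
    for batch in chunks(sentences, batch_size):
        inputs = tokenizer(
            [build_prompt(s) for s in batch],
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512,
        ).to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.pad_token_id,
        )
        # Keep only the tokens generated after the (padded) prompt
        new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
        decoded = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
        translations.extend(text.strip() for text in decoded)
    return translations
```

A call such as `translate_all(model, tokenizer, ["What's your name?", "Good morning."])` would return one string per input sentence, in order.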

Training Details

Training Data

We used the Sprotin parallel corpus for English–Faroese translation: barbaroo/Sprotin_parallel.

Training Procedure

Preprocessing

Tokenization: We used the tokenizer from the base model meta-llama/Llama-3.1-8B.
The Alpaca prompt format was used, with Instruction, Input and Response.
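
As an illustration of this format, the sketch below turns one parallel sentence pair into a training string. It reuses the template and instruction text from the inference snippet in this card. Appending the end-of-sequence token to the response is a common convention in Alpaca-style fine-tuning and is an assumption here, as is the helper name `format_example`; the card does not state how training examples were terminated.

```python
ALPACA_PROMPT = """
### Instruction:
{}

### Input:
{}

### Response:
{}"""

INSTRUCTION = "Translate this sentence from English to Faroese:"


def format_example(english, faroese, eos_token):
    """Build one training string from a sentence pair (hypothetical helper)."""
    # Assumption: the EOS token marks the end of the response so the
    # model learns where a translation stops.
    return ALPACA_PROMPT.format(INSTRUCTION, english, faroese) + eos_token


# "<EOS>" is a placeholder; in practice one would pass tokenizer.eos_token.
example = format_example("What's your name?", "Hvat eitur tú?", "<EOS>")
print(example)
```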

Training Hyperparameters

Epochs: 3 total, with an early stopping criterion monitoring validation loss.
Batch size: 2, with 4 gradient accumulation steps (an effective batch size of 8 per device)
Learning Rate: 2e-4
Optimizer: AdamW with a linear learning-rate scheduler and warm-up.
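
The arithmetic behind these settings can be sketched as follows. The batch size, accumulation steps, epochs, and peak learning rate come from the list above; the dataset size, the number of warm-up steps, and the single-device default are hypothetical values chosen only for illustration, since the card does not report them. The schedule follows the usual linear warm-up and linear decay shape, and the step count is an approximation of what a trainer would compute.

```python
import math

PER_DEVICE_BATCH = 2   # from the card
GRAD_ACCUM_STEPS = 4   # from the card
EPOCHS = 3             # from the card (upper bound, early stopping may end sooner)
PEAK_LR = 2e-4         # from the card


def effective_batch_size(num_devices=1):
    """Examples contributing to each optimizer update."""
    return PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * num_devices


def total_update_steps(num_examples, num_devices=1):
    """Approximate number of optimizer updates over all epochs."""
    steps_per_epoch = math.ceil(num_examples / effective_batch_size(num_devices))
    return steps_per_epoch * EPOCHS


def linear_lr(step, total_steps, warmup_steps):
    """Learning rate at a given update: linear warm-up, then linear decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / max(1, total_steps - warmup_steps)


# Hypothetical numbers: 80,000 training pairs and 100 warm-up steps.
total = total_update_steps(80_000)
for step in (0, 50, 100, total // 2, total):
    print(step, linear_lr(step, total, warmup_steps=100))
```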

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model was evaluated on the FLORES-200 benchmark, using 1,012 English–Faroese sentence pairs.

Metrics and Results

BLEU: 0.175 (0–1 scale)
chrF: 49.5 (0–100 scale)
BERTScore F1: 0.948

Human evaluation was also performed (see the paper).
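
For readers unfamiliar with chrF, the sketch below shows the idea behind the metric: character n-gram precision and recall are averaged over n-gram orders and combined into an F-score that weights recall more heavily (beta = 2). This is a simplified sentence-level illustration, not the implementation used to produce the scores above; standard tools aggregate statistics over the whole test set and differ in details, so numbers from this function are not comparable to the reported value. The function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring whitespace."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return 100 * (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


print(simple_chrf("Hvat eitur tú?", "Hvat eitur tú?"))  # identical strings score 100.0
print(simple_chrf("Hvat eitur tú", "Hvat eitur tú?"))   # a near match scores below 100
```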

Citation

[COMING SOON]

Framework versions

PEFT 0.13.0
