---
base_model: meta-llama/Meta-Llama-3.1-8B
library_name: peft
datasets:
- barbaroo/Sprotin_parallel
language:
- en
- fo
metrics:
- bleu
- chrf
- bertscore
pipeline_tag: text-generation
---



# Model Card: English–Faroese Translation Adapter

## Model Details

**Model Description**

- **Developed by:** Barbara Scalvini
- **Model type:** Language model adapter for **English → Faroese** translation  
- **Language(s):** English, Faroese  
- **License:** This adapter inherits the license from the original Llama 3.1 8B model.
- **Finetuned from model:** [meta-llama/Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B)  
- **Library used:** [PEFT 0.13.0](https://github.com/huggingface/peft)

### Model Sources 

- **Paper:** [COMING SOON]  
---

## Uses

### Direct Use
This adapter is intended to perform **English→Faroese** translation, leveraging a **parameter-efficient fine-tuning** (PEFT) approach.

### Downstream Use
- Can be integrated into broader **multilingual** or **localization** workflows.


### Out-of-Scope Use
- Any uses that rely on languages other than **English or Faroese** will likely yield suboptimal results.
- Other tasks (e.g., summarization, classification) may be unsupported or require further fine-tuning.

---

## Bias, Risks, and Limitations
- **Biases:** The model could reflect **biases** present in the training data, such as historical or societal biases in English or Faroese texts.
- **Recommendation:** Users should **critically evaluate** outputs, especially in sensitive or high-stakes applications.

---

## How to Get Started with the Model

```python
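from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Load the adapter and tokenizer from the Hugging Face Hub
# (loading an adapter repository this way requires `peft` to be installed)
checkpoint_dir = "barbaroo/llama3.1_translate_8B"  # Hub repository ID or local checkpoint directory
model = AutoModelForCausalLM.from_pretrained(
    checkpoint_dir,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit loading; requires `bitsandbytes`
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # padding=True below needs a pad token
MAX_SEQ_LENGTH = 512
sentences = ["What's your name?"]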
sentences = ["What's your name?"]

# Define the prompt template (same as in training)
alpaca_prompt = """
### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Inference loop
for sentence in sentences:
    inputs = tokenizer(
        [
            alpaca_prompt.format(
                "Translate this sentence from English to Faroese:",  # Instruction
                sentence,  # The input sentence to translate
                "",  # Leave blank for generation
            )
        ],
        return_tensors="pt",
        padding=True,
        truncation=True,  # Make sure the input is not too long
        max_length=MAX_SEQ_LENGTH  # Enforce the max length if necessary
    ).to(model.device)  # Follow the device chosen by device_map="auto"

    # Generate the translation
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,  # Limit the number of new tokens generated
        eos_token_id=tokenizer.eos_token_id,  # Stop at the EOS token
        pad_token_id=tokenizer.pad_token_id,  # Ensure padding token is used
        do_sample=True,  # Required for temperature and top_p to take effect
        temperature=0.1,  # Low temperature keeps sampling close to greedy decoding
        top_p=1.0,  # No nucleus truncation
        use_cache=True  # Use cache for efficiency
    )

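    # Decode only the newly generated tokens, skipping the echoed prompt
    prompt_length = inputs["input_ids"].shape[1]
    output_string = tokenizer.batch_decode(outputs[:, prompt_length:], skip_special_tokens=True)[0].strip()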
    print(f"Input: {sentence}")
    print(f"Generated Translation: {output_string}")
```


## Training Details

### Training Data

We used the Sprotin parallel corpus for **English–Faroese** translation: [barbaroo/Sprotin_parallel](https://huggingface.co/datasets/barbaroo/Sprotin_parallel). 


### Training Procedure

#### Preprocessing

- **Tokenization**: We used the tokenizer from the base model `meta-llama/Llama-3.1-8B`.
- **Prompt format**: Each example was rendered in the Alpaca prompt format, with Instruction, Input, and Response fields.
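
As a rough illustration of the prompt format, the sketch below fills the same template that appears in the getting-started snippet and pulls the translation back out of a generated string. The helper names `build_prompt` and `extract_response` are hypothetical and not part of the released code. Whether an end-of-sequence token was appended to the response during training is not stated here, so the sketch leaves it out.

```python
# Template copied from the inference example in this model card.
ALPACA_PROMPT = """
### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Instruction string used in the inference example.
INSTRUCTION = "Translate this sentence from English to Faroese:"


def build_prompt(sentence: str, response: str = "") -> str:
    """Fill the template (hypothetical helper).

    Leave `response` empty at inference time; a training example would
    presumably carry the reference translation in that slot.
    """
    return ALPACA_PROMPT.format(INSTRUCTION, sentence, response)


def extract_response(generated: str) -> str:
    """Return the text after the last '### Response:' marker (hypothetical helper).

    If the marker is absent, the whole string is returned, stripped.
    """
    marker = "### Response:"
    _, found, tail = generated.rpartition(marker)
    return tail.strip() if found else generated.strip()


if __name__ == "__main__":
    print(build_prompt("What's your name?"))
```

`extract_response` is useful when the decoded output still contains the echoed prompt, which is the case whenever the full generated sequence is decoded.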

#### Training Hyperparameters
 
- **Epochs**: **3** total, with an **early stopping** criterion monitoring validation loss.  
- **Batch Size**: **2**, with **4** gradient accumulation steps  
- **Learning Rate**: **2e-4** 
- **Optimizer**: **AdamW** with a linear learning-rate scheduler and warm-up.
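
To make the numbers above concrete, here is a minimal sketch of the effective batch size and of a linear schedule with warm-up. The batch size, accumulation steps, and peak learning rate come from the list above. The device count, the number of warm-up steps, and the total number of steps are not given in this card, so the values used here are placeholders chosen purely for illustration.

```python
PER_STEP_BATCH_SIZE = 2   # from the model card
GRAD_ACCUM_STEPS = 4      # from the model card
PEAK_LR = 2e-4            # from the model card
NUM_DEVICES = 1           # assumption: device count is not stated
WARMUP_STEPS = 100        # hypothetical: warm-up length is not stated
TOTAL_STEPS = 1000        # hypothetical: depends on dataset size and epochs


def effective_batch_size(batch_size: int, accum_steps: int, devices: int = 1) -> int:
    """Number of examples contributing to each optimizer update."""
    return batch_size * accum_steps * devices


def linear_lr(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warm-up from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    print("effective batch size:",
          effective_batch_size(PER_STEP_BATCH_SIZE, GRAD_ACCUM_STEPS, NUM_DEVICES))
    for step in (0, 50, 100, 550, 1000):
        print(step, linear_lr(step, PEAK_LR, WARMUP_STEPS, TOTAL_STEPS))
```

With a single device, each optimizer update sees 2 × 4 = 8 examples. The learning rate peaks at 2e-4 once warm-up ends and then decays linearly to zero.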

---

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- The model was evaluated on the **FLORES-200** benchmark, using about 1,012 English–Faroese sentence pairs.


#### Metrics and Results

- **BLEU**: **0.175**
- **chrF**: **49.5**
- **BERTScore F1**: **0.948**

Human evaluation was also performed (see the forthcoming paper).


## Citation

[COMING SOON]

---
## Framework versions 

- PEFT 0.13.0