---
license: cc-by-nc-4.0
datasets:
- projecte-aina/ES-OC_Parallel_Corpus
language:
- es
- oc
metrics:
- bleu
- chrf
library_name: transformers
base_model:
- facebook/nllb-200-distilled-600M
---

## Projecte Aina’s Spanish-Aranese machine translation model

## Model description

This model was created as part of the participation of the Language Technologies Unit at BSC in the WMT24 Shared Task:
[Translation into Low-Resource Languages of Spain](https://www2.statmt.org/wmt24/romance-task.html).
It is the result of fully fine-tuning the NLLB-200-600M model (`facebook/nllb-200-distilled-600M`) on a Spanish-Aranese corpus.
Specifically, we used the Hugging Face [transformers library](https://huggingface.co/docs/transformers/) and a filtered version
of the [Spanish-Aranese dataset](https://huggingface.co/datasets/projecte-aina/ES-OC_Parallel_Corpus) to fine-tune the model.
Because the original NLLB-200-600M does not support Aranese, we added a new language token ("arn_Latn") to enable translation into Aranese.
Language tags such as this one tell the model which source and target languages are involved in a translation.
The model was evaluated on the Flores+ evaluation datasets. Please refer to the [paper](__poner_link___) for more information.
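
The language-tag step described above can be sketched as follows. This is a minimal illustration, not the authors' training code: the helper name `add_language_token` is hypothetical, and it assumes a tokenizer and a model object that expose the Hugging Face methods `add_tokens`, `convert_tokens_to_ids`, and `resize_token_embeddings`.

```python
def add_language_token(tokenizer, model, lang_token="arn_Latn"):
    """Register a new language tag and make room for its embedding.

    Hypothetical helper: this card does not describe the exact procedure used.
    """
    # add_tokens returns how many of the given tokens were new to the vocabulary.
    num_added = tokenizer.add_tokens([lang_token], special_tokens=True)
    if num_added > 0:
        # Grow the embedding matrix so the new token id has a row to train.
        model.resize_token_embeddings(len(tokenizer))
    # Return the id that can later be used to request the new language.
    return tokenizer.convert_tokens_to_ids(lang_token)
```

With real objects, `tokenizer` and `model` would typically come from `AutoTokenizer.from_pretrained` and `AutoModelForSeq2SeqLM.from_pretrained` on `facebook/nllb-200-distilled-600M`. The new embedding row carries no Aranese-specific training at this point, which is why fine-tuning on parallel data is still required.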

## Intended uses and limitations

You can use this model for machine translation from Spanish to Aranese.
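
A minimal usage sketch with the transformers library might look like the following. Several parts are assumptions rather than facts from this card: `MODEL_ID` is a hypothetical placeholder to be replaced with this repository's identifier, `spa_Latn` is the standard NLLB code for Spanish, and the sketch assumes the published tokenizer resolves the "arn_Latn" tag to a token id.

```python
from typing import List

# Hypothetical placeholder: replace with the identifier of this model repository.
MODEL_ID = "path-or-hub-id-of-this-model"
SRC_LANG = "spa_Latn"  # NLLB language code for Spanish
TGT_LANG = "arn_Latn"  # Aranese tag added during fine-tuning (see Model description)


def translate(sentences: List[str], model_id: str = MODEL_ID,
              max_new_tokens: int = 128) -> List[str]:
    """Translate Spanish sentences into Aranese (sketch, see assumptions above)."""
    # Imported lazily so the constants can be inspected without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang=SRC_LANG)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    inputs = tokenizer(sentences, return_tensors="pt", padding=True)
    generated = model.generate(
        **inputs,
        # Force the decoder to start with the target-language tag.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(TGT_LANG),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Calling `translate(["Bienvenidos al Valle de Arán."])` would download the model on first use and return a list with one Aranese string.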

## Limitations and bias

At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future and will update this model card if that research is completed.

## Evaluation

### Variables and metrics

We use the BLEU and ChrF scores to evaluate the model on the [Flores+](https://github.com/openlanguagedata/flores) evaluation datasets.
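
To make the character-level metric concrete, the sketch below computes a simplified sentence-level chrF in plain Python: character n-gram precision and recall are averaged over orders 1 to 6 and combined into an F-score with beta = 2. The function names, the whitespace handling, and the sentence-level (rather than corpus-level) aggregation are assumptions made for illustration. This is not the scoring setup behind the results reported in this card, and an established toolkit such as sacreBLEU should be used for comparable numbers.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace (an assumption of this sketch)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_order: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            # One side is too short for this order; skip it.
            continue
        # Clipped overlap: each n-gram counts at most as often as in the reference.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

A hypothesis identical to its reference scores 100, a hypothesis sharing no characters with it scores 0, and partial matches fall in between.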

### Evaluation results

Below are the evaluation results for machine translation from Spanish to Aranese, compared to [Apertium](https://www.apertium.org/) and [Softcatala](https://www.softcatala.org/traductor/) (cascading through Catalan):

| Test set (BLEU) | Apertium | Softcatala | Our model |
|:----------------|:---------|:-----------|:----------|
| Flores dev      | 48.96    | 34.43      | **55.50** |
| Flores devtest  | 28.85    | 26.07      | **30.12** |

| Test set (ChrF) | Apertium | Softcatala | Our model |
|:----------------|:---------|:-----------|:----------|
| Flores dev      | 72.63    | 58.61      | **76.04** |
| Flores devtest  | 49.42    | 48.29      | **50.05** |

## Additional information

### Paper

For further information, please refer to the [paper](__poner_link___) published for the Shared Task: Translation into Low-Resource Languages of Spain (WMT24).

### Author

The Language Technologies Unit from Barcelona Supercomputing Center.

### Contact

For further information, please send an email to <[email protected]>.

### Copyright

Copyright (c) 2024 by Language Technologies Unit, Barcelona Supercomputing Center.

### License

[CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/)

### Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and Plan de Recuperación, Transformación y Resiliencia - Funded by EU – NextGenerationEU within the framework of the project ILENIA with reference 2022/TL22/00215337, 2022/TL22/00215336, 2022/TL22/00215335, 2022/TL22/00215334.

The publication is part of the project PID2021-123988OB-C33, funded by MCIN/AEI/10.13039/501100011033/FEDER, EU.

### Disclaimer

<details>
<summary>Click to expand</summary>

The model published in this repository is intended for a generalist purpose and is available to third parties under a CC BY-NC 4.0 license.

Be aware that the model may have biases and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it)
or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and,
in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the model (Barcelona Supercomputing Center)
be liable for any results arising from the use made by third parties.

</details>