Fairseq
Spanish
Catalan
fdelucaf committed on
Commit 419c193 · verified · 1 Parent(s): a06f56c

Update README.md

  1. README.md +31 -45
README.md CHANGED
@@ -3,39 +3,17 @@ license: apache-2.0
3
  language:
4
  - es
5
  - ca
6
  ---
7
  ## Aina Project's Spanish-Catalan machine translation model
8
-
9
- ## Table of Contents
10
- - [Model Description](#model-description)
11
- - [Intended Uses and Limitations](#intended-use)
12
- - [How to Use](#how-to-use)
13
- - [Training](#training)
14
- - [Training data](#training-data)
15
- - [Training procedure](#training-procedure)
16
- - [Data Preparation](#data-preparation)
17
- - [Tokenization](#tokenization)
18
- - [Hyperparameters](#hyperparameters)
19
- - [Evaluation](#evaluation)
20
- - [Variable and Metrics](#variable-and-metrics)
21
- - [Evaluation Results](#evaluation-results)
22
- - [Additional Information](#additional-information)
23
- - [Author](#author)
24
- - [Contact Information](#contact-information)
25
- - [Copyright](#copyright)
26
- - [Licensing Information](#licensing-information)
27
- - [Funding](#funding)
28
- - [Disclaimer](#disclaimer)
29
 
30
  ## Model description
31
 
32
- This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Spanish datasets,
33
- up to 92 million sentences. Additionally, the model was evaluated on several public datasets pertaining to 5 different domains, specifically:
34
- general,
35
- administrative,
36
- technology,
37
- biomedical,
38
- and news.
39
 
40
  ## Intended uses and limitations
41
 
@@ -55,7 +33,7 @@ Translate a sentence using python
55
  import ctranslate2
56
  import pyonmttok
57
  from huggingface_hub import snapshot_download
58
- model_dir = snapshot_download(repo_id="projecte-aina/mt-aina-es-ca", revision="main")
59
 
60
  tokenizer=pyonmttok.Tokenizer(mode="none", sp_model_path = model_dir + "/spm.model")
61
  tokenized=tokenizer.tokenize("Bienvenido al Proyecto Aina!")
@@ -65,6 +43,10 @@ translated = translator.translate_batch([tokenized[0]])
65
  print(tokenizer.detokenize(translated[0][0]['tokens']))
66
  ```
67
 
68
  ## Training
69
 
70
  ### Training data
@@ -139,7 +121,7 @@ We use the BLEU score for evaluation on following test sets:
139
  [TaCon](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/),
140
  [United Nations](https://zenodo.org/record/3888414#.Y33-_tLMIW0),
141
  [Cybersecurity](https://elrc-share.eu/repository/browse/cyber-mt-test-set/2bd93faab98c11ec9c1a00155d026706b96a490ed3e140f0a29a80a08c46e91e/),
142
- [wmt19 biomedical test set](),
143
  [wmt13 news test set](https://elrc-share.eu/repository/browse/catalan-wmt2013-machine-translation-shared-task-test-set/84a96139b98611ec9c1a00155d0267061a0aa1b62e2248e89aab4952f3c230fc/)
144
 
145
  ### Evaluation results
@@ -147,7 +129,7 @@ We use the BLEU score for evaluation on following test sets:
147
  Below are the evaluation results on the machine translation from Spanish to Catalan
148
  compared to [Softcatalà](https://www.softcatala.org/) and [Google Translate](https://translate.google.es/?hl=es):
149
 
150
- | Test set | SoftCatalà | Google Translate | mt-aina-es-ca |
151
  |----------------------|------------|------------------|---------------|
152
  | Spanish Constitution | **63,6** | 61,7 | 63,0 |
153
  | United Nations | 73,8 | 74,8 | **74,9** |
@@ -162,30 +144,34 @@ compared to [Softcatalà](https://www.softcatala.org/) and [Google Translate](ht
162
  ## Additional information
163
 
164
  ### Author
165
- Language Technologies Unit (LangTech) at the Barcelona Supercomputing Center.
166
 
167
- ### Contact information
168
- For further information, please send an email to [email protected].
169
 
170
  ### Copyright
171
- Language Technologies Unit at Barcelona Supercomputing Center (2023).
172
 
173
-
174
- ### Licensing Information
175
- This work is licensed under an [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
176
 
177
  ### Funding
178
  This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).
179
 
180
- ## Limitations and Bias
181
- At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
182
-
183
- ## Disclaimer
184
 
185
  <details>
186
  <summary>Click to expand</summary>
187
 
188
- The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.
189
- When third parties, deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models) or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
190
- In no event shall the owner and creator of the models (BSC – Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties of these models.
191
  </details>
 
3
  language:
4
  - es
5
  - ca
6
+ metrics:
7
+ - bleu
8
+ library_name: fairseq
9
  ---
10
  ## Aina Project's Spanish-Catalan machine translation model
11
 
12
  ## Model description
13
 
14
+ This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Spanish datasets,
15
+ up to 92 million sentences. Additionally, the model is evaluated on several public datasets comprising 5 different domains (general, administrative, technology,
16
+ biomedical, and news).
17
 
18
  ## Intended uses and limitations
19
 
 
33
  import ctranslate2
34
  import pyonmttok
35
  from huggingface_hub import snapshot_download
36
+ model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-es-ca", revision="main")
37
 
38
  tokenizer=pyonmttok.Tokenizer(mode="none", sp_model_path = model_dir + "/spm.model")
39
  tokenized=tokenizer.tokenize("Bienvenido al Proyecto Aina!")
 
43
  print(tokenizer.detokenize(translated[0][0]['tokens']))
44
  ```
45
 
46
+ ## Limitations and bias
47
+ At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
48
+ However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
49
+
50
  ## Training
51
 
52
  ### Training data
 
121
  [TaCon](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/),
122
  [United Nations](https://zenodo.org/record/3888414#.Y33-_tLMIW0),
123
  [Cybersecurity](https://elrc-share.eu/repository/browse/cyber-mt-test-set/2bd93faab98c11ec9c1a00155d026706b96a490ed3e140f0a29a80a08c46e91e/),
124
+ [wmt19 biomedical test set](http://www.statmt.org/wmt19/biomedical-translation-task.html),
125
  [wmt13 news test set](https://elrc-share.eu/repository/browse/catalan-wmt2013-machine-translation-shared-task-test-set/84a96139b98611ec9c1a00155d0267061a0aa1b62e2248e89aab4952f3c230fc/)
126
 
127
  ### Evaluation results
 
129
  Below are the evaluation results on the machine translation from Spanish to Catalan
130
  compared to [Softcatalà](https://www.softcatala.org/) and [Google Translate](https://translate.google.es/?hl=es):
131
 
132
+ | Test set | SoftCatalà | Google Translate | aina-translator-es-ca |
133
  |----------------------|------------|------------------|---------------|
134
  | Spanish Constitution | **63,6** | 61,7 | 63,0 |
135
  | United Nations | 73,8 | 74,8 | **74,9** |
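
The scores in this table use a decimal comma, and the best score in each row is wrapped in bold markers. As a minimal sketch, assuming the two rows visible in this diff are copied into a string, the snippet below parses them into floats and reports the best system per test set. The helper names (`parse_score`, `parse_rows`, `best_system`) are hypothetical and are not part of the repository.

```python
# Rows copied from the table above. Scores use a decimal comma, and
# bold markers (**) flag the best score in each row.
ROWS = """
| Spanish Constitution | **63,6** | 61,7 | 63,0 |
| United Nations | 73,8 | 74,8 | **74,9** |
"""

# Column order taken from the table header.
SYSTEMS = ["SoftCatalà", "Google Translate", "aina-translator-es-ca"]


def parse_score(cell: str) -> float:
    """Convert a cell such as '**63,6**' into the float 63.6."""
    return float(cell.strip().strip("*").replace(",", "."))


def parse_rows(text: str) -> dict[str, dict[str, float]]:
    """Map each test set name to its per-system BLEU scores."""
    results = {}
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        name, scores = cells[0], cells[1:]
        results[name] = dict(zip(SYSTEMS, map(parse_score, scores)))
    return results


def best_system(scores: dict[str, float]) -> str:
    """Return the system with the highest score for one test set."""
    return max(scores, key=scores.get)


if __name__ == "__main__":
    for test_set, scores in parse_rows(ROWS).items():
        print(test_set, "->", best_system(scores))
```

Parsing the comma decimals up front avoids a `ValueError` from calling `float("63,6")` directly.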
 
144
  ## Additional information
145
 
146
  ### Author
147
+ The Language Technologies Unit from Barcelona Supercomputing Center.
148
 
149
+ ### Contact
150
+ For further information, please send an email to <[email protected]>.
151
 
152
  ### Copyright
153
+ Copyright(c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.
154
 
155
+ ### License
156
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
 
157
 
158
  ### Funding
159
  This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).
160
 
161
+ ### Disclaimer
162
 
163
  <details>
164
  <summary>Click to expand</summary>
165
 
166
+ The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.
167
+
168
+ Be aware that the model may have biases and/or any other undesirable distortions.
169
+
170
+ When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it)
171
+ or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and,
172
+ in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
173
+
174
+ In no event shall the owner and creator of the model (Barcelona Supercomputing Center)
175
+ be liable for any results arising from the use made by third parties.
176
+
177
  </details>