Update README.md
README.md
CHANGED
@@ -89,7 +89,7 @@ The model was trained on a combination of the following datasets:
 
 ### Data preparation
 
-All datasets
+All datasets were concatenated and filtered using the [mBERT Gencata parallel filter](https://huggingface.co/projecte-aina/mbert-base-gencata)
 and cleaned using the clean-corpus-n.pl script from [moses](https://github.com/moses-smt/mosesdecoder), allowing sentences between 5 and 150 words.
 
 Before training, the punctuation was normalized using a modified version of the join-single-file.py script from
@@ -132,7 +132,7 @@ Weights were saved every 1000 updates and reported results are the average of th
 
 ## Evaluation
 
-###
+### Variables and metrics
 
 We use the BLEU score for evaluation on following test sets:
 [Flores-101](https://github.com/facebookresearch/flores),
@@ -168,14 +168,14 @@ Language Technologies Unit (LangTech) at the Barcelona Supercomputing Center.
 For further information, please send an email to [email protected].
 
 ### Copyright
-
+Language Technologies Unit at Barcelona Supercomputing Center (2023).
 
 
 ### Licensing Information
-This work is licensed under
+This work is licensed under an [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
 
 ### Funding
-This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project]
+This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).
 
 ## Disclaimer
 