	---
	language: "ca"
	tags:
	- masked-lm
	- RoBERTa-base-ca-v2
	- catalan
	widget:
	- text: "El Català és una llengua molt <mask>."
	- text: "Salvador Dalí va viure a <mask>."
	- text: "La Costa Brava té les millors <mask> d'Espanya."
	- text: "El cacaolat és un batut de <mask>."
	- text: "<mask> és la capital de la Garrotxa."
	- text: "Vaig al <mask> a buscar bolets."
	- text: "Antoni Gaudí va ser un <mask> molt important per la ciutat."
	- text: "Catalunya és una referència en <mask> a nivell europeu."
	license: apache-2.0
	---

	## Model description

	RoBERTa-ca-v2 is a transformer-based masked language model for the Catalan language.
	It is based on the [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) base model
	and has been trained on a medium-size corpus collected from publicly available corpora and crawlers.

	## Tokenization and pretraining

	The training corpus was tokenized using the byte-level version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2)
	used in the original [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model, with a vocabulary size of 52,000 tokens.
	RoBERTa-ca-v2 was pretrained with a masked language modelling objective, following the approach employed for the RoBERTa base model
	with the same hyperparameters as in the original work.
	Training lasted a total of 48 hours on 16 NVIDIA V100 GPUs with 16 GB of memory each.
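
To illustrate the byte-level BPE idea described above, here is a minimal toy sketch. It is not the tokenizer used for this model: the function names (`to_byte_tokens`, `merge_pair`, `learn_merges`) are hypothetical, and the sketch skips the pre-tokenization and byte-to-character mapping that GPT-2-style implementations apply. It only shows why working on UTF-8 bytes keeps characters such as `ç` or `à` in vocabulary, and how frequent adjacent pairs are merged into longer tokens.

```python
from collections import Counter


def to_byte_tokens(text):
    """Split text into single-byte tokens using its UTF-8 encoding."""
    return [bytes([b]) for b in text.encode("utf-8")]


def most_frequent_pair(tokens):
    """Return the most frequent adjacent token pair, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with one concatenated token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(text, num_merges):
    """Greedily learn up to `num_merges` merges on a single toy string."""
    tokens = to_byte_tokens(text)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:  # fewer than two tokens left, nothing to merge
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges


# Toy usage: accented characters start as two bytes and may be merged later.
tokens, merges = learn_merges("el català és una llengua", 10)
```

A real vocabulary is learned over the whole training corpus rather than one string, and merging continues until the target vocabulary size is reached (52,000 tokens here).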

	## Training corpora and preprocessing

	The training corpus consists of several corpora gathered from web crawling and public corpora.


	| Corpus                   | Size in GB |
	|--------------------------|------------|
	| BNE-ca                   | 13.00      |
	| Wikipedia                | 1.10       |
	| DOGC                     | 0.78       |
	| Catalan Open Subtitles   | 0.02       |
	| Catalan Oscar            | 4.00       |
	| CaWaC                    | 3.60       |
	| Cat. General Crawling    | 2.50       |
	| Cat. Government Crawling | 0.24       |
	| ACN                      | 0.42       |
	| Padicat                  | 0.63       |
	| RacoCatalá               | 8.10       |
	| Nació Digital            | 0.42       |
	| Vilaweb                  | 0.06       |
	| Tweets                   | 0.02       |
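
The card does not state the overall corpus size, so the short sketch below adds up the sizes listed in the table and computes each source's share. The helper names (`total_size_gb`, `shares`, `largest`) are hypothetical and exist only for this illustration; the numbers are copied from the table.

```python
# Sizes in GB, copied from the table above.
CORPUS_SIZES_GB = {
    "BNE-ca": 13.00,
    "Wikipedia": 1.10,
    "DOGC": 0.78,
    "Catalan Open Subtitles": 0.02,
    "Catalan Oscar": 4.00,
    "CaWaC": 3.60,
    "Cat. General Crawling": 2.50,
    "Cat. Government Crawling": 0.24,
    "ACN": 0.42,
    "Padicat": 0.63,
    "RacoCatalá": 8.10,
    "Nació Digital": 0.42,
    "Vilaweb": 0.06,
    "Tweets": 0.02,
}


def total_size_gb(sizes):
    """Sum of all listed corpus sizes, in GB."""
    return sum(sizes.values())


def shares(sizes):
    """Percentage of the total contributed by each corpus."""
    total = total_size_gb(sizes)
    return {name: 100 * gb / total for name, gb in sizes.items()}


def largest(sizes, n=3):
    """Names of the n largest corpora, biggest first."""
    return sorted(sizes, key=sizes.get, reverse=True)[:n]


for name, pct in sorted(shares(CORPUS_SIZES_GB).items(), key=lambda kv: -kv[1]):
    print(f"{name:26s} {pct:5.1f}%")
print(f"Total: {total_size_gb(CORPUS_SIZES_GB):.2f} GB")
```

The listed sizes add up to about 34.89 GB, with BNE-ca, RacoCatalá and Catalan Oscar as the three largest sources.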

	## Evaluation

	### CLUB benchmark

	The model has been fine-tuned on the downstream tasks of the Catalan Language Understanding Evaluation benchmark (CLUB),
	which was created along with the BERTa model.

	It contains the following tasks and their related datasets:

	1. Part-of-Speech Tagging (POS)

	Catalan-AnCora: taken from the [Universal Dependencies treebank](https://github.com/UniversalDependencies/UD_Catalan-AnCora) of the well-known AnCora corpus

	2. Named Entity Recognition (NER)

	[AnCora Catalan 2.0.0](https://zenodo.org/record/4762031#.YKaFjqGxWUk): named entities extracted from the original [AnCora](https://doi.org/10.5281/zenodo.4762030) version,
	with some unconventional ones, such as book titles, filtered out, and transcribed into a standard CoNLL-IOB format

	3. Text Classification (TC)

	[TeCla](https://doi.org/10.5281/zenodo.4627197): consisting of 137k news pieces from the Catalan News Agency ([ACN](https://www.acn.cat/)) corpus

	4. Semantic Textual Similarity (STS)

	[Catalan semantic textual similarity](https://doi.org/10.5281/zenodo.4529183): consisting of more than 3,000 sentence pairs annotated with their semantic similarity,
	scraped from the [Catalan Textual Corpus](https://doi.org/10.5281/zenodo.4519349)

	5. Question Answering (QA)

	[ViquiQuAD](https://doi.org/10.5281/zenodo.4562344): consisting of more than 15,000 questions outsourced from Catalan Wikipedia, randomly chosen from a set of 596 articles that were originally written in Catalan.

	[XQuAD](https://doi.org/10.5281/zenodo.4526223): the Catalan translation of XQuAD, a multilingual collection of manual translations of 1,190 question-answer pairs from English Wikipedia, used only as a _test set_.

	Here are the train/dev/test splits of the datasets:

	| Task (Dataset) | Total   | Train   | Dev    | Test   |
	|:---------------|:--------|:--------|:-------|:-------|
	| NER (Ancora)   | 13,581  | 10,628  | 1,427  | 1,526  |
	| POS (Ancora)   | 16,678  | 13,123  | 1,709  | 1,846  |
	| STS            | 3,073   | 2,073   | 500    | 500    |
	| TC (TeCla)     | 137,775 | 110,203 | 13,786 | 13,786 |
	| QA (ViquiQuAD) | 14,239  | 11,255  | 1,492  | 1,429  |
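
As a quick sanity check on the table above, the following sketch computes the train/dev/test proportions and verifies whether the three splits add up to the listed total. The helper name `split_report` is hypothetical; the numbers are copied from the table. As listed, the QA (ViquiQuAD) splits sum to 14,176 rather than 14,239, and the card does not explain the difference.

```python
# (total, train, dev, test) per task, copied from the table above.
SPLITS = {
    "NER (Ancora)": (13581, 10628, 1427, 1526),
    "POS (Ancora)": (16678, 13123, 1709, 1846),
    "STS": (3073, 2073, 500, 500),
    "TC (TeCla)": (137775, 110203, 13786, 13786),
    "QA (ViquiQuAD)": (14239, 11255, 1492, 1429),
}


def split_report(total, train, dev, test):
    """Return split percentages and the gap between the total and the split sum."""
    split_sum = train + dev + test
    return {
        "train_pct": 100 * train / split_sum,
        "dev_pct": 100 * dev / split_sum,
        "test_pct": 100 * test / split_sum,
        "difference": total - split_sum,  # 0 when the row is self-consistent
    }


for task, row in SPLITS.items():
    r = split_report(*row)
    print(
        f"{task:16s} train {r['train_pct']:.1f}%  dev {r['dev_pct']:.1f}%  "
        f"test {r['test_pct']:.1f}%  total - sum = {r['difference']}"
    )
```

All rows use roughly an 80/10/10 partition except STS, which keeps fixed dev and test sets of 500 pairs each.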

	### Results

	| Model              | NER (F1) | POS (F1) | STS (Pearson) | TC (accuracy) | QA (ViquiQuAD) (F1/EM) | QA (XQuAD) (F1/EM) |
	|:-------------------|:--------:|:--------:|:-------------:|:-------------:|:----------------------:|:------------------:|
	| RoBERTa-base-ca-v2 | 89.84    | 99.07    | 79.98         | 83.41         | 88.04/74.65            | 71.50/53.41        |
	| BERTa              | 88.13    | 98.97    | 79.73         | 74.16         | 86.97/72.29            | 68.89/48.87        |
	| mBERT              | 86.38    | 98.82    | 76.34         | 70.56         | 86.97/72.22            | 67.15/46.51        |
	| XLM-RoBERTa        | 87.66    | 98.89    | 75.40         | 71.68         | 85.50/70.47            | 67.10/46.42        |
	| WikiBERT-ca        | 77.66    | 97.60    | 77.18         | 73.22         | 85.45/70.75            | 65.21/36.60        |
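
The QA columns above report F1 and exact match (EM). The card does not say which evaluation script produced them, so the sketch below is only an assumption of how such scores are commonly computed in SQuAD-style evaluation: EM compares normalized answer strings, and F1 measures token overlap between the predicted and reference answers. The normalization used here (lowercasing, stripping ASCII punctuation, whitespace splitting) is a simplification, and all function names are hypothetical.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def exact_match(prediction, reference):
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # With an empty side, score 1.0 only when both are empty.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def corpus_scores(pairs):
    """Average F1 and EM (as percentages) over (prediction, reference) pairs."""
    f1 = sum(token_f1(p, r) for p, r in pairs) / len(pairs)
    em = sum(exact_match(p, r) for p, r in pairs) / len(pairs)
    return 100 * f1, 100 * em
```

Because F1 gives partial credit for overlapping tokens while EM does not, F1 is never lower than EM under this definition, which is consistent with the paired values in the table.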

	## Intended uses & limitations
	The model is ready to use only for masked language modelling, that is, for the Fill Mask task (try the inference API).
	However, it is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification, or Named Entity Recognition.
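
A minimal sketch of the Fill Mask usage mentioned above, assuming the Hugging Face `transformers` library is installed and that the model is published on the Hub as `projecte-aina/roberta-base-ca-v2`. That identifier is an assumption, since this card does not state it, and the helper names `load_fill_mask` and `summarize` are hypothetical.

```python
# Assumption: Hub identifier of this model; it is not stated in the card.
MODEL_ID = "projecte-aina/roberta-base-ca-v2"


def load_fill_mask(model_id=MODEL_ID):
    """Build a fill-mask pipeline (needs `transformers` plus a backend such as PyTorch)."""
    # Imported lazily so that `summarize` can be used without the library installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def summarize(predictions, top_k=3):
    """Return (token, score) pairs for the best-scoring candidates.

    `predictions` is the list of dicts a fill-mask pipeline returns for an
    input with a single mask; each dict carries `token_str` and `score` keys.
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in ranked[:top_k]]


# Usage (downloads the model weights on the first call):
#   unmasker = load_fill_mask()
#   print(summarize(unmasker("Vaig al <mask> a buscar bolets.")))
```

For the downstream tasks listed above, the same checkpoint would typically be loaded with a task-specific head and fine-tuned on labelled data rather than used through this pipeline.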


	## Funding
	This work was funded by the Generalitat de Catalunya within the framework of the AINA language technologies plan.