BERTić-COMtext-SR-legal-lemma-ijekavica
BERTić-COMtext-SR-legal-lemma-ijekavica is a variant of the BERTić model, fine-tuned on the task of lemmatization tag prediction in Serbian legal texts written in the Ijekavian pronunciation. The model was fine-tuned for 20 epochs on the Ijekavian variant of the COMtext.SR.legal dataset.
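Below is a minimal inference sketch, assuming the model is published as a standard Hugging Face token-classification checkpoint under ICEF-NLP/bcms-bertic-comtext-sr-legal-lemma-ijekavica; the example sentence is illustrative, and the meaning of the predicted labels follows the string edit tag scheme described in the Benchmarking section.

```python
# Minimal inference sketch; the label set (string edit tags) and their
# interpretation depend on the fine-tuning setup described in this card.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "ICEF-NLP/bcms-bertic-comtext-sr-legal-lemma-ijekavica"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Gold-tokenized input; one tag per word, read off the first subword.
words = ["Okrivljeni", "je", "podnio", "žalbu", "na", "presudu", "."]
encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits
predicted_ids = logits.argmax(dim=-1)[0].tolist()

word_ids = encoding.word_ids(batch_index=0)
seen = set()
for token_index, word_index in enumerate(word_ids):
    if word_index is None or word_index in seen:
        continue
    seen.add(word_index)
    tag = model.config.id2label[predicted_ids[token_index]]
    print(words[word_index], tag)
```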
Benchmarking
This model was evaluated on the task of lemmatizing Serbian legal texts. Lemmatization was performed by applying the predicted string edit tags to the word forms, as described in the accompanying JTDH 2024 paper.
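The exact edit tag inventory is defined in the paper and the COMtext.SR repository. Purely as an illustration, the sketch below assumes a hypothetical suffix-replacement tag format (e.g. `SFX|om|` meaning "drop the word-final suffix -om"); the real scheme may differ.

```python
def apply_edit_tag(word: str, tag: str) -> str:
    """Apply a hypothetical suffix-replacement edit tag to obtain a lemma.

    The actual tag inventory used by the model follows the paper; this
    only illustrates how lemmas can be derived from predicted string edits.
    """
    if tag == "KEEP":
        return word  # the word form is already the lemma
    _, old_suffix, new_suffix = tag.split("|")
    stem = word[: -len(old_suffix)] if old_suffix else word
    return stem + new_suffix

# Hypothetical tag "SFX|om|" drops the suffix "-om":
print(apply_edit_tag("zakonom", "SFX|om|"))  # -> "zakon"
```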
The model was compared to previous lemmatization approaches that relied on the hrLex inflectional lexicon:
- The CLASSLA library
- A variant of BERTić fine-tuned for MSD prediction using the SETimes.SR 2.0 corpus of newswire texts
- A variant of BERTić fine-tuned for MSD prediction using the COMtext.SR.legal corpus of legal texts
- SrBERTa, a model specially trained on Serbian legal texts, fine-tuned for MSD prediction using the COMtext.SR.legal corpus of legal texts
Accuracy was used as the evaluation metric, and gold-tokenized text was taken as input. All of the fine-tuned neural models were trained for 15 epochs. CLASSLA and BERTić-SETimes were directly tested on the entire COMtext.SR.legal.ijekavica corpus, while BERTić-COMtext-SR-legal-MSD-ijekavica, BERTić-COMtext-SR-legal-lemma-ijekavica, and SrBERTa were fine-tuned and evaluated on the COMtext.SR.legal.ijekavica corpus using 10-fold cross-validation.
The code and data to run these experiments are available in the COMtext.SR GitHub repository.
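Assuming accuracy here means token-level exact match between predicted and gold lemmas, a minimal sketch of the metric is:

```python
def lemma_accuracy(gold_lemmas, predicted_lemmas):
    # Token-level accuracy: share of tokens whose predicted lemma
    # exactly matches the gold lemma (exact string match assumed).
    assert len(gold_lemmas) == len(predicted_lemmas)
    correct = sum(g == p for g, p in zip(gold_lemmas, predicted_lemmas))
    return correct / len(gold_lemmas)

print(lemma_accuracy(["zakon", "sud"], ["zakon", "sudu"]))  # -> 0.5
```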
Results
| Model | Lemma ACC |
|---|---|
| CLASSLA-SR | 0.9036 |
| CLASSLA-HR | 0.9353 |
| BERTić-SETimes.SR | 0.9412 |
| BERTić-COMtext-SR-legal-MSD-ijekavica | 0.9429 |
| SrBERTa | 0.9187 |
| BERTić-COMtext-SR-legal-lemma-ijekavica | 0.9833 |
Model: ICEF-NLP/bcms-bertic-comtext-sr-legal-lemma-ijekavica
Base model: classla/bcms-bertic