XLM RoBERTa for Czech+English Extractive Question Answering

This is the XLM-RoBERTa-large model with a head for extractive question answering, trained on a combination of the English SQuAD 1.1 and Czech SQAD 3.0 question answering datasets. For Czech SQAD 3.0, the original contexts (whole Wikipedia pages) were shortened to fit RoBERTa's context window, which excluded about 3% of the samples.

Intended uses & limitations

This model is intended to extract the segment of a given context that contains the answer to a given question (extractive question answering) in English and Czech. Given the fine-tuning on two languages and the good zero-shot cross-lingual transfer reported for other fine-tuned XLM-RoBERTa models, the model will likely work on other languages as well, with some loss in quality.

Note that, despite its size, English SQuAD has a variety of reported biases (see, e.g., L. Mikula (2022), Chap. 4.1).

Usage

Here is how to use this model to answer a question about a given context using 🤗 Transformers in PyTorch:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("gaussalgo/xlm-roberta-large_extractive-QA_en-cs")
model = AutoModelForQuestionAnswering.from_pretrained("gaussalgo/xlm-roberta-large_extractive-QA_en-cs")

context = """
Podle slovenského lidového podání byl Juro Jánošík obdařen magickými předměty (kouzelná valaška, čarovný opasek),
které mu dodávaly nadpřirozené schopnosti. Okrádal především šlechtice,
trestal panské dráby a ze svého lupu vyděloval část pro chudé, tedy bohatým bral a chudým dával.
"""
question = "Jaké schopnosti daly magické předměty Juro Jánošíkovi?"

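inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)
# Most likely start and end token indices of the answer span
start_position = outputs.start_logits[0].argmax()
end_position = outputs.end_logits[0].argmax()
# The predicted end index is inclusive, so extend the slice by one token
answer_ids = inputs["input_ids"][0][start_position:end_position + 1]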

print("Answer:")
print(tokenizer.decode(answer_ids))
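
Taking the two argmax values independently can yield an end position that precedes the start position, which produces an empty answer. The following is a minimal sketch of one common remedy, not part of the original model card: it searches for the highest-scoring span with start <= end. The helper name `best_span` and the `max_answer_len` parameter are hypothetical, and the sketch does not mask out question or special tokens.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return (start, end), both inclusive, maximizing
    start_logits[start] + end_logits[end] subject to start <= end
    and a span of at most max_answer_len tokens."""
    best_score = float("-inf")
    best = (0, 0)
    for start, start_logit in enumerate(start_logits):
        # Only consider end positions at or after the start position.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Toy logits where independent argmax gives start=1, end=0 (an invalid span).
print(best_span([0.0, 5.0, 1.0, 4.9], [6.0, 0.0, 5.0, 0.0]))
```

With the model above, the logits could be passed as `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`, and the answer tokens would then be `inputs["input_ids"][0][start:end + 1]`.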

Training

The model has been trained using Adaptor library v0.1.5, in parallel on both Czech and English data, with the following parameters:

training_arguments = AdaptationArguments(output_dir="train_dir",
                                         learning_rate=1e-5,
                                         stopping_strategy=StoppingStrategy.ALL_OBJECTIVES_CONVERGED,
                                         do_train=True,
                                         do_eval=True,
                                         warmup_steps=1000,
                                         max_steps=100000,
                                         gradient_accumulation_steps=30,
                                         eval_steps=100,
                                         logging_steps=10,
                                         save_steps=1000,
                                         num_train_epochs=30,
                                         evaluation_strategy="steps")
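
To illustrate how `learning_rate`, `warmup_steps` and `max_steps` interact, here is a minimal sketch of a linear warmup followed by linear decay. It assumes that training used the default linear scheduler of 🤗 Transformers' `TrainingArguments`; the model card does not state which scheduler was used, and the function name `linear_schedule_lr` is hypothetical.

```python
def linear_schedule_lr(step: int, peak_lr: float = 1e-5,
                       warmup_steps: int = 1000,
                       max_steps: int = 100_000) -> float:
    """Learning rate at an optimizer step under linear warmup and linear decay.

    Assumption: a linear schedule; the scheduler is not named in the model card.
    """
    if step < warmup_steps:
        # Warmup: rise linearly from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from peak_lr to 0 at max_steps.
    remaining = max(0, max_steps - step)
    return peak_lr * remaining / max(1, max_steps - warmup_steps)


for step in (0, 500, 1000, 50_500, 100_000):
    print(step, linear_schedule_lr(step))
```

Under this assumption, the rate peaks at 1e-5 after 1000 steps. Because the stopping strategy can end training once all objectives have converged, the run may stop before the rate decays to zero.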

You can find the full training script in train_roberta_extractive_qa.py. It is reproducible once the Czech SQAD data has been preprocessed with parse_czech_squad.py.
