import gradio as gr
from transformers import pipeline

title = "Automatic Readability Assessment of Texts in Spanish"

description = """
Is a text **complex** or **simple**? Can it be understood by someone learning Spanish with a **basic**, **intermediate** or **advanced** knowledge of the language? Find out with our models below!
"""

article = """
|
|
|
### What's Readability Assessment?

[Automatic Readability Assessment](https://arxiv.org/abs/2105.00973) consists of determining how difficult a piece of text is to read and understand.
Difficulty can be estimated with readability formulas, such as [Flesch for English](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests) or [similar ones for Spanish](https://www.siicsalud.com/imagenes/blancopet1.pdf).
However, their dependence on surface statistics (e.g. average sentence length) makes them unreliable.
Models that estimate a text's readability by "looking beyond the surface" are therefore needed.
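
To make the "surface statistics" point concrete, here is a minimal sketch of the Flesch Reading Ease formula for English. The function name and the example counts are illustrative and not part of our models. The score depends only on average sentence length and average syllables per word, so it cannot tell a clear long sentence from a confusing one.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    # Flesch Reading Ease for English: higher scores mean easier text.
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease(100, 5, 150)
print(round(score, 1))
```

For these counts the score is about 59.6. Two texts with the same counts get the same score regardless of vocabulary or syntax.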
|
|
|
### Goal

We aim to contribute to the development of **neural models for readability assessment for Spanish**, following previous work for [English](https://aclanthology.org/2021.cl-1.6/) and [Filipino](https://aclanthology.org/2021.ranlp-1.69/).

### Dataset

We curated a new dataset that combines an existing readability assessment corpus ([Newsela](https://newsela.com/data)) with texts scraped from webpages aimed at learners of Spanish as a second language. Texts in the Newsela corpus are annotated with the grade level (in the US educational system) that they were written for. For the scraped texts, we selected webpages that explicitly indicate the [CEFR](https://en.wikipedia.org/wiki/Common_European_Framework_of_Reference_for_Languages) level of each text.
|
|
|
Each text has two readability labels, according to the following mapping:

| Source         | 2-class: *Simple* | 2-class: *Complex* | 3-class: *Basic* | 3-class: *Intermediate* | 3-class: *Advanced* |
|----------------|-------------------|--------------------|------------------|-------------------------|---------------------|
| CEFR levels    | A1, A2, B1        | B2, C1, C2         | A1, A2           | B1, B2                  | C1, C2              |
| Newsela corpus | Versions 3-4      | Versions 0-1       | Grade levels 2-5 | Grade levels 6-8        | Grade levels 9-12   |
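
The mapping in the table can be written as a small lookup, sketched below. The label strings and function names are illustrative, and the released models may use different label names.

```python
# CEFR level -> label, following the table above.
CEFR_TO_2CLASS = {
    "A1": "simple", "A2": "simple", "B1": "simple",
    "B2": "complex", "C1": "complex", "C2": "complex",
}
CEFR_TO_3CLASS = {
    "A1": "basic", "A2": "basic",
    "B1": "intermediate", "B2": "intermediate",
    "C1": "advanced", "C2": "advanced",
}


def newsela_grade_to_3class(grade):
    # Newsela grade level (2-12) -> 3-class label.
    if 2 <= grade <= 5:
        return "basic"
    if 6 <= grade <= 8:
        return "intermediate"
    if 9 <= grade <= 12:
        return "advanced"
    raise ValueError(f"grade level outside the mapped range: {grade}")


def newsela_version_to_2class(version):
    # Newsela version -> 2-class label; versions not listed in the table return None.
    if version in (3, 4):
        return "simple"
    if version in (0, 1):
        return "complex"
    return None
```

Note that a B1 text is *Simple* in the 2-class scheme but *Intermediate* in the 3-class scheme.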
|
|
|
|
|
Texts in the dataset can also be too long to fit in a model's input. We therefore created two versions of the dataset, dividing each text into [sentences](https://huggingface.co/datasets/hackathon-pln-es/readability-es-sentences) and [paragraphs](https://huggingface.co/datasets/hackathon-pln-es/readability-es-paragraphs).
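
As a rough illustration of this kind of splitting, the sketch below breaks a text into paragraphs and sentences with simple rules. It is not the preprocessing we used to build the datasets, and the function names are hypothetical. The sentence rule is naive and will also split after abbreviations.

```python
import re


def split_paragraphs(text):
    # Group consecutive non-empty lines into one paragraph each.
    paragraphs, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def split_sentences(paragraph):
    # Split after ., ! or ? when followed by one or more spaces.
    parts = re.split("(?<=[.!?]) +", paragraph)
    return [p for p in parts if p]


print(split_sentences("Hace mucho tiempo no había peces. Ahora hay muchos."))
```

Each resulting sentence or paragraph would inherit the readability label of the text it came from, which is an assumption of this sketch.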
|
|
|
We also scraped several texts from the ["Corpus de Aprendices del Español" (CAES)](http://galvan.usc.es/caes/). However, due to time constraints, we leave experiments with it for future work. The data is available [here](https://huggingface.co/datasets/hackathon-pln-es/readability-es-caes).

### Models

Our models are based on [BERTIN](https://huggingface.co/bertin-project). We fine-tuned [bertin-roberta-base-spanish](https://huggingface.co/bertin-project/bertin-roberta-base-spanish) on the different versions of our collected dataset. The following models are available:

- [2-class sentence-level](https://huggingface.co/hackathon-pln-es/readability-es-sentences)
- [2-class paragraph-level](https://huggingface.co/hackathon-pln-es/readability-es-paragraphs)
- [3-class sentence-level](https://huggingface.co/hackathon-pln-es/readability-es-3class-sentences)
- [3-class paragraph-level](https://huggingface.co/hackathon-pln-es/readability-es-3class-paragraphs)

More details about how we trained these models can be found in our [report](https://wandb.ai/readability-es/readability-es/reports/Texts-Readability-Analysis-for-Spanish--VmlldzoxNzU2MDUx).
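
The checkpoints listed above can be queried with the `transformers` pipeline. The sketch below mirrors the call this demo uses. The helper names are illustrative, loading a checkpoint downloads its weights from the Hub, and newer releases of `transformers` may warn that `return_all_scores` is deprecated.

```python
def load_classifier(model_name="hackathon-pln-es/readability-es-sentences"):
    # Imported lazily so the helpers below work without transformers installed.
    from transformers import pipeline
    return pipeline("sentiment-analysis", model=model_name, return_all_scores=True)


def top_label(predicted_scores):
    # predicted_scores is a list of dicts with "label" and "score" keys.
    best = max(predicted_scores, key=lambda e: e["score"])
    return best["label"], best["score"]


def classify(text, classifier):
    # The pipeline returns one list of label scores per input text.
    return top_label(classifier(text)[0])
```

For example, `classify("Esta es una frase simple.", load_classifier())` returns the most probable label and its score.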
|
|
|
### Team

- [Laura Vásquez-Rodríguez](https://lmvasque.github.io/)
- [Pedro Cuenca](https://twitter.com/pcuenq/)
- [Sergio Morales](https://www.fireblend.com/)
- [Fernando Alva-Manchego](https://feralvam.github.io/)

"""
|
|
|
# Each example pairs an input text with the radio option to preselect.
examples = [
    ["Esta es una frase simple.", "simple or complex?"],
    ["La ciencia nos enseña, en efecto, a someter nuestra razón a la verdad y a conocer y juzgar las cosas como son, es decir, como ellas mismas eligen ser y no como quisiéramos que fueran.", "simple or complex?"],
    ["Las Líneas de Nazca son una serie de marcas trazadas en el suelo, cuya anchura oscila entre los 40 y los 110 centímetros.", "basic, intermediate, or advanced?"],
    ["Hace mucho tiempo, en el gran océano que baña las costas del Perú no había peces.", "basic, intermediate, or advanced?"],
    ["El turismo en Costa Rica es uno de los principales sectores económicos y de más rápido crecimiento del país.", "basic, intermediate, or advanced?"],
]
|
|
|
|
|
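# Both checkpoints are text classifiers; "sentiment-analysis" is the pipeline alias for text classification.
model_binary = pipeline("sentiment-analysis", model="hackathon-pln-es/readability-es-sentences", return_all_scores=True)
model_ternary = pipeline("sentiment-analysis", model="hackathon-pln-es/readability-es-3class-paragraphs", return_all_scores=True)


def predict(text, levels):
    """Return a {label: score} dict for the selected readability scheme."""
    # `levels` is the index of the selected radio option: 0 = 2-class, 1 = 3-class.
    if levels == 0:
        predicted_scores = model_binary(text)[0]
    else:
        predicted_scores = model_ternary(text)[0]

    # Convert the list of {"label", "score"} dicts into the mapping expected by the Label output.
    output_scores = {}
    for e in predicted_scores:
        output_scores[e['label']] = e['score']

    return output_scores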
|
|
|
|
|
iface = gr.Interface(
    fn=predict,
    inputs=[
        gr.inputs.Textbox(lines=7, placeholder="Write a text in Spanish or choose one of the examples below.", label="Text in Spanish"),
        gr.inputs.Radio(choices=["simple or complex?", "basic, intermediate, or advanced?"], type="index", label="Readability Levels"),
    ],
    outputs=[
        gr.outputs.Label(num_top_classes=3, label="Predicted Readability Level")
    ],
    theme="huggingface",
    title=title,
    description=description,
    article=article,
    examples=examples,
    allow_flagging="never",
)

iface.launch()