	---
	license: mit
	language:
	- multilingual
	- af
	- am
	- ar
	- as
	- az
	- be
	- bg
	- bn
	- br
	- bs
	- ca
	- cs
	- cy
	- da
	- de
	- el
	- en
	- eo
	- es
	- et
	- eu
	- fa
	- fi
	- fr
	- fy
	- ga
	- gd
	- gl
	- gu
	- ha
	- he
	- hi
	- hr
	- hu
	- hy
	- id
	- is
	- it
	- ja
	- jv
	- ka
	- kk
	- km
	- kn
	- ko
	- ku
	- ky
	- la
	- lo
	- lt
	- lv
	- mg
	- mk
	- ml
	- mn
	- mr
	- ms
	- my
	- ne
	- nl
	- 'no'
	- om
	- or
	- pa
	- pl
	- ps
	- pt
	- ro
	- ru
	- sa
	- sd
	- si
	- sk
	- sl
	- so
	- sq
	- sr
	- su
	- sv
	- sw
	- ta
	- te
	- th
	- tl
	- tr
	- ug
	- uk
	- ur
	- uz
	- vi
	- xh
	- yi
	- zh
	datasets:
	- agentlans/en-translations
	base_model:
	- agentlans/multilingual-e5-small-aligned
	pipeline_tag: text-classification
	tags:
	- multilingual
	- readability-assessment
	---

	# multilingual-e5-small-aligned-readability

	This model is a fine-tuned version of [agentlans/multilingual-e5-small-aligned](https://huggingface.co/agentlans/multilingual-e5-small-aligned) designed for assessing text readability across multiple languages.

	## Key Features

	- Multilingual support
	- Readability assessment for text
	- Based on the multilingual E5 small architecture

	## Intended Uses & Limitations

	This model is intended for:
	- Assessing the readability of multilingual text
	- Filtering multilingual content
	- Comparing the readability of corpus text across languages

	Limitations:
	- Performance may vary for languages not well-represented in the training data
	- Should not be used as the sole criterion for readability assessment

	## Usage Example

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch
	import numpy as np

	model_name = "agentlans/multilingual-e5-small-aligned-readability"

	# Initialize tokenizer and model
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	model = model.to(device)

	def readability(text):
	    """Assess the readability of the input text."""
	    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
	    with torch.no_grad():
	        # A single string yields one float; a list of strings yields a list of floats
	        logits = model(**inputs).logits.squeeze().cpu()
	    return logits.tolist()

	# Grade level conversion function
	# Input: readability value (a single score or a list of scores)
	# Output: the equivalent U.S. education grade level
	# The score is negated below, so higher scores map to lower grade levels
	def grade_level(y):
	    lambda_, mean, sd = 0.8766912, 7.908629, 3.339119
	    y_unstd = -np.asarray(y, dtype=float) * sd + mean
	    return np.power((y_unstd * lambda_ + 1), (1 / lambda_))

	# Example
	input_text = "The mitochondria is the powerhouse of the cell."
	readability_score = readability(input_text)
	grade = grade_level(readability_score)
	print(f"Predicted score: {readability_score:.2f}\nGrade: {grade:.1f}")
	```
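The constants in `grade_level` suggest that the score is a negated, standardized value that is then passed through an inverse Box-Cox transform. Under that reading (an assumption, not something the model card states), the conversion can be inverted to turn a target grade level into a score threshold. The helper `score_from_grade` below is a hypothetical name introduced for this sketch.

```python
import numpy as np

# Constants copied from the grade_level function in the usage example.
LAMBDA, MEAN, SD = 0.8766912, 7.908629, 3.339119

def grade_level(score):
    """Readability score -> U.S. grade level (same formula as the usage example)."""
    y_unstd = -np.asarray(score, dtype=float) * SD + MEAN
    return np.power(y_unstd * LAMBDA + 1, 1 / LAMBDA)

def score_from_grade(grade):
    """U.S. grade level -> readability score (algebraic inverse of grade_level).

    Hypothetical helper: only valid for positive grade levels.
    """
    transformed = (np.power(np.asarray(grade, dtype=float), LAMBDA) - 1) / LAMBDA
    return (MEAN - transformed) / SD

# grade_level decreases as the score increases, so scores above this
# threshold correspond to grade 8 or below.
threshold = score_from_grade(8.0)
print(f"Score threshold for grade 8: {threshold:.2f}")
```

Because `np.power` returns `nan` for a negative base with a fractional exponent, `grade_level` is only meaningful while `y_unstd * LAMBDA + 1` stays positive, which means very high scores fall outside the range of the conversion.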

	## Performance Results

	The model was evaluated on a diverse set of multilingual text samples:

	- 10 English text samples of varying readability were translated into Arabic, Chinese, French, Russian, and Spanish.
	- The model assigned similar readability scores to the same text regardless of the language it was written in.

	<details>
	<summary>Click here for the 10 original texts and their translations.</summary>

	| Text | English | French | Spanish | Chinese | Russian | Arabic |
	|---|---|---|---|---|---|---|
	| A | In a world increasingly dominated by technology, the delicate balance between human connection and digital interaction has become a focal point of contemporary discourse. | Dans un monde de plus en plus dominé par la technologie, l’équilibre délicat entre la connexion humaine et l’interaction numérique est devenu un point central du discours contemporain. | En un mundo cada vez más dominado por la tecnología, el delicado equilibrio entre la conexión humana y la interacción digital se ha convertido en un punto focal del discurso contemporáneo. | 在一个日益受技术主导的世界里，人际联系和数字互动之间的微妙平衡已经成为当代讨论的焦点。 | В мире, где все больше доминируют технологии, тонкий баланс между человеческими связями и цифровым взаимодействием стал центральным вопросом современного дискурса. | في عالم تهيمن عليه التكنولوجيا بشكل متزايد، أصبح التوازن الدقيق بين التواصل البشري والتفاعل الرقمي نقطة محورية في الخطاب المعاصر. |
	| B | Despite the challenges they faced, the team remained resolute in their pursuit of excellence and innovation. | Malgré les défis auxquels elle a été confrontée, l’équipe est restée déterminée dans sa quête de l’excellence et de l’innovation. | A pesar de los desafíos que enfrentaron, el equipo se mantuvo firme en su búsqueda de la excelencia y la innovación. | 尽管面临挑战，该团队仍然坚定地追求卓越和创新。 | Несмотря на трудности, с которыми пришлось столкнуться, команда сохраняла решимость в своем стремлении к совершенству и инновациям. | وعلى الرغم من التحديات التي واجهوها، ظل الفريق مصمماً على سعيه لتحقيق التميز والابتكار. |
	| C | As the storm approached, the sky turned a deep shade of gray, casting an eerie shadow over the landscape. | À l’approche de la tempête, le ciel prenait une teinte grise profonde, projetant une ombre étrange sur le paysage. | A medida que se acercaba la tormenta, el cielo se tornó de un gris profundo, proyectando una sombra inquietante sobre el paisaje. | 随着暴风雨的临近，天空变成了深灰色，给大地投下了一层阴森的阴影。 | По мере приближения шторма небо приобрело глубокий серый оттенок, отбрасывая на пейзаж жуткую тень. | ومع اقتراب العاصفة، تحولت السماء إلى لون رمادي غامق، مما ألقى بظلال مخيفة على المشهد الطبيعي. |
	| D | After a long day at work, he finally relaxed with a cup of tea. | Après une longue journée de travail, il s'est enfin détendu avec une tasse de thé. | Después de un largo día de trabajo, finalmente se relajó con una taza de té. | 工作了一天之后，他终于可以喝杯茶放松一下了。 | После долгого рабочего дня он наконец расслабился за чашкой чая. | بعد يوم طويل في العمل، استرخى أخيرًا مع كوب من الشاي. |
	| E | The quick brown fox jumps over the lazy dog. | Le renard brun rapide saute par-dessus le chien paresseux. | El rápido zorro marrón salta sobre el perro perezoso. | 这只敏捷的棕色狐狸跳过了那只懒狗。 | Быстрая бурая лиса перепрыгивает через ленивую собаку. | يقفز الثعلب البني السريع فوق الكلب الكسول. |
	| F | She enjoys reading books in her free time. | Elle aime lire des livres pendant son temps libre. | A ella le gusta leer libros en su tiempo libre. | 她喜欢在空闲时间读书。 | В свободное время она любит читать книги. | إنها تستمتع بقراءة الكتب في وقت فراغها. |
	| G | The sun is shining brightly today. | Le soleil brille fort aujourd'hui. | Hoy el sol brilla intensamente. | 今天阳光明媚。 | Сегодня ярко светит солнце. | الشمس مشرقة بقوة اليوم. |
	| H | Birds are singing in the trees. | Les oiseaux chantent dans les arbres. | Los pájaros cantan en los árboles. | 鸟儿在树上唱歌。 | Птицы поют на деревьях. | الطيور تغرد في الأشجار. |
	| I | The cat is on the mat. | Le chat est sur le tapis. | El gato está sobre la alfombra. | 猫在垫子上。 | Кот на коврике. | القطة على الحصيرة. |
	| J | I like to eat apples. | J'aime manger des pommes. | Me gusta comer manzanas. | 我喜欢吃苹果。 | Я люблю есть яблоки. | أنا أحب أكل التفاح. |

	</details>

	<img src="Readability.svg" alt="Scatterplot of predicted readability scores grouped by text sample and language" width="100%"/>
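One way to put a number on cross-language consistency is to look at how much the scores for a single text vary across its translations. The sketch below is a minimal illustration, not the evaluation used for the plot: `score_spread` is a hypothetical helper, and the scores in `made_up_scores` are placeholder values invented for the example, not model outputs.

```python
import statistics

def score_spread(scores_by_language):
    """Return (mean, population standard deviation) of one text's scores across languages."""
    values = list(scores_by_language.values())
    return statistics.fmean(values), statistics.pstdev(values)

# Placeholder numbers invented for illustration; they are NOT model outputs.
made_up_scores = {
    "text_x": {"en": 1.0, "fr": 1.5, "es": 0.5},
    "text_y": {"en": -1.0, "fr": -1.0, "es": -1.0},
}

for name, per_language in made_up_scores.items():
    mean, spread = score_spread(per_language)
    print(f"{name}: mean={mean:.2f}, spread={spread:.2f}")
```

A spread that is small relative to the gap between the means of different texts would indicate that the ranking of texts by readability is stable across languages.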

	## Training Data

	The model was trained on the [Multilingual Parallel Sentences dataset](https://huggingface.co/datasets/agentlans/en-translations), which includes:

	- Parallel sentences in English and various other languages
	- Semantic similarity scores calculated using LaBSE
	- Additional readability metrics
	- Sources: JW300, Europarl, TED Talks, OPUS-100, Tatoeba, Global Voices, and News Commentary

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 5e-05
	- train_batch_size: 128
	- eval_batch_size: 8
	- seed: 42
	- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
	- lr_scheduler_type: linear
	- num_epochs: 3.0
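To make the `linear` scheduler setting concrete, the sketch below computes the learning rate at a given step under linear decay to zero. It assumes no warmup, since no warmup setting is listed above, and takes the total step count of 23439 from the training results table; `linear_lr` is a hypothetical helper written for this illustration rather than the scheduler implementation used in training.

```python
def linear_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

TOTAL_STEPS = 23439  # final step reported in the training results table

# Learning rate at the start, midpoint, and end of training.
for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, linear_lr(step, TOTAL_STEPS))
```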

	### Training results

	| Training Loss | Epoch | Step  | Validation Loss | MSE    |
	|:-------------:|:-----:|:-----:|:---------------:|:------:|
	| 0.1484 | 1.0 | 7813 | 0.1324 | 0.1324 |
	| 0.1157 | 2.0 | 15626 | 0.1241 | 0.1241 |
	| 0.096 | 3.0 | 23439 | 0.1234 | 0.1234 |
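A few quantities can be derived from these numbers with simple arithmetic. The sketch below rests on assumptions the card does not confirm: one optimizer step per batch of 128 on a single device with no gradient accumulation, and validation targets on the same standardized scale that the `sd` constant in `grade_level` refers to.

```python
import math

steps_per_epoch = 7813   # step count after epoch 1 in the results table
train_batch_size = 128   # from the hyperparameter list
final_mse = 0.1234       # validation MSE after epoch 3
sd = 3.339119            # standard deviation constant from grade_level

# Assumption: one optimizer step per batch, with the last partial batch kept.
max_examples = steps_per_epoch * train_batch_size
min_examples = (steps_per_epoch - 1) * train_batch_size + 1

# Root-mean-square error in standardized units, then rescaled by sd.
rmse = math.sqrt(final_mse)
rmse_unstandardized = rmse * sd

print(f"Training examples: between {min_examples} and {max_examples}")
print(f"RMSE: {rmse:.3f} standardized, {rmse_unstandardized:.2f} after rescaling")
```

Note that the rescaled error is still on the Box-Cox transformed scale, so it is not directly a number of grade levels.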


	### Framework versions

	- Transformers 4.46.3
	- PyTorch 2.5.1+cu124
	- Datasets 3.1.0
	- Tokenizers 0.20.3