	---
	library_name: transformers
	tags:
	- mt5-small
	- fine-tuning
	- chinese
	- braille
	---
	# MT5-Small-FT-Chinese-Braille
	<p align="center">
	📃 <a href="https://arxiv.org/" target="_blank">[Paper]</a> • 💻 <a href="https://github.com/AlanYWu/ChineseBrailleTranslation" target="_blank">[GitHub]</a> • 🤗 <a href="https://huggingface.co/datasets/Violet-yo/Chinese-Braille-Dataset-10per-Tone" target="_blank">[Dataset]</a> • ⚙️ <a href="https://huggingface.co/Violet-yo/mt5-small-ft-Chinese-Braille" target="_blank">[Model]</a> • 🎬 <a href="https://vision-braille.com/" target="_blank">[Demo]</a>
	</p>

	This model is `mt5-small` fine-tuned on the [`Chinese-Braille-10per-Tone`](https://huggingface.co/datasets/Violet-yo/Chinese-Braille-Dataset-10per-Tone) dataset to translate Chinese Braille into Chinese text. The training code is available in the [GitHub repository](https://github.com/AlanYWu/ChineseBrailleTranslation).

	## Inference
	```python
	import evaluate
	from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

	braille_text = "⠼⠓⠙⠁⠃⠉⠊ ⠓⠶⠞⠼ ⠚⠴⠺ ⠤ ⠘ ⠌⠢ ⠛⠊ ⠝⠩ ⠳⠬ ⠊⠓⠑ ⠛⠕⠛⠫ ⠵⠪ ⠵⠼⠛⠫ ⠟⠥⠅⠷⠐ ⠊⠛⠡ ⠃⠔ ⠌⠲⠛⠕ ⠛⠩⠱⠖ ⠙⠢ ⠟⠥⠅⠷⠇⠭ ⠃⠥⠟⠲ ⠱⠦⠇⠪⠐ ⠙⠧⠱ ⠃⠡ ⠍⠮⠳ ⠙⠖ ⠛⠕⠱⠼ ⠙⠢ ⠟⠼⠙⠥ ⠐⠆"
	ground_truth = "841239\t黄腾认为： “ 这几年由于一些国家在增加出口，已经把中国减少的出口量补充上来，但是并没有到过剩的程度。\n"
	model = AutoModelForSeq2SeqLM.from_pretrained("Violet-yo/mt5-small-ft-Chinese-Braille")
	tokenizer = AutoTokenizer.from_pretrained("Violet-yo/mt5-small-ft-Chinese-Braille", use_fast=False)

	inputs = tokenizer(
	    braille_text, return_tensors="pt", max_length=280, padding=True, truncation=True
	)
	output_sequences = model.generate(
	    input_ids=inputs["input_ids"],
	    attention_mask=inputs["attention_mask"],
	    max_new_tokens=300,
	    num_beams=5,
	)
	translated_text = tokenizer.decode(output_sequences[0], skip_special_tokens=True)
	print(f"{translated_text=}")
	print(f"{ground_truth=}")
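	# Load SacreBLEU by name; pass a local path such as "models/metrics/sacrebleu" instead if you keep a local copy of the metric.
	metric = evaluate.load("sacrebleu")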
	results = metric.compute(predictions=[translated_text], references=[[ground_truth]])
	print(f"{results=}")
	```

	The output should be:
	```text
	translated_text='841239 黄腾认为 : “ 这几年由于一些国家在增加出口, 已经把中国减少的出口量补充上来, 但是并没有到过剩的程度。'
	ground_truth='841239\t黄腾认为： “ 这几年由于一些国家在增加出口，已经把中国减少的出口量补充上来，但是并没有到过剩的程度。\n'
	results={'score': 74.00206257221929, 'counts': [29, 25, 21, 17], 'totals': [32, 31, 30, 29], 'precisions': [90.625, 80.64516129032258, 70.0, 58.62068965517241], 'bp': 1.0, 'sys_len': 32, 'ref_len': 32}
	```
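The reported `score` can be checked by hand from the other fields in `results`. The sketch below applies the standard BLEU formula (brevity penalty times the geometric mean of the 1- to 4-gram precisions) to the numbers printed above. It only re-does the final arithmetic and does not reproduce SacreBLEU's tokenization or n-gram counting.

```python
import math

# Values copied from the `results` dictionary printed above.
counts = [29, 25, 21, 17]   # matched 1- to 4-grams
totals = [32, 31, 30, 29]   # 1- to 4-grams in the prediction
sys_len, ref_len = 32, 32   # tokenized lengths of prediction and reference

# Modified n-gram precisions, expressed as percentages.
precisions = [100.0 * c / t for c, t in zip(counts, totals)]

# Brevity penalty: 1.0 unless the prediction is shorter than the reference.
bp = 1.0 if sys_len >= ref_len else math.exp(1.0 - ref_len / sys_len)

# BLEU = brevity penalty * geometric mean of the n-gram precisions.
score = bp * math.exp(sum(math.log(p) for p in precisions) / len(precisions))

print(f"{precisions=}")
print(f"{bp=}")
print(f"{score=:.4f}")
```

Because the prediction and the reference have the same tokenized length, the brevity penalty is 1.0 and the score is determined by the n-gram precisions alone.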

	Note: no fast tokenizer is provided. We added special tokens, and the fast tokenizer outputs `<UNK>` tokens, so always set `use_fast=False` when loading the tokenizer.
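To translate several Braille lines in one call, the inference steps above can be wrapped in a small helper. This is a minimal sketch: `translate_batch` is a hypothetical name, not part of the repository, and its defaults simply reuse the values from the example (`max_length=280`, `max_new_tokens=300`, `num_beams=5`). It expects a `model` and a slow `tokenizer` already loaded as shown in the Inference section.

```python
def translate_batch(braille_lines, model, tokenizer,
                    max_length=280, max_new_tokens=300, num_beams=5):
    """Translate a list of Braille strings with a single generate() call.

    Hypothetical helper: `model` and `tokenizer` are the objects loaded in
    the Inference example (tokenizer loaded with use_fast=False).
    """
    if not braille_lines:
        return []

    # Pad to the longest line in the batch and truncate overly long inputs.
    inputs = tokenizer(
        braille_lines,
        return_tensors="pt",
        max_length=max_length,
        padding=True,
        truncation=True,
    )

    # The attention mask tells the model which positions are padding.
    output_sequences = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        num_beams=num_beams,
    )

    # Decode every sequence in the batch, dropping pad/eos tokens.
    return tokenizer.batch_decode(output_sequences, skip_special_tokens=True)
```

With the objects from the Inference example, `translate_batch([braille_text], model, tokenizer)[0]` should correspond to `translated_text` there.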

	## Resources
	- Homepage: [Vision-Braille](https://vision-braille.com/)
	- Repository: [GitHub](https://github.com/AlanYWu/ChineseBrailleTranslation)
	- Paper: [arXiv](https://arxiv.org/)
	- Hugging Face: [Dataset](https://huggingface.co/datasets/Violet-yo/Chinese-Braille-Dataset-10per-Tone), [Model](https://huggingface.co/Violet-yo/mt5-small-ft-Chinese-Braille)
	- [Full Tone Dataset](https://huggingface.co/datasets/Violet-yo/Chinese-Braille-Dataset-Full-Tone)
	- [No Tone Dataset](https://huggingface.co/datasets/Violet-yo/Chinese-Braille-Dataset-No-Tone)