---
library_name: transformers
tags:
- mt5-small
- fine-tuning
- chinese
- braille
---

# MT5-Small-FT-Chinese-Braille

<p align="center">
📃 <a href="https://arxiv.org/" target="_blank">[Paper]</a> • 💻 <a href="https://github.com/AlanYWu/ChineseBrailleTranslation" target="_blank">[Github]</a> • 🤗 <a href="https://huggingface.co/datasets/Violet-yo/Chinese-Braille-Dataset-10per-Tone" target="_blank">[Dataset]</a> • ⚙️ <a href="https://huggingface.co/Violet-yo/mt5-small-ft-Chinese-Braille" target="_blank">[Model]</a> • 🎬 <a href="https://vision-braille.com/" target="_blank">[Demo]</a>
</p>

This model is `mt5-small` fine-tuned on the [`Chinese-Braille-Dataset-10per-Tone`](https://huggingface.co/datasets/Violet-yo/Chinese-Braille-Dataset-10per-Tone) dataset. The training code is available in the [GitHub repository](https://github.com/AlanYWu/ChineseBrailleTranslation).

## Inference

```python
import evaluate
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Braille input and its reference translation (sentence ID, tab, then the text).
braille_text = "⠼⠓⠙⠁⠃⠉⠊ ⠓⠶⠞⠼ ⠚⠴⠺ ⠤ ⠘ ⠌⠢ ⠛⠊ ⠝⠩ ⠳⠬ ⠊⠓⠑ ⠛⠕⠛⠫ ⠵⠪ ⠵⠼⠛⠫ ⠟⠥⠅⠷⠐ ⠊⠛⠡ ⠃⠔ ⠌⠲⠛⠕ ⠛⠩⠱⠖ ⠙⠢ ⠟⠥⠅⠷⠇⠭ ⠃⠥⠟⠲ ⠱⠦⠇⠪⠐ ⠙⠧⠱ ⠃⠡ ⠍⠮⠳ ⠙⠖ ⠛⠕⠱⠼ ⠙⠢ ⠟⠼⠙⠥ ⠐⠆"
ground_truth = "841239\t黄腾 认为 : “ 这 几 年 由于 一些 国家 在 增加 出口 , 已经 把 中国 减少 的 出口量 补充 上来 , 但是 并 没有 到 过剩 的 程度 。\n"

# Load the fine-tuned model and the slow tokenizer (use_fast=False is required).
model = AutoModelForSeq2SeqLM.from_pretrained("Violet-yo/mt5-small-ft-Chinese-Braille")
tokenizer = AutoTokenizer.from_pretrained("Violet-yo/mt5-small-ft-Chinese-Braille", use_fast=False)

# Tokenize the Braille input, truncating to at most 280 tokens.
inputs = tokenizer(
    braille_text, return_tensors="pt", max_length=280, padding=True, truncation=True
)

# Decode with beam search.
output_sequences = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=300,
    num_beams=5,
)
translated_text = tokenizer.decode(output_sequences[0], skip_special_tokens=True)
print(f"{translated_text=}")
print(f"{ground_truth=}")

# Score the translation with SacreBLEU, loaded by name from the Hub;
# pass a path instead if you keep a local copy of the metric script.
metric = evaluate.load("sacrebleu")
results = metric.compute(predictions=[translated_text], references=[[ground_truth]])
print(f"{results=}")
```

The output should be:

```text
translated_text='841239 黄腾 认为 : “ 这 几 年 由于 一些 国家 在 增加 出口, 已经 把 中国 减少 的 出口量 补充 上来, 但是 并 没有 到 过剩 的 程度 。'
ground_truth='841239\t黄腾 认为 : “ 这 几 年 由于 一些 国家 在 增加 出口 , 已经 把 中国 减少 的 出口量 补充 上来 , 但是 并 没有 到 过剩 的 程度 。\n'
results={'score': 74.00206257221929, 'counts': [29, 25, 21, 17], 'totals': [32, 31, 30, 29], 'precisions': [90.625, 80.64516129032258, 70.0, 58.62068965517241], 'bp': 1.0, 'sys_len': 32, 'ref_len': 32}
```
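
To see where the prediction and the reference in this example diverge, a token-level diff can help. The sketch below is not part of the original evaluation code: it uses only the standard library and assumes plain whitespace tokenization, which may differ from the tokenizer SacreBLEU applies internally.

```python
import difflib

# Strings copied from the example output above.
translated_text = "841239 黄腾 认为 : “ 这 几 年 由于 一些 国家 在 增加 出口, 已经 把 中国 减少 的 出口量 补充 上来, 但是 并 没有 到 过剩 的 程度 。"
ground_truth = "841239\t黄腾 认为 : “ 这 几 年 由于 一些 国家 在 增加 出口 , 已经 把 中国 减少 的 出口量 补充 上来 , 但是 并 没有 到 过剩 的 程度 。\n"

# Assumption: split on whitespace only (tabs and newlines included).
ref_tokens = ground_truth.split()
pred_tokens = translated_text.split()

# Collect every span where the two token sequences disagree.
matcher = difflib.SequenceMatcher(a=ref_tokens, b=pred_tokens, autojunk=False)
diffs = [
    (tag, ref_tokens[i1:i2], pred_tokens[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]
for tag, ref_span, pred_span in diffs:
    print(f"{tag}: reference {ref_span} -> prediction {pred_span}")

# True when both strings contain the same characters once whitespace is ignored.
same_ignoring_whitespace = "".join(ref_tokens) == "".join(pred_tokens)
print(f"{same_ignoring_whitespace=}")
```

For these two strings the differences are confined to the two commas: the prediction attaches each comma to the preceding word, while the reference separates it with a space.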

We do not provide a fast tokenizer: we added special tokens, and the fast tokenizer outputs `<UNK>` tokens. Set `use_fast=False` when loading the tokenizer.
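
One way to confirm that a tokenizer handles Braille input without falling back to the unknown token is to count unknown ids in its output. The helper below is a hypothetical sketch, not part of the repository; it assumes a `transformers`-style tokenizer that is callable on a string, returns `input_ids`, and exposes `unk_token_id`.

```python
def count_unknown_tokens(tokenizer, text):
    """Return how many token ids produced for `text` equal the unknown-token id.

    `tokenizer` is assumed to behave like a transformers tokenizer:
    calling it on a string yields a mapping with "input_ids", and it
    has an `unk_token_id` attribute.
    """
    input_ids = tokenizer(text)["input_ids"]
    return sum(1 for token_id in input_ids if token_id == tokenizer.unk_token_id)


# Usage with the objects from the inference snippet (requires the model download):
# print(count_unknown_tokens(tokenizer, braille_text))
```

A result of zero for the slow tokenizer would suggest every Braille cell in the input was mapped to a known token.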

## Resources

- Homepage: [Vision-Braille](https://vision-braille.com/)
- Repository: [Github](https://github.com/AlanYWu/ChineseBrailleTranslation)
- Paper: [arXiv](https://arxiv.org/)
- HuggingFace: [Dataset](https://huggingface.co/datasets/Violet-yo/Chinese-Braille-Dataset-10per-Tone), [Model](https://huggingface.co/Violet-yo/mt5-small-ft-Chinese-Braille)
- [Full Tone Dataset](https://huggingface.co/datasets/Violet-yo/Chinese-Braille-Dataset-Full-Tone)
- [No Tone Dataset](https://huggingface.co/datasets/Violet-yo/Chinese-Braille-Dataset-No-Tone)