---
license: cc-by-sa-4.0
language:
- pl
library_name: transformers
---

This model removes noise from text.
It is intended for text that picks up noise while being loaded, for example text extracted from PDF files.

The model was trained on Polish texts. The training dataset was noised automatically.

[allegro/plt5-base](https://huggingface.co/allegro/plt5-base) was used as the base model.
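
To make the idea of an automatically noised dataset concrete, here is a minimal sketch of how clean text could be corrupted to form (noisy, clean) training pairs. It is an illustration only and not the procedure used to train this model: the function `add_noise`, the `NOISE_CHARS` alphabet, and the `rate` parameter are all hypothetical.

```python
import random

# Hypothetical noise alphabet; the symbols actually used for training are not documented here.
NOISE_CHARS = "|-^#@!&*"


def add_noise(text, rate=0.2, seed=None):
    """Return a noised copy of `text` (illustrative only, not the authors' procedure)."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        roll = rng.random()
        if roll < rate / 3:
            # Insert a random symbol before the character.
            out.append(rng.choice(NOISE_CHARS))
            out.append(ch)
        elif roll < 2 * rate / 3:
            # Insert a spurious space after the character.
            out.append(ch)
            out.append(" ")
        elif roll < rate and ch.isalpha():
            # Flip the case of a letter.
            out.append(ch.swapcase())
        else:
            # Leave the character unchanged.
            out.append(ch)
    return "".join(out)


clean = "Astronomia jest jedną z najstarszych nauk."
noisy = add_noise(clean, rate=0.3, seed=0)
print(noisy)
```

Under this sketch, a training pair would use `noisy` as the input and `clean` as the target.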

**Model input**

Model input must be preceded by the tag `denoise:`. For example, if you have the text:
```
As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u k.
```

then the input to the model must be constructed as follows:

```
denoise: As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u k.
```

**Sample model usage**

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer


def do_inference(text, model, tokenizer):
    # The model expects the `denoise:` tag in front of the noisy text.
    input_text = f"denoise: {text}"

    # Tokenize; inputs longer than 256 tokens are truncated.
    inputs = tokenizer.encode(
        input_text,
        return_tensors="pt",
        max_length=256,
        padding="max_length",
        truncation=True,
    )

    # Generate the cleaned text with beam search.
    corrected_ids = model.generate(
        inputs,
        max_length=256,
        num_beams=5,
        early_stopping=True,
    )

    # Decode the best sequence and drop special tokens.
    corrected_sentence = tokenizer.decode(corrected_ids[0], skip_special_tokens=True)
    return corrected_sentence


model = T5ForConditionalGeneration.from_pretrained("radlab/polish-denoiser-t5-base")
tokenizer = T5Tokenizer.from_pretrained("radlab/polish-denoiser-t5-base")

text_str = "As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u k."
print(do_inference(text_str, model, tokenizer))
```

The model response for the **input**:
```
denoise: As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u k.
```
is:
```
Astronomia jest jedną z najstarszych nauk.
```
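
The sample above truncates inputs to 256 tokens, so longer texts would lose their tail. One possible workaround, sketched below, is to split the text into smaller pieces and denoise each piece separately. The helpers `split_into_chunks` and `denoise_long_text` and the `max_words` budget are hypothetical; the word count is only a rough stand-in for the token limit, and a chunk boundary may fall inside a word that the noise has broken apart.

```python
def split_into_chunks(text, max_words=40):
    """Split `text` into pieces of at most `max_words` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def denoise_long_text(text, denoise_fn, max_words=40):
    """Apply `denoise_fn` to each chunk and join the results with spaces."""
    chunks = split_into_chunks(text, max_words)
    return " ".join(denoise_fn(chunk) for chunk in chunks)


# Demonstration with a stand-in function instead of the real model.
print(split_into_chunks("a b c d e", max_words=2))
print(denoise_long_text("a b c d e", str.upper, max_words=2))
```

With the model loaded as in the sample usage, the call could look like `denoise_long_text(long_text, lambda chunk: do_inference(chunk, model, tokenizer))`.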

**Evaluation**

Eval loss:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/644addfe9279988e0cbc296b/HIJI2a1nojM6lbDyYe0-A.png)

More information (in Polish) is available on our [blog](https://radlab.dev/2024/03/05/odszumiacz-tekstow/).