---
license: cc-by-sa-4.0
language:
- pl
library_name: transformers
---
This model removes noise from text.
It is intended for text that picks up noise during extraction, for example text loaded from PDF files.
The model was trained on Polish texts; the training dataset was noised automatically.
[allegro/plt5-base](https://huggingface.co/allegro/plt5-base) was used as the base model.
**Model input**
The model input must be preceded by the tag `denoise:`. For example, if you have the text:
```
As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u k.
```
then the input to the model must be constructed as follows:
```
denoise: As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u k.
```
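The prefixing step can be wrapped in a small helper. This is a minimal sketch: `build_denoise_input` is a hypothetical name, not part of the model or of `transformers`, and skipping text that already carries the tag is an assumption made here for convenience.

```python
# Hypothetical helper (not part of the model repository).
DENOISE_TAG = "denoise:"


def build_denoise_input(text: str) -> str:
    """Return the text with the `denoise:` tag in front of it."""
    stripped = text.strip()
    # Assumption: text that already starts with the tag is left unchanged,
    # so the tag is never added twice.
    if stripped.startswith(DENOISE_TAG):
        return stripped
    return f"{DENOISE_TAG} {stripped}"
```

For example, `build_denoise_input("je@st je!d &*ną")` returns `denoise: je@st je!d &*ną`.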
**Sample model usage**
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer


def do_inference(text, model, tokenizer):
    # The model expects the `denoise:` tag in front of the noisy text.
    input_text = f"denoise: {text}"
    # Tokenize, padding or truncating to a fixed length of 256 tokens.
    inputs = tokenizer.encode(
        input_text,
        return_tensors="pt",
        max_length=256,
        padding="max_length",
        truncation=True,
    )
    # Generate the cleaned text with beam search.
    corrected_ids = model.generate(
        inputs,
        max_length=256,
        num_beams=5,
        early_stopping=True,
    )
    corrected_sentence = tokenizer.decode(corrected_ids[0], skip_special_tokens=True)
    return corrected_sentence


model = T5ForConditionalGeneration.from_pretrained("radlab/polish-denoiser-t5-base")
tokenizer = T5Tokenizer.from_pretrained("radlab/polish-denoiser-t5-base")

text_str = "As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u k."
print(do_inference(text_str, model, tokenizer))
```
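Because the sample truncates input to 256 tokens, longer texts would lose their tail. One possible workaround is to split the text into smaller pieces and denoise each piece separately. The sketch below is an assumption, not the authors' method: `split_into_chunks` is a hypothetical helper, it counts whitespace-separated words as a rough proxy for tokens, and the default of 60 words is an arbitrary value that would need tuning. Noisy text may contain spaces inside words, so a chunk boundary can fall in the middle of a word.

```python
from typing import List


def split_into_chunks(text: str, max_words: int = 60) -> List[str]:
    """Split text into consecutive chunks of at most `max_words` words."""
    # Hypothetical helper: word count only approximates the token count.
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

Each chunk could then be passed to `do_inference` and the results joined with spaces.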
Model response for **input**:
```
denoise: As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u k.
```
is:
```
Astronomia jest jedną z najstarszych nauk.
```
**Evaluation**
Eval loss:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/644addfe9279988e0cbc296b/HIJI2a1nojM6lbDyYe0-A.png)
More information (in Polish) is available on our [blog](https://radlab.dev/2024/04/20/odszumiacz-tekstow/).