---
license: cc-by-sa-4.0
language:
- pl
library_name: transformers
---


This model removes noise from text.
It is useful for text that picked up artifacts while being loaded, for example text extracted from PDF files.

The model was trained on Polish texts to which noise was added automatically.
[allegro/plt5-base](https://huggingface.co/allegro/plt5-base) was used as the base model.


**Model input**

Model input must be preceded by the tag `denoise:`. For example, if you have the text:
```
As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u   k.
```

then the input to the model must be constructed as follows:

```
denoise: As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u   k.
```
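
The prefixing rule above can be wrapped in a small helper. This is a minimal sketch: the names `PREFIX` and `build_model_input` are hypothetical and not part of the model's API, and skipping text that already carries the tag is an assumption made for convenience.

```python
# Hypothetical helper, not part of the model or of transformers.
PREFIX = "denoise: "


def build_model_input(text: str) -> str:
    """Prepend the `denoise:` tag unless the text already starts with it."""
    if text.startswith(PREFIX):
        return text
    return PREFIX + text


noisy = "As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u   k."
print(build_model_input(noisy))
```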

**Sample model usage**

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer


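def do_inference(text, model, tokenizer):
    # The model expects every input to start with the "denoise:" tag.
    input_text = f"denoise: {text}"
    # Tokenize, padding or truncating the input to exactly 256 tokens.
    inputs = tokenizer.encode(
        input_text,
        return_tensors="pt",
        max_length=256,
        padding="max_length",
        truncation=True,
    )

    # Generate up to 256 output tokens using beam search with 5 beams.
    corrected_ids = model.generate(
        inputs,
        max_length=256,
        num_beams=5,
        early_stopping=True,
    )

    # Decode the first returned sequence and drop special tokens such as padding.
    corrected_sentence = tokenizer.decode(corrected_ids[0], skip_special_tokens=True)
    return corrected_sentence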


model = T5ForConditionalGeneration.from_pretrained("radlab/polish-denoiser-t5-base")
tokenizer = T5Tokenizer.from_pretrained("radlab/polish-denoiser-t5-base")

text_str = "As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u   k."
print(do_inference(text_str, model, tokenizer))

```

Model response for the **input**:
```
denoise: As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u   k.
```
is:
```
Astronomia jest jedną z najstarszych nauk.
```
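
The usage example truncates input to 256 tokens, so longer texts would lose their tail. One possible workaround, sketched here under assumptions, is to split the text into smaller pieces and denoise each piece separately. The function `split_into_chunks` and the default of 100 words per chunk are hypothetical; word count is only a rough proxy for token count, a split may fall in the middle of a word that noise has broken apart, and runs of whitespace are collapsed to single spaces.

```python
def split_into_chunks(text, max_words=100):
    """Split text into whitespace-delimited chunks of at most max_words words.

    max_words is an assumed, untuned value; noisy text can produce many
    tokens per word, so a smaller limit may be needed in practice.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# Each chunk could then be passed to do_inference() from the usage example:
# denoised = " ".join(
#     do_inference(chunk, model, tokenizer) for chunk in split_into_chunks(long_text)
# )
print(split_into_chunks("a b c d e", max_words=2))
```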


**Evaluation**

Eval loss:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/644addfe9279988e0cbc296b/HIJI2a1nojM6lbDyYe0-A.png)

More information (in Polish) is available on our [blog](https://radlab.dev/2024/04/20/odszumiacz-tekstow/).