HugoLaurencon committed
Commit d1e3e7b · 1 Parent(s): 22701ae
app.py CHANGED
@@ -13,6 +13,8 @@ import numpy as np
 
 import matplotlib.pyplot as plt
 
+from filtering import Filtering
+
 
 class Visualization:
     def __init__(
@@ -390,6 +392,9 @@ class Visualization:
         ax.set_ylabel("frequency in the documents")
         st.pyplot(fig)
 
+    def check_personal_doc(self):
+        pass
+
     def download_data(self):
         st.header("Download data")
 
@@ -408,6 +413,7 @@ class Visualization:
         self.filtering_of_words()
         self.plot_distributions_filtering_parameters()
         #self.plot_zipf_law()
+        self.check_personal_doc()
         self.download_data()
 
 
badwords.py ADDED
@@ -0,0 +1,2682 @@
+ # Merge
+ # https://github.com/zacanger/profane-words
+ # and
+ # https://github.com/thisandagain/washyourmouthoutwithsoap/blob/develop/data/build.json
+ # and
+ # https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
+
+
+ english_badwords = [
10
+ "abuse",
11
+ "anal",
12
+ "anilingus",
13
+ "anus",
14
+ "aroused",
15
+ "arse",
16
+ "arsehole",
17
+ "ass",
18
+ "asses",
19
+ "assfuck",
20
+ "asshat",
21
+ "asshole",
22
+ "assholes",
23
+ "autoerotic",
24
+ "bangbros",
25
+ "banging",
26
+ "bareback",
27
+ "bastard",
28
+ "bastards",
29
+ "bazongas",
30
+ "bbw",
31
+ "bdsm",
32
+ "biatch",
33
+ "bicurious",
34
+ "bigass",
35
+ "bigtits",
36
+ "bimbo",
37
+ "bimbos",
38
+ "bitch",
39
+ "bitches",
40
+ "bitching",
41
+ "blowjob",
42
+ "blowjobs",
43
+ "boche",
44
+ "boner",
45
+ "boners",
46
+ "boob",
47
+ "boobies",
48
+ "boobs",
49
+ "booty",
50
+ "brothel",
51
+ "buceta",
52
+ "bugger",
53
+ "buggered",
54
+ "buggery",
55
+ "bukkake",
56
+ "bule",
57
+ "buttcheeks",
58
+ "buttfuck",
59
+ "butthead",
60
+ "butthole",
61
+ "buttplug",
62
+ "cameltoe",
63
+ "camgirl",
64
+ "camwhore",
65
+ "chink",
66
+ "chinks",
67
+ "cialis",
68
+ "clit",
69
+ "clitoris",
70
+ "clits",
71
+ "clitty",
72
+ "clusterfuck",
73
+ "cock",
74
+ "cock-head",
75
+ "cockblock",
76
+ "cockfight",
77
+ "cockhead",
78
+ "cocks",
79
+ "cocksman",
80
+ "cocksucker",
81
+ "cocksucking",
82
+ "coital",
83
+ "coitus",
84
+ "coochie",
85
+ "cooly",
86
+ "coon",
87
+ "coons",
88
+ "copulate",
89
+ "cowgirl",
90
+ "crabs",
91
+ "creampie",
92
+ "cum",
93
+ "cumming",
94
+ "cums",
95
+ "cumshot",
96
+ "cumshots",
97
+ "cumslut",
98
+ "cunnilingus",
99
+ "cunny",
100
+ "cunt",
101
+ "cunts",
102
+ "cybersex",
103
+ "darkey",
104
+ "darkie",
105
+ "darkies",
106
+ "darky",
107
+ "deepthroat",
108
+ "deepthroating",
109
+ "dick",
110
+ "dickhole",
111
+ "dicks",
112
+ "dildo",
113
+ "dildos",
114
+ "dogging",
115
+ "doggy-style",
116
+ "doggystyle",
117
+ "dominatrix",
118
+ "dommes",
119
+ "dong",
120
+ "dp",
121
+ "dupa",
122
+ "dyke",
123
+ "dykes",
124
+ "ecchi",
125
+ "ejaculate",
126
+ "ejaculated",
127
+ "ejaculates",
128
+ "ejaculating",
129
+ "ejaculation",
130
+ "ejaculations",
131
+ "enema",
132
+ "erect",
133
+ "erection",
134
+ "ero",
135
+ "erotic",
136
+ "erotism",
137
+ "escort",
138
+ "fag",
139
+ "fagging",
140
+ "faggot",
141
+ "fagot",
142
+ "fagots",
143
+ "fags",
144
+ "felch",
145
+ "fellate",
146
+ "fellatio",
147
+ "femdom",
148
+ "fetish",
149
+ "figging",
150
+ "fingerbang",
151
+ "fingering",
152
+ "fisted",
153
+ "fister",
154
+ "fisting",
155
+ "floozy",
156
+ "fondle",
157
+ "footfetish",
158
+ "footjob",
159
+ "foreskin",
160
+ "fornicate",
161
+ "foursome",
162
+ "fuck",
163
+ "fuckable",
164
+ "fuckbook",
165
+ "fuckboy",
166
+ "fuckbuddy",
167
+ "fucked",
168
+ "fucker",
169
+ "fuckers",
170
+ "fuckfest",
171
+ "fuckhole",
172
+ "fuckin",
173
+ "fucking",
174
+ "fucks",
175
+ "fuk",
176
+ "fukin",
177
+ "fuking",
178
+ "g-spot",
179
+ "gangbang",
180
+ "gangbanged",
181
+ "gangbanger",
182
+ "gangbangs",
183
+ "genital",
184
+ "genitals",
185
+ "gigolo",
186
+ "glans",
187
+ "gonad",
188
+ "gonads",
189
+ "gook",
190
+ "gringo",
191
+ "gringos",
192
+ "grope",
193
+ "gspot",
194
+ "guido",
195
+ "handjob",
196
+ "haole",
197
+ "hapa",
198
+ "hardcore",
199
+ "hardon",
200
+ "harem",
201
+ "hentai",
202
+ "hindoo",
203
+ "hoe",
204
+ "hoes",
205
+ "honky",
206
+ "hooker",
207
+ "hookers",
208
+ "hooter",
209
+ "hooters",
210
+ "hori",
211
+ "horndog",
212
+ "horney",
213
+ "horniest",
214
+ "horny",
215
+ "humped",
216
+ "humper",
217
+ "humping",
218
+ "hussy",
219
+ "hymen",
220
+ "ikey",
221
+ "incest",
222
+ "injun",
223
+ "intercourse",
224
+ "interracial",
225
+ "jack-off",
226
+ "jackoff",
227
+ "jailbait",
228
+ "jerk-off",
229
+ "jerkoff",
230
+ "jiggy",
231
+ "jism",
232
+ "jizz",
233
+ "jizzed",
234
+ "kaffir",
235
+ "kafir",
236
+ "kike",
237
+ "kikes",
238
+ "kinkster",
239
+ "kinky",
240
+ "kkk",
241
+ "klan",
242
+ "kraut",
243
+ "labia",
244
+ "lapdance",
245
+ "libido",
246
+ "licker",
247
+ "licking",
248
+ "limey",
249
+ "lingerie",
250
+ "livesex",
251
+ "lolita",
252
+ "lovemaking",
253
+ "lust",
254
+ "lusting",
255
+ "masochist",
256
+ "masterbate",
257
+ "masterbating",
258
+ "masterbation",
259
+ "masturbate",
260
+ "masturbating",
261
+ "masturbation",
262
+ "milf",
263
+ "minge",
264
+ "missionary",
265
+ "molest",
266
+ "molestation",
267
+ "molester",
268
+ "munging",
269
+ "muschi",
270
+ "nads",
271
+ "naked",
272
+ "necked",
273
+ "necro",
274
+ "negress",
275
+ "negro",
276
+ "negroes",
277
+ "negroid",
278
+ "negros",
279
+ "nig",
280
+ "nigar",
281
+ "nigga",
282
+ "niggas",
283
+ "niggaz",
284
+ "nigger",
285
+ "niggers",
286
+ "nigra",
287
+ "nipple",
288
+ "nipples",
289
+ "nookie",
290
+ "nooky",
291
+ "nooner",
292
+ "nude",
293
+ "nudie",
294
+ "nudity",
295
+ "nymph",
296
+ "nympho",
297
+ "nymphomania",
298
+ "orgasim",
299
+ "orgasm",
300
+ "orgasms",
301
+ "orgies",
302
+ "orgy",
303
+ "orifice",
304
+ "p0rn",
305
+ "paedophile",
306
+ "pantie",
307
+ "panties",
308
+ "panty",
309
+ "pastie",
310
+ "pecker",
311
+ "pedo",
312
+ "pedophile",
313
+ "pedophilia",
314
+ "pedophiliac",
315
+ "peeper",
316
+ "peepshow",
317
+ "pegging",
318
+ "penetrate",
319
+ "penetration",
320
+ "penile",
321
+ "penis",
322
+ "penises",
323
+ "penus",
324
+ "perv",
325
+ "phallic",
326
+ "phonesex",
327
+ "pickaninnies",
328
+ "pimp",
329
+ "playboy",
330
+ "playgirl",
331
+ "poontang",
332
+ "porn",
333
+ "porno",
334
+ "pornography",
335
+ "pornos",
336
+ "pr0n",
337
+ "premature",
338
+ "preteen",
339
+ "pron",
340
+ "prostitute",
341
+ "pube",
342
+ "pubes",
343
+ "pubic",
344
+ "pubis",
345
+ "punani",
346
+ "pussies",
347
+ "pussy",
348
+ "pussys",
349
+ "pusy",
350
+ "puta",
351
+ "puto",
352
+ "queef",
353
+ "quickie",
354
+ "quicky",
355
+ "quim",
356
+ "randy",
357
+ "rape",
358
+ "raped",
359
+ "raper",
360
+ "raping",
361
+ "rapist",
362
+ "rectum",
363
+ "redneck",
364
+ "rednecks",
365
+ "redskin",
366
+ "redskins",
367
+ "rimjob",
368
+ "rimming",
369
+ "russki",
370
+ "s&m",
371
+ "sadism",
372
+ "sadist",
373
+ "sambo",
374
+ "santorum",
375
+ "schlong",
376
+ "scissoring",
377
+ "semen",
378
+ "sex",
379
+ "sexed",
380
+ "sexi",
381
+ "sexing",
382
+ "sexo",
383
+ "sexpot",
384
+ "sextoy",
385
+ "sexual",
386
+ "sexually",
387
+ "sexx",
388
+ "sexxx",
389
+ "sexxxy",
390
+ "sexxy",
391
+ "sexy",
392
+ "sh!t",
393
+ "sh1t",
394
+ "shagging",
395
+ "shemale",
396
+ "sissy",
397
+ "skank",
398
+ "skanks",
399
+ "slapper",
400
+ "slut",
401
+ "sluts",
402
+ "slutting",
403
+ "slutty",
404
+ "smut",
405
+ "smutty",
406
+ "sodomise",
407
+ "sodomite",
408
+ "sodomize",
409
+ "sodomy",
410
+ "spank",
411
+ "sperm",
412
+ "spic",
413
+ "spick",
414
+ "splooge",
415
+ "spooge",
416
+ "squaw",
417
+ "squirting",
418
+ "steamy",
419
+ "stiffy",
420
+ "strapon",
421
+ "suck",
422
+ "sucked",
423
+ "sucker",
424
+ "sucking",
425
+ "sucks",
426
+ "swallow",
427
+ "swallower",
428
+ "swinger",
429
+ "teabagging",
430
+ "testical",
431
+ "testicle",
432
+ "testicles",
433
+ "testis",
434
+ "threesome",
435
+ "threeway",
436
+ "titfuck",
437
+ "titjob",
438
+ "tits",
439
+ "tittie",
440
+ "titties",
441
+ "titty",
442
+ "tittyfuck",
443
+ "tity",
444
+ "toots",
445
+ "topless",
446
+ "trannie",
447
+ "tranny",
448
+ "tribadism",
449
+ "twat",
450
+ "twats",
451
+ "undies",
452
+ "undressing",
453
+ "upskirt",
454
+ "vag",
455
+ "vagina",
456
+ "vaginal",
457
+ "viagra",
458
+ "vibrator",
459
+ "virgin",
460
+ "vixen",
461
+ "voyeur",
462
+ "vulva",
463
+ "wank",
464
+ "wanker",
465
+ "wanking",
466
+ "wazoo",
467
+ "wedgie",
468
+ "wench",
469
+ "wetback",
470
+ "whore",
471
+ "whored",
472
+ "whorehouse",
473
+ "whores",
474
+ "whoring",
475
+ "wigger",
476
+ "willie",
477
+ "willies",
478
+ "willy",
479
+ "wog",
480
+ "wop",
481
+ "x-rated",
482
+ "xxx",
483
+ "xxxxxx",
484
+ "yaoi",
485
+ "yid",
486
+ "zoophile",
487
+ "zoophilia",
488
+ ]
489
+
490
+ badwords = {
491
+ "ar": english_badwords
492
+ + [
493
+ "احتلام",
494
+ "اغتصاب",
495
+ "بز",
496
+ "بزاز",
497
+ "بظر",
498
+ "بيضان",
499
+ "تمص",
500
+ "ثدي",
501
+ "جماع",
502
+ "حلمة",
503
+ "خنثي",
504
+ "خول",
505
+ "زب",
506
+ "سحاق",
507
+ "سحاقية",
508
+ "سكس",
509
+ "شاذ",
510
+ "شرج",
511
+ "شرموطة",
512
+ "شهوة",
513
+ "طيز",
514
+ "عاهرة",
515
+ "عرص",
516
+ "فرج",
517
+ "قحبة",
518
+ "قضيب",
519
+ "كس",
520
+ "لبوة",
521
+ "لحس",
522
+ "لعق",
523
+ "لواط",
524
+ "لوطي",
525
+ "مبادل",
526
+ "متناك",
527
+ "متناكة",
528
+ "مص",
529
+ "مفلقسة",
530
+ "نيك",
531
+ ],
532
+ "ca": english_badwords
533
+ + [
534
+ "avortament",
535
+ "anal",
536
+ "anus",
537
+ "cul",
538
+ "ass-fucker",
539
+ "asss",
540
+ "asshole",
541
+ "assholes",
542
+ "bolera",
543
+ "boles",
544
+ "bastardo",
545
+ "bellend",
546
+ "bestial",
547
+ "bestialitat",
548
+ "puta",
549
+ "femelles",
550
+ "picant",
551
+ "sagnant",
552
+ "mamada",
553
+ "bollok",
554
+ "boob",
555
+ "pits",
556
+ "buceta",
557
+ "bum",
558
+ "culata",
559
+ "catifa muncher",
560
+ "picar",
561
+ "cipa",
562
+ "clitoris",
563
+ "polla",
564
+ "galletejador",
565
+ "gallines",
566
+ "coon",
567
+ "merda",
568
+ "cum",
569
+ "correguda",
570
+ "cunillingus",
571
+ "boig",
572
+ "maleït",
573
+ "consolador",
574
+ "consoladors",
575
+ "dink",
576
+ "canalla",
577
+ "duche",
578
+ "dique",
579
+ "ejaculació",
580
+ "ejaculat",
581
+ "ejacula",
582
+ "ejaculant",
583
+ "fag",
584
+ "fagging",
585
+ "fagot",
586
+ "fagots",
587
+ "fanny",
588
+ "felching",
589
+ "fel.lació",
590
+ "brida",
591
+ "follar",
592
+ "follat",
593
+ "escuradents",
594
+ "follant",
595
+ "folles",
596
+ "fucks",
597
+ "empacadora de llaminadures",
598
+ "déu maldit",
599
+ "deu meu",
600
+ "infern",
601
+ "hore",
602
+ "córrer",
603
+ "retrocés",
604
+ "kock",
605
+ "llavis",
606
+ "lujuria",
607
+ "lució",
608
+ "masoquista",
609
+ "masturbarse",
610
+ "puta mare",
611
+ "nazi",
612
+ "nigger",
613
+ "negres",
614
+ "orgasim",
615
+ "orgasme",
616
+ "orgasmes",
617
+ "pecker",
618
+ "penis",
619
+ "piss",
620
+ "mossegat",
621
+ "pisser",
622
+ "pisses",
623
+ "pissing",
624
+ "treure de polleguera",
625
+ "caca",
626
+ "porno",
627
+ "pornografia",
628
+ "picades",
629
+ "pube",
630
+ "coques",
631
+ "gatet",
632
+ "violació",
633
+ "violador",
634
+ "recte",
635
+ "retard",
636
+ "rimming",
637
+ "sàdic",
638
+ "cargolar",
639
+ "escrot",
640
+ "semen",
641
+ "sexe",
642
+ "shag",
643
+ "borratxos",
644
+ "transsexual",
645
+ "mossegar",
646
+ "shitted",
647
+ "skank",
648
+ "smegma",
649
+ "smut",
650
+ "arrebat",
651
+ "fill de puta",
652
+ "spac",
653
+ "spunk",
654
+ "testicle",
655
+ "tit",
656
+ "tetas",
657
+ "titt",
658
+ "turd",
659
+ "vagina",
660
+ "viagra",
661
+ "vulva",
662
+ "wang",
663
+ "wank",
664
+ "x classificat",
665
+ "xxx",
666
+ ],
667
+ "en": english_badwords,
668
+ "es": english_badwords
669
+ + [
670
+ "Asesinato",
671
+ "Bollera",
672
+ "Cabrón",
673
+ "Caca",
674
+ "Chupada",
675
+ "Chupapollas",
676
+ "Chupetón",
677
+ "Concha de tu madre",
678
+ "Coprofagía",
679
+ "Coño",
680
+ "Culo",
681
+ "Drogas",
682
+ "Esperma",
683
+ "Fiesta de salchichas",
684
+ "Follador",
685
+ "Follar",
686
+ "Gilipichis",
687
+ "Gilipollas",
688
+ "Hacer una paja",
689
+ "Haciendo el amor",
690
+ "Heroína",
691
+ "Hija de puta",
692
+ "Hijaputa",
693
+ "Hijo de puta",
694
+ "Hijoputa",
695
+ "Idiota",
696
+ "Imbécil",
697
+ "Jilipollas",
698
+ "Kapullo",
699
+ "Lameculos",
700
+ "Maciza",
701
+ "Macizorra",
702
+ "Mamada",
703
+ "Marica",
704
+ "Mariconazo",
705
+ "Maricón",
706
+ "Mierda",
707
+ "Nazi",
708
+ "Orina",
709
+ "Pedo",
710
+ "Pendejo",
711
+ "Pervertido",
712
+ "Pezón",
713
+ "Pinche",
714
+ "Pis",
715
+ "Prostituta",
716
+ "Puta",
717
+ "Racista",
718
+ "Ramera",
719
+ "Semen",
720
+ "Sexo",
721
+ "Sexo oral",
722
+ "Soplagaitas",
723
+ "Soplapollas",
724
+ "Sádico",
725
+ "Tetas grandes",
726
+ "Travesti",
727
+ "Trio",
728
+ "Tía buena",
729
+ "Verga",
730
+ "Vulva",
731
+ "aborto",
732
+ "agallas",
733
+ "anal",
734
+ "ano",
735
+ "arrebatar",
736
+ "asno",
737
+ "atornillar",
738
+ "bastardo",
739
+ "bestial",
740
+ "bestialidad",
741
+ "bolas",
742
+ "bollok",
743
+ "bolsa de pelota",
744
+ "brida",
745
+ "buceta",
746
+ "cabron",
747
+ "cagadas",
748
+ "cagado",
749
+ "cagando",
750
+ "campana",
751
+ "carajo",
752
+ "chupar la polla",
753
+ "cipa",
754
+ "clítoris",
755
+ "concha",
756
+ "consolador",
757
+ "consoladores",
758
+ "corrida",
759
+ "coño",
760
+ "coños",
761
+ "culo",
762
+ "culos",
763
+ "cunillingus",
764
+ "córneo",
765
+ "de mierda",
766
+ "dique",
767
+ "duche",
768
+ "enojado",
769
+ "escroto",
770
+ "espacio",
771
+ "estúpido",
772
+ "extremo",
773
+ "eyacula",
774
+ "eyaculación",
775
+ "eyaculado",
776
+ "eyacular",
777
+ "fagging",
778
+ "felación",
779
+ "felching",
780
+ "folla",
781
+ "follada",
782
+ "follador de culo",
783
+ "folladores",
784
+ "follar",
785
+ "fudge packer",
786
+ "gallos",
787
+ "grieta",
788
+ "hacerse una paja",
789
+ "hijo de puta",
790
+ "hore",
791
+ "infierno",
792
+ "kock",
793
+ "labios vaginales",
794
+ "los pechos",
795
+ "lujuria",
796
+ "madre folladora",
797
+ "maldita sea",
798
+ "maldito",
799
+ "maldito sea",
800
+ "mamada",
801
+ "mapache",
802
+ "maricones",
803
+ "maricón",
804
+ "martillo",
805
+ "masoquista",
806
+ "masturbarse",
807
+ "mear",
808
+ "mierda",
809
+ "molesto",
810
+ "muncher alfombra",
811
+ "nazi",
812
+ "negro",
813
+ "niggers",
814
+ "orgasimo",
815
+ "orgasmo",
816
+ "orgasmos",
817
+ "orinando",
818
+ "pelusa",
819
+ "pene",
820
+ "perra",
821
+ "perras",
822
+ "perro follador",
823
+ "pinchazo",
824
+ "pinchazos",
825
+ "pisser",
826
+ "polla",
827
+ "porno",
828
+ "pornografía",
829
+ "pube",
830
+ "puta",
831
+ "putas",
832
+ "pájaro carpintero",
833
+ "quejas",
834
+ "recto",
835
+ "retardar",
836
+ "rimming",
837
+ "sangriento",
838
+ "semen",
839
+ "sexo",
840
+ "skank",
841
+ "smegma",
842
+ "sádico",
843
+ "testículo",
844
+ "teta",
845
+ "tetas",
846
+ "tirón",
847
+ "tizón",
848
+ "tonto",
849
+ "transexual",
850
+ "vagina",
851
+ "vete a la mierda",
852
+ "viagra",
853
+ "violación",
854
+ "violador",
855
+ "vulva",
856
+ "wang",
857
+ "x clasificado",
858
+ "xxx",
859
+ "zurullo",
860
+ ],
861
+ "eu": english_badwords
862
+ + [
863
+ "abortu",
864
+ "anal",
865
+ "ipurdi",
866
+ "kabroi",
867
+ "puta",
868
+ "clitoris",
869
+ "cunillingus",
870
+ "madarikatu",
871
+ "zakil",
872
+ "hazia isuri",
873
+ "arraio",
874
+ "izorratu",
875
+ "infernu",
876
+ "emagaldu",
877
+ "lizunkeri",
878
+ "lizun",
879
+ "masokista",
880
+ "masturbatu",
881
+ "nazi",
882
+ "beltz",
883
+ "orgasmo",
884
+ "pixa",
885
+ "porno",
886
+ "pornografia",
887
+ "alu",
888
+ "bortxaketa",
889
+ "bortxatzaile",
890
+ "sadista",
891
+ "ipurzulo",
892
+ "hazi",
893
+ "semen",
894
+ "sexu",
895
+ "kaka",
896
+ "putaseme",
897
+ "barrabil",
898
+ "titi",
899
+ "bagina",
900
+ "viagra",
901
+ ],
902
+ "fr": english_badwords
903
+ + [
904
+ "MALPT",
905
+ "anal",
906
+ "anus",
907
+ "arracher",
908
+ "avortement",
909
+ "baise",
910
+ "baiser",
911
+ "baiseur de chien",
912
+ "baiseurs",
913
+ "baisée",
914
+ "bander",
915
+ "bellend",
916
+ "bestial",
917
+ "bestialité",
918
+ "bigornette",
919
+ "bite",
920
+ "bitte",
921
+ "bloblos",
922
+ "bollok",
923
+ "boob",
924
+ "bordel",
925
+ "bourré",
926
+ "bourrée",
927
+ "bout",
928
+ "brackmard",
929
+ "branlage",
930
+ "branler",
931
+ "branlette",
932
+ "branleur",
933
+ "branleuse",
934
+ "bride",
935
+ "brouter le cresson",
936
+ "buceta",
937
+ "caca",
938
+ "chatte",
939
+ "chattes",
940
+ "chiasse",
941
+ "chienne",
942
+ "chiennes",
943
+ "chier",
944
+ "chiottes",
945
+ "chié",
946
+ "cipa",
947
+ "clito",
948
+ "clitoris",
949
+ "clochard",
950
+ "cochonneries",
951
+ "con",
952
+ "connard",
953
+ "connards",
954
+ "connasse",
955
+ "conne",
956
+ "convoitise",
957
+ "coq",
958
+ "coqs",
959
+ "corné",
960
+ "couilles",
961
+ "cramouille",
962
+ "cran",
963
+ "cul",
964
+ "culs",
965
+ "cunillingus",
966
+ "damné",
967
+ "des balles",
968
+ "digue",
969
+ "duché",
970
+ "déconne",
971
+ "déconner",
972
+ "emballeur de fudge",
973
+ "emmerdant",
974
+ "emmerder",
975
+ "emmerdeur",
976
+ "emmerdeuse",
977
+ "enculer",
978
+ "enculeur",
979
+ "enculeurs",
980
+ "enculé",
981
+ "enculée",
982
+ "enfer",
983
+ "enfoiré",
984
+ "enfoirée",
985
+ "espacer",
986
+ "fagging",
987
+ "fagot",
988
+ "fagots",
989
+ "faire chier",
990
+ "fellation",
991
+ "fente",
992
+ "fille de pute",
993
+ "fils de pute",
994
+ "folle",
995
+ "foutre",
996
+ "fuckings",
997
+ "gerbe",
998
+ "gerber",
999
+ "godemiché",
1000
+ "godes",
1001
+ "gouine",
1002
+ "grande folle",
1003
+ "grogniasse",
1004
+ "gueule",
1005
+ "hore",
1006
+ "jouir",
1007
+ "kock",
1008
+ "la putain de ta mère",
1009
+ "les lèvres",
1010
+ "les seins",
1011
+ "luxure",
1012
+ "masochiste",
1013
+ "masturber",
1014
+ "merde",
1015
+ "merdeuse",
1016
+ "merdeux",
1017
+ "merdique",
1018
+ "meuf",
1019
+ "mère enculée",
1020
+ "ménage à trois",
1021
+ "mésange",
1022
+ "nazi",
1023
+ "negro",
1024
+ "nique ta mère",
1025
+ "nique ta race",
1026
+ "nègre",
1027
+ "nègres",
1028
+ "orgasim",
1029
+ "orgasme",
1030
+ "orgasmes",
1031
+ "palucher",
1032
+ "penchant",
1033
+ "pipe",
1034
+ "pipi",
1035
+ "piquer",
1036
+ "piqûres",
1037
+ "pisse",
1038
+ "pisser",
1039
+ "porno",
1040
+ "pornographie",
1041
+ "pouffiasse",
1042
+ "pousse-crotte",
1043
+ "pube",
1044
+ "putain",
1045
+ "putain de",
1046
+ "pute",
1047
+ "pédale",
1048
+ "pédé",
1049
+ "pénis",
1050
+ "péter",
1051
+ "queue",
1052
+ "quéquette",
1053
+ "ramoner",
1054
+ "rectum",
1055
+ "retard",
1056
+ "rimming",
1057
+ "râpé",
1058
+ "sac de billes",
1059
+ "sac à foutre",
1060
+ "sac à merde",
1061
+ "sadique",
1062
+ "salaud",
1063
+ "salope",
1064
+ "salopes",
1065
+ "sanglant",
1066
+ "scrotum",
1067
+ "se branler",
1068
+ "seins",
1069
+ "sexe",
1070
+ "skank",
1071
+ "smegma",
1072
+ "sperme",
1073
+ "suce",
1074
+ "suceuse",
1075
+ "tanche",
1076
+ "tapette",
1077
+ "tapis muncher",
1078
+ "testicule",
1079
+ "teuch",
1080
+ "titt",
1081
+ "transexuelle",
1082
+ "tremper",
1083
+ "tringler",
1084
+ "trique",
1085
+ "troncher",
1086
+ "trou du cul",
1087
+ "turlute",
1088
+ "vagin",
1089
+ "viagra",
1090
+ "violeur",
1091
+ "vulve",
1092
+ "wang",
1093
+ "x évalué",
1094
+ "xxx",
1095
+ "zigounette",
1096
+ "zizi",
1097
+ "zut",
1098
+ "éjaculant",
1099
+ "éjaculation",
1100
+ "éjacule",
1101
+ "éjaculer",
1102
+ "éjaculé",
1103
+ "étron",
1104
+ ],
1105
+ "hi": english_badwords
1106
+ + [
1107
+ "aand",
1108
+ "aandu",
1109
+ "balatkar",
1110
+ "balatkari",
1111
+ "behen chod",
1112
+ "beti chod",
1113
+ "bhadva",
1114
+ "bhadve",
1115
+ "bhandve",
1116
+ "bhangi",
1117
+ "bhootni ke",
1118
+ "bhosad",
1119
+ "bhosadi ke",
1120
+ "bitching",
1121
+ "blowjob",
1122
+ "bollok",
1123
+ "boobe",
1124
+ "buceta",
1125
+ "chakke",
1126
+ "chinaal",
1127
+ "chinki",
1128
+ "chod",
1129
+ "chodu",
1130
+ "chodu bhagat",
1131
+ "chooche",
1132
+ "choochi",
1133
+ "choope",
1134
+ "choot",
1135
+ "choot ke baal",
1136
+ "chootia",
1137
+ "chootiya",
1138
+ "chuche",
1139
+ "chuchi",
1140
+ "chudaap",
1141
+ "chudai khanaa",
1142
+ "chudam chudai",
1143
+ "chude",
1144
+ "chut",
1145
+ "chut ka chuha",
1146
+ "chut ka churan",
1147
+ "chut ka mail",
1148
+ "chut ke baal",
1149
+ "chut ke dhakkan",
1150
+ "chut maarli",
1151
+ "chutad",
1152
+ "chutadd",
1153
+ "chutan",
1154
+ "chutia",
1155
+ "chutiya",
1156
+ "cipa",
1157
+ "cunillingus",
1158
+ "dink",
1159
+ "duche",
1160
+ "ejaculated",
1161
+ "ejaculates",
1162
+ "ejaculating",
1163
+ "fagging",
1164
+ "fagots",
1165
+ "felching",
1166
+ "fuckers",
1167
+ "fuckings",
1168
+ "fucks",
1169
+ "gaand",
1170
+ "gaandfat",
1171
+ "gaandmasti",
1172
+ "gaandufad",
1173
+ "gandfattu",
1174
+ "gandu",
1175
+ "gashti",
1176
+ "gasti",
1177
+ "ghassa",
1178
+ "ghasti",
1179
+ "gucchi",
1180
+ "gucchu",
1181
+ "harami",
1182
+ "haramzade",
1183
+ "hawas",
1184
+ "hawas ke pujari",
1185
+ "hijda",
1186
+ "hijra",
1187
+ "jhant",
1188
+ "jhant chaatu",
1189
+ "jhant ka keeda",
1190
+ "jhant ke baal",
1191
+ "jhant ke pissu",
1192
+ "jhantu",
1193
+ "kamine",
1194
+ "kaminey",
1195
+ "kanjar",
1196
+ "kutta",
1197
+ "kutta kamina",
1198
+ "kutte ki aulad",
1199
+ "kutte ki jat",
1200
+ "kuttiya",
1201
+ "loda",
1202
+ "lodu",
1203
+ "lund",
1204
+ "lund choos",
1205
+ "lund ka bakkal",
1206
+ "lund khajoor",
1207
+ "lundtopi",
1208
+ "lundure",
1209
+ "lusting",
1210
+ "maa ki chut",
1211
+ "maal",
1212
+ "madar chod",
1213
+ "madarchod",
1214
+ "madhavchod",
1215
+ "masochist",
1216
+ "mooh mein le",
1217
+ "mutth",
1218
+ "mutthal",
1219
+ "najayaz",
1220
+ "najayaz aulaad",
1221
+ "najayaz paidaish",
1222
+ "orgasim",
1223
+ "paki",
1224
+ "pataka",
1225
+ "patakha",
1226
+ "pisser",
1227
+ "pisses",
1228
+ "pissing",
1229
+ "pube",
1230
+ "pussies",
1231
+ "raand",
1232
+ "randaap",
1233
+ "randi",
1234
+ "randi rona",
1235
+ "rimming",
1236
+ "saala",
1237
+ "saala kutta",
1238
+ "saali kutti",
1239
+ "saali randi",
1240
+ "shagging",
1241
+ "shite",
1242
+ "shitted",
1243
+ "shitting",
1244
+ "shitty",
1245
+ "skank",
1246
+ "sluts",
1247
+ "spac",
1248
+ "suar",
1249
+ "suar ke lund",
1250
+ "suar ki aulad",
1251
+ "tatte",
1252
+ "tatti",
1253
+ "teri maa ka bhosada",
1254
+ "teri maa ka boba chusu",
1255
+ "teri maa ki behenchod ",
1256
+ "teri maa ki chut",
1257
+ "tharak",
1258
+ "tharki",
1259
+ "titt",
1260
+ "tu chuda",
1261
+ "turd",
1262
+ "wank",
1263
+ "xxx",
1264
+ "अंडकोश की थैली",
1265
+ "अंडा",
1266
+ "अरे नहीं",
1267
+ "अश्लील",
1268
+ "उल्लू",
1269
+ "एक्स रेटेड",
1270
+ "ओगाज़्म",
1271
+ "कमबख्त",
1272
+ "काम करना",
1273
+ "कामोद्दीपक चित्र",
1274
+ "कालीन का चूरा",
1275
+ "किन्नर",
1276
+ "कुतिया",
1277
+ "कुत्ते-कमीने",
1278
+ "कून",
1279
+ "कॉक",
1280
+ "गड़बड़",
1281
+ "गधा कमीने",
1282
+ "गधे",
1283
+ "गर्भपात",
1284
+ "गुदा",
1285
+ "गेंद का थैला",
1286
+ "गेंदों",
1287
+ "गोली चलाने की आवाज़",
1288
+ "घटिया इंसान",
1289
+ "चाकलेट का रंग",
1290
+ "चिंक",
1291
+ "चुभन",
1292
+ "चूची",
1293
+ "चूतड़",
1294
+ "चोंच",
1295
+ "छीनना",
1296
+ "जी में आये करो",
1297
+ "झटका बंद",
1298
+ "ठगना पैकर",
1299
+ "डिल्डो",
1300
+ "दुष्ट",
1301
+ "दूर जाने का अभद्र संकेत देना",
1302
+ "धत् तेरे की",
1303
+ "नरक",
1304
+ "नाजी",
1305
+ "निकला हुआ किनारा",
1306
+ "नितंब",
1307
+ "पंगा लेना",
1308
+ "पिछाड़ी",
1309
+ "पीड़न कामुक",
1310
+ "पेशाब",
1311
+ "पॉर्न",
1312
+ "फटना",
1313
+ "फूहड़",
1314
+ "बकवास",
1315
+ "बट",
1316
+ "बलात्कार",
1317
+ "बहुत मदहोश",
1318
+ "बांध",
1319
+ "बिल्ली",
1320
+ "बेल अंत",
1321
+ "बेवकूफों",
1322
+ "बोल पड़ना",
1323
+ "भगवान-शापित",
1324
+ "भगशेफ",
1325
+ "मल",
1326
+ "मलाशय",
1327
+ "माँ कमीने",
1328
+ "मुखमैथुन",
1329
+ "मुर्गा",
1330
+ "मुर्गा के",
1331
+ "मुर्गा चूसने वाला",
1332
+ "मूर्ख",
1333
+ "मैल",
1334
+ "योनि",
1335
+ "योनी",
1336
+ "यौन-संबंध",
1337
+ "रक्तरंजित",
1338
+ "लानत है",
1339
+ "लिंग",
1340
+ "लुटेरा",
1341
+ "लेबिया",
1342
+ "वहशी",
1343
+ "वहशीता",
1344
+ "वियाग्रा",
1345
+ "वीर्य",
1346
+ "वेश्या",
1347
+ "वैंग",
1348
+ "वो साले",
1349
+ "शिफ़्ट को",
1350
+ "शिश्नमल",
1351
+ "संभोग सुख",
1352
+ "सह",
1353
+ "सह शॉट",
1354
+ "साहस",
1355
+ "सिगरेट",
1356
+ "सींग का बना हुआ",
1357
+ "स्तन",
1358
+ "स्तनों",
1359
+ "हवस",
1360
+ "हस्तमैथुन",
1361
+ "होमोसेक्सुअल",
1362
+ "होर",
1363
+ ],
1364
+ "id": english_badwords
1365
+ + [
1366
+ "abortus",
1367
+ "anal",
1368
+ "dubur",
1369
+ "pantat",
1370
+ "bajingan",
1371
+ "keledai",
1372
+ "keparat",
1373
+ "tas bola",
1374
+ "bola",
1375
+ "bellend",
1376
+ "kejam",
1377
+ "kebinatangan",
1378
+ "menggerutu",
1379
+ "pelacur",
1380
+ "berdarah",
1381
+ "blowjob",
1382
+ "bollok",
1383
+ "dada",
1384
+ "payudara",
1385
+ "buceta",
1386
+ "gelandangan",
1387
+ "pengunyah karpet",
1388
+ "celah",
1389
+ "cipa",
1390
+ "kelentit",
1391
+ "kokang",
1392
+ "pengisap ayam",
1393
+ "ayam",
1394
+ "coon",
1395
+ "sampah",
1396
+ "air mani",
1397
+ "cumshot",
1398
+ "cunillingus",
1399
+ "vagina",
1400
+ "mengutuk",
1401
+ "kontol",
1402
+ "dildo",
1403
+ "dink",
1404
+ "anjing-keparat",
1405
+ "duche",
1406
+ "tanggul",
1407
+ "berejakulasi",
1408
+ "ejakulasi",
1409
+ "homo",
1410
+ "fagging",
1411
+ "kayu bakar",
1412
+ "penggemar",
1413
+ "felching",
1414
+ "fellatio",
1415
+ "flens",
1416
+ "brengsek",
1417
+ "kacau",
1418
+ "sialan",
1419
+ "persetan",
1420
+ "pengepakan fudge",
1421
+ "terkutuk",
1422
+ "ya tuhan",
1423
+ "neraka",
1424
+ "hore",
1425
+ "terangsang",
1426
+ "kock",
1427
+ "labia",
1428
+ "nafsu",
1429
+ "bernafsu",
1430
+ "masokis",
1431
+ "masturbasi",
1432
+ "keparat ibu",
1433
+ "nazi",
1434
+ "orang negro",
1435
+ "negro",
1436
+ "orgasim",
1437
+ "orgasme",
1438
+ "cotok",
1439
+ "penis",
1440
+ "kencing",
1441
+ "kesal",
1442
+ "pisser",
1443
+ "bikin",
1444
+ "buritan",
1445
+ "porno",
1446
+ "pornografi",
1447
+ "tusukan",
1448
+ "menusuk",
1449
+ "pube",
1450
+ "pussies",
1451
+ "memperkosa",
1452
+ "pemerkosa",
1453
+ "memperlambat",
1454
+ "rimming",
1455
+ "sadis",
1456
+ "meniduri",
1457
+ "skrotum",
1458
+ "seks",
1459
+ "bercinta",
1460
+ "waria",
1461
+ "kotoran",
1462
+ "shite",
1463
+ "kengerian",
1464
+ "dikirim",
1465
+ "buang hajat",
1466
+ "menyebalkan",
1467
+ "smegma",
1468
+ "jelaga",
1469
+ "merebut",
1470
+ "dasar bajingan",
1471
+ "ruang",
1472
+ "keberanian",
1473
+ "buah pelir",
1474
+ "titt",
1475
+ "viagra",
1476
+ "vulva",
1477
+ "wang",
1478
+ "terima kasih",
1479
+ "x diberi peringkat",
1480
+ "xxx",
1481
+ ],
1482
+ "kn": english_badwords
1483
+ + [
1484
+ "ಗರ್ಭಪಾತ",
1485
+ "ಗುದ",
1486
+ "ಗುದದ್ವಾರ",
1487
+ "ಕತ್ತೆ",
1488
+ "ಆಶ್-ಫಕರ್",
1489
+ "ಅಸ್ಹೋಲ್",
1490
+ "ಅಸೋಲೆಸ್",
1491
+ "ಬಾಲ್ಬಾಗ್",
1492
+ "ಚೆಂಡುಗಳು",
1493
+ "ಬಾಸ್ಟರ್ಡ್",
1494
+ "ಬೆಲೆಂಡ್",
1495
+ "ಮೃದ್ವಂಗಿ",
1496
+ "ಪ್ರಾಣಿಜನ್ಯತೆ",
1497
+ "ಬಿಚ್",
1498
+ "ಬಿಟ್ಚಿಸ್",
1499
+ "ಬೆಚಿಂಗ್",
1500
+ "ರಕ್ತಸಿಕ್ತ",
1501
+ "ಬ್ಲೋಜಾಬ್",
1502
+ "ಬೊಲ್ಲೊಕ್",
1503
+ "ಕುರುಚಲು ಗಿಡ",
1504
+ "ಬೂಬಿಗಳು",
1505
+ "ಸ್ತನಗಳನ್ನು",
1506
+ "ಬುಕೆಟಾ",
1507
+ "ತಿಕ",
1508
+ "ಬಟ್",
1509
+ "ಕಾರ್ಪೆಟ್ ಮಂಚರ್",
1510
+ "ಚಿಂಕ್",
1511
+ "ಸಿಪಾ",
1512
+ "ಚಂದ್ರನಾಡಿ",
1513
+ "ಕೋಳಿ",
1514
+ "ಕೋಳಿ ಸಕ್ಕರ್",
1515
+ "ಕಾಕ್ಸ್",
1516
+ "ಕೂನ್",
1517
+ "ಅಮೇಧ್ಯ",
1518
+ "ಕಮ್",
1519
+ "ಕಮ್ಶಾಟ್",
1520
+ "ಕುನಿಲ್ಲಸ್",
1521
+ "ಕಂಟ್",
1522
+ "ಡ್ಯಾಮ್",
1523
+ "ಡಿಕ್",
1524
+ "ದ್ವಿಧ್ರುವಿ",
1525
+ "dildos",
1526
+ "ಡಿಂಕ್",
1527
+ "ನಾಯಿ-ಫಕರ್",
1528
+ "ಡಚೆ",
1529
+ "ಡೈಕ್",
1530
+ "ಹೊರಹೊಮ್ಮಿಸು",
1531
+ "ಸ್ಫೂರ್ತಿ",
1532
+ "ಎಜಾಕ್ಯುಲೇಟ್ಸ್",
1533
+ "ಇಜಲಲೇಟಿಂಗ್",
1534
+ "ಉದ್ಗಾರ",
1535
+ "ತಮಾಷೆ",
1536
+ "ಮಂದಗತಿ",
1537
+ "ಮಬ್ಬು",
1538
+ "fagots",
1539
+ "ಫ್ಯಾನಿ",
1540
+ "ಹೊಡೆತ",
1541
+ "ಪತನ",
1542
+ "ಚಾಚುಪಟ್ಟಿ",
1543
+ "ಫಕ್",
1544
+ "ನಾಶವಾಗಿದ್ದನು",
1545
+ "ಫಕರ್",
1546
+ "fuckers",
1547
+ "ಫಕಿಂಗ್",
1548
+ "ಫಕಿಂಗ್ಸ್",
1549
+ "ಇಷ್ಟಪಡುತ್ತಾನೆ",
1550
+ "ಮಿಠಾಯಿ ಪ್ಯಾಕರ್",
1551
+ "ದೇವರನ್ನು ಹಾನಿಗೊಳಗಾಯಿತು",
1552
+ "ಗಾಡ್ಡಮ್",
1553
+ "ನರಕ",
1554
+ "ಹೋರ್",
1555
+ "ಮೊನಚಾದ",
1556
+ "ಜರ್ಕ್-ಆಫ್",
1557
+ "ಕೋಕ್",
1558
+ "ಯೋನಿಯ",
1559
+ "ಕಾಮ",
1560
+ "ಕಾಮುಕ",
1561
+ "ಮಾಸೋಚಿಸ್ಟ್",
1562
+ "ಹಸ್ತಮೈಥುನ ಮಾಡು",
1563
+ "ತಾಯಿ ಫಕರ್",
1564
+ "ನಾಜಿ",
1565
+ "ನಿಗರ್",
1566
+ "ನಿಗ್ಗರ್ಗಳು",
1567
+ "ಒರಾಸಿಮ್",
1568
+ "ಪರಾಕಾಷ್ಠೆ",
1569
+ "ಪರಾಕಾಷ್ಠೆಗಳನ್ನು",
1570
+ "ಪೆಕರ್",
1571
+ "ಶಿಶ್ನ",
1572
+ "ಮೂತ್ರ ವಿಸರ್ಜಿಸು",
1573
+ "ನಿರುತ್ಸಾಹಗೊಂಡಿದೆ",
1574
+ "ಪಿಸರ್",
1575
+ "ಮೂತ್ರಪಿಂಡಗಳು",
1576
+ "pissing",
1577
+ "ಪಿಸ್ಸಾಫ್",
1578
+ "ಪೂಪ್",
1579
+ "ಅಶ್ಲೀಲತೆ",
1580
+ "ಅಶ್ಲೀಲ",
1581
+ "ಚುಚ್ಚು",
1582
+ "ಪ್ರಿಕ್ಸ್",
1583
+ "ಪಬ್",
1584
+ "ಪುಸಿಗಳು",
1585
+ "ಪುಸಿ",
1586
+ "ಅತ್ಯಾಚಾರ",
1587
+ "ಅತ್ಯಾಚಾರಿ",
1588
+ "ಗುದನಾಳದ",
1589
+ "ರಿಟಾರ್ಡ್",
1590
+ "ಹಚ್ಚುವುದು",
1591
+ "ದುಃಖಗಾರ",
1592
+ "ತಿರುಗಿಸುವುದು",
1593
+ "ಸ್ಕ್ರೋಟಮ್",
1594
+ "ವೀರ್ಯ",
1595
+ "ಲೈಂಗಿಕತೆ",
1596
+ "ಶಾಗ್",
1597
+ "ಶಾಗ್ಗಿಂಗ್",
1598
+ "ಶೆಮೇಲ್",
1599
+ "ಶಿಟ್",
1600
+ "ಷೈಟ್",
1601
+ "ಶಿಟ್ಸ್",
1602
+ "shitted",
1603
+ "ಅಲುಗಾಡುವಿಕೆ",
1604
+ "ಅಸಹ್ಯ",
1605
+ "ಸ್ಕಾಂಕ್",
1606
+ "ಸೂಳೆ",
1607
+ "ಸ್ಲಟ್ಗಳು",
1608
+ "ಸ್ಮೆಗ್ಮಾ",
1609
+ "ಕೊಳೆತ",
1610
+ "ಸ್ನ್ಯಾಚ್",
1611
+ "ಮಗ-ಆಫ್-ಬಿಚ್",
1612
+ "spac",
1613
+ "ಉಬ್ಬು",
1614
+ "ವೃಷಣ",
1615
+ "ಟಿಟ್",
1616
+ "ಚೇಕಡಿ ಹಕ್ಕಿಗಳು",
1617
+ "turd",
1618
+ "ಯೋನಿ",
1619
+ "ವಯಾಗ್ರ",
1620
+ "ವಾಂಗ್",
1621
+ "ಮುಷ್ಕರ",
1622
+ "x ರೇಟೆಡ್",
1623
+ "xxx",
1624
+ ],
1625
+ "ml": english_badwords
1626
+ + [
1627
+ "ഗർഭഛിദ്രം",
1628
+ "വിശപ്പ്",
1629
+ "മലദ്വാരം",
1630
+ "കഴുത",
1631
+ "അസി ഫക്കർ",
1632
+ "കഴുതകളെ",
1633
+ "ആസ്ഹോൾ",
1634
+ "അശ്ളീലങ്ങൾ",
1635
+ "ബോൾബാഗ്",
1636
+ "പന്തുകൾ",
1637
+ "തന്തയില്ലാത്തവൻ",
1638
+ "ബെല്ലെൻഡ്",
1639
+ "മൃഗീയമായ",
1640
+ "മൃഗീയത",
1641
+ "ബിച്ച്",
1642
+ "ബിച്ചുകൾ",
1643
+ "ബിപിഡിംഗ്",
1644
+ "രക്തരൂക്ഷിതമായ",
1645
+ "ആശ്വാസം",
1646
+ "ബലോക്ക്",
1647
+ "ബോബ്",
1648
+ "പൂക്കൾ",
1649
+ "സ്തനങ്ങൾ",
1650
+ "ബ്യൂട്ടാ",
1651
+ "ബം",
1652
+ "മയക്കുമരുന്ന്",
1653
+ "പരവതാനി മാൻച്ചർ",
1654
+ "ചുംബ്",
1655
+ "സിപാ",
1656
+ "ക്ലോറിസിസ്",
1657
+ "കോക്ക്",
1658
+ "കോക്ക് സക്കർ",
1659
+ "കോക്സ്",
1660
+ "കോൺ",
1661
+ "ക്രാപ്പ്",
1662
+ "ശുക്ലം",
1663
+ "പുരുഷാരം",
1664
+ "സി",
1665
+ "മുഷിഞ്ഞ",
1666
+ "കഷ്ടം",
1667
+ "ഡിക്ക്",
1668
+ "ഡിൽഡോ",
1669
+ "dildos",
1670
+ "ഡൈൻ",
1671
+ "നായ-ഫക്കർ",
1672
+ "ഡച്ച്",
1673
+ "ഡൈകെ",
1674
+ "ശമിപ്പിക്കുക",
1675
+ "മോഷ്ടിച്ചു",
1676
+ "വികാരങ്ങൾ",
1677
+ "വിരസത",
1678
+ "മടി",
1679
+ "ക്ഷീണിപ്പിക്കുക",
1680
+ "fagot",
1681
+ "വഞ്ചന",
1682
+ "ഫാനി",
1683
+ "വേദന",
1684
+ "flange",
1685
+ "ഊമ്പി",
1686
+ "സംഭോഗം ചെയ്യുക",
1687
+ "ഫക്കർ",
1688
+ "നർമ്മം",
1689
+ "ഫഡ്ജ് പാക്കർ",
1690
+ "ദൈവം-കൊള്ളിത",
1691
+ "ഗോഡ്ഡം",
1692
+ "നരകം",
1693
+ "വയ്ക്കുക",
1694
+ "വൃത്തികെട്ട",
1695
+ "ജെർക് ഓഫ്",
1696
+ "കിക്ക്",
1697
+ "ലാബിയ",
1698
+ "മോഹം",
1699
+ "മോഹഭംഗം",
1700
+ "മാസോച്ചിസ്റ്റ്",
1701
+ "സ്വയംഭോഗം ചെയ്യുക",
1702
+ "അമ്മ ഫക്കർ",
1703
+ "നാസി",
1704
+ "നിഗർ",
1705
+ "മയക്കുമരുന്നുകൾ",
1706
+ "രതിമൂർച്ഛ",
1707
+ "പെക്കർ",
1708
+ "ലിംഗം",
1709
+ "മൂത്രമൊഴിക്കുക",
1710
+ "കുഴഞ്ഞുവീഴുന്നു",
1711
+ "പിസ്സർ",
1712
+ "പിസ്സകൾ",
1713
+ "pissing",
1714
+ "പിസ്സോഫ്",
1715
+ "poop",
1716
+ "അശ്ലീലം",
1717
+ "അശ്ലീലത",
1718
+ "പ്രാവി",
1719
+ "വിസർജ്യങ്ങൾ",
1720
+ "പ്യൂബ്",
1721
+ "pussies",
1722
+ "pussy",
1723
+ "ബലാൽസംഗം",
1724
+ "ബലാത്സംഗം",
1725
+ "മലാശയം",
1726
+ "തുടരുക",
1727
+ "റിമ്മിംഗ്",
1728
+ "സചിസ്റ്റ്",
1729
+ "വഞ്ചി",
1730
+ "പുല്ല്",
1731
+ "ബീജം",
1732
+ "ശവം",
1733
+ "ഷാഗിംഗ്",
1734
+ "അവൾ",
1735
+ "ഷീറ്റ്",
1736
+ "ഷെയ്റ്റ്",
1737
+ "shits",
1738
+ "തിന്നിട്ടില്ല",
1739
+ "ഷോർട്ട്",
1740
+ "ഷൈറ്റി",
1741
+ "സ്കാൻ",
1742
+ "മന്ദഹസരം",
1743
+ "സ്നെഗമാ",
1744
+ "പുഞ്ചിരി",
1745
+ "പിടിക്കുക",
1746
+ "വെറുക്കപ്പെട്ടയാൾ",
1747
+ "സ്പെയ്ക്",
1748
+ "തുളച്ച്",
1749
+ "വൃഷണം",
1750
+ "പേ",
1751
+ "ടിത്ത്",
1752
+ "കുഴപ്പമില്ല",
1753
+ "യോനി",
1754
+ "വരാഗ്ര",
1755
+ "വാൽവ",
1756
+ "വാങ്",
1757
+ "വാൻ",
1758
+ "വേശ്യ",
1759
+ "x റേറ്റുചെയ്തു",
1760
+ "xxx",
1761
+ ],
1762
+ "mr": english_badwords
1763
+ + [
1764
+ "गर्भपात",
1765
+ "गुदा",
1766
+ "गाढव",
1767
+ "गांडुळ",
1768
+ "asses",
1769
+ "asshole",
1770
+ "assholes",
1771
+ "ballbag",
1772
+ "चेंडू",
1773
+ "बॅस्टर्ड",
1774
+ "बेलेंड",
1775
+ "बेस्टियल",
1776
+ "प्राण्यांबरोबर",
1777
+ "कुत्री",
1778
+ "बिट्स",
1779
+ "खूनी",
1780
+ "blowjob",
1781
+ "बोलोक",
1782
+ "बोब",
1783
+ "स्तन",
1784
+ "बसीटा",
1785
+ "बम",
1786
+ "बट",
1787
+ "कार्पेट मुन्चर",
1788
+ "चिंक",
1789
+ "सिपा",
1790
+ "क्लिटोरिस",
1791
+ "मुर्ख",
1792
+ "मांसाहारी",
1793
+ "कॉक्स",
1794
+ "कॉनन",
1795
+ "बकवास",
1796
+ "सह",
1797
+ "cumshot",
1798
+ "कनिलिंगस",
1799
+ "कांट",
1800
+ "धिक्कार",
1801
+ "डिक",
1802
+ "dildo",
1803
+ "डिल्डो",
1804
+ "डंक",
1805
+ "duche",
1806
+ "डाईक",
1807
+ "उद्गार",
1808
+ "उत्साही",
1809
+ "ejaculates",
1810
+ "उत्सुकता",
1811
+ "स्खलन",
1812
+ "फॅग",
1813
+ "फॅगिंग",
1814
+ "फॅगॉट",
1815
+ "फॅगॉट्स",
1816
+ "फॅनी",
1817
+ "फेलिंग",
1818
+ "फॅलेटीओ",
1819
+ "निकला",
1820
+ "fucked",
1821
+ "गुप्तचर",
1822
+ "fuckers",
1823
+ "fucking",
1824
+ "fuckings",
1825
+ "fucks",
1826
+ "फडगे पॅकर",
1827
+ "देव-शापित",
1828
+ "देव",
1829
+ "नरक",
1830
+ "होरे",
1831
+ "शिंग",
1832
+ "झटका बंद",
1833
+ "कॉक",
1834
+ "लॅबिया",
1835
+ "वासना",
1836
+ "मासोचिस्ट",
1837
+ "हस्तमैथुन करा",
1838
+ "आई माकड",
1839
+ "नाझी",
1840
+ "निगर",
1841
+ "निगार",
1842
+ "ऑर्गॅसिम",
1843
+ "संभोग",
1844
+ "orgasms",
1845
+ "चापटी",
1846
+ "पुरुषाचे जननेंद्रिय",
1847
+ "पेशी",
1848
+ "pissed",
1849
+ "पिसर",
1850
+ "pisses",
1851
+ "पिसिंग",
1852
+ "पिसोफ",
1853
+ "घाट",
1854
+ "अश्लील",
1855
+ "पोर्नोग्राफी",
1856
+ "मुरुम",
1857
+ "प्रिक्स",
1858
+ "प्यूब",
1859
+ "pussies",
1860
+ "मांजर",
1861
+ "बलात्कार",
1862
+ "गुदाशय",
1863
+ "मंद",
1864
+ "rimming",
1865
+ "दुःखी",
1866
+ "screwing",
1867
+ "स्क्रोटम",
1868
+ "वीर्य",
1869
+ "लिंग",
1870
+ "शेग",
1871
+ "shagging",
1872
+ "शेमले",
1873
+ "विचित्र",
1874
+ "shite",
1875
+ "shits",
1876
+ "shitted",
1877
+ "shitting",
1878
+ "shitty",
1879
+ "घाणेरडा",
1880
+ "फट",
1881
+ "sluts",
1882
+ "सुगंध",
1883
+ "स्मट",
1884
+ "छेडछाड",
1885
+ "मुलगा-एक-कुत्री",
1886
+ "spac",
1887
+ "तिरस्कार",
1888
+ "परीक्षक",
1889
+ "शीर्षक",
1890
+ "टिट",
1891
+ "टर्ड",
1892
+ "योनी",
1893
+ "वियाग्रा",
1894
+ "वल्वा",
1895
+ "वांग",
1896
+ "विंक",
1897
+ "वेश्या",
1898
+ "एक्स रेट केले",
1899
+ "xxx",
1900
+ ],
1901
+ "pt": english_badwords
1902
+ + [
1903
+ "aborto",
1904
+ "amador",
1905
+ "anal",
1906
+ "aparafusar",
1907
+ "aranha",
1908
+ "ariano",
1909
+ "arrebatar",
1910
+ "ass-filho da puta",
1911
+ "asses",
1912
+ "balalao",
1913
+ "bastardo",
1914
+ "bate uma",
1915
+ "bellend",
1916
+ "bestial",
1917
+ "bestialidade",
1918
+ "bicha",
1919
+ "bichano",
1920
+ "bichanos",
1921
+ "bichas",
1922
+ "biscate",
1923
+ "bissexual",
1924
+ "boceta",
1925
+ "bolas",
1926
+ "bollok",
1927
+ "boob",
1928
+ "boquete",
1929
+ "bosta",
1930
+ "braulio de borracha",
1931
+ "buceta",
1932
+ "bumbum",
1933
+ "bunda",
1934
+ "burro",
1935
+ "cabrao",
1936
+ "cacete",
1937
+ "cadela",
1938
+ "cadelas",
1939
+ "cagando",
1940
+ "cagar",
1941
+ "calçado",
1942
+ "camisinha",
1943
+ "caralho",
1944
+ "cerveja",
1945
+ "chochota",
1946
+ "chupar",
1947
+ "cipa",
1948
+ "clitoris",
1949
+ "clitóris",
1950
+ "cobiçoso",
1951
+ "cocaína",
1952
+ "cocô",
1953
+ "coito",
1954
+ "colhoes",
1955
+ "com tesão",
1956
+ "comedor de tapetes",
1957
+ "comer",
1958
+ "cona",
1959
+ "consolo",
1960
+ "coon",
1961
+ "coragem",
1962
+ "corno",
1963
+ "cu",
1964
+ "cunillingus",
1965
+ "dar o rabo",
1966
+ "desgraçado",
1967
+ "dildo",
1968
+ "dildos",
1969
+ "dink",
1970
+ "dog-filho da puta",
1971
+ "droga",
1972
+ "duche",
1973
+ "dum raio",
1974
+ "ejacula",
1975
+ "ejaculado",
1976
+ "ejacular",
1977
+ "ejaculação",
1978
+ "empacotador de fudge",
1979
+ "escroto",
1980
+ "esporra",
1981
+ "estuprador",
1982
+ "estupro",
1983
+ "fagging",
1984
+ "fanny",
1985
+ "fecal",
1986
+ "felação",
1987
+ "felching",
1988
+ "fenda",
1989
+ "filho da puta",
1990
+ "filhos da puta",
1991
+ "foda",
1992
+ "foda-se",
1993
+ "fode",
1994
+ "foder",
1995
+ "fodido",
1996
+ "frango assado",
1997
+ "galo",
1998
+ "galos",
1999
+ "gozada",
2000
+ "gozar",
2001
+ "grelho",
2002
+ "heroína",
2003
+ "homem gay",
2004
+ "homoerótico",
2005
+ "homosexual",
2006
+ "hore",
2007
+ "idiota",
2008
+ "idiotas",
2009
+ "inferno",
2010
+ "kock",
2011
+ "lolita",
2012
+ "luxúria",
2013
+ "lábios",
2014
+ "lésbica",
2015
+ "maldito",
2016
+ "mama",
2017
+ "masoquista",
2018
+ "masturbar",
2019
+ "merda",
2020
+ "merdas",
2021
+ "mesa",
2022
+ "mijando",
2023
+ "mijar",
2024
+ "nazista",
2025
+ "negro",
2026
+ "niggers",
2027
+ "não me chateies",
2028
+ "orgasim",
2029
+ "orgasmo",
2030
+ "orgasmos",
2031
+ "otário",
2032
+ "paneleiro",
2033
+ "passar um cheque",
2034
+ "pau",
2035
+ "peidar",
2036
+ "peitos",
2037
+ "peituda",
2038
+ "pica",
2039
+ "picadas",
2040
+ "pinto",
2041
+ "pisser",
2042
+ "porcaria",
2043
+ "porno",
2044
+ "pornografia",
2045
+ "pornô",
2046
+ "porra",
2047
+ "prostituta",
2048
+ "pube",
2049
+ "punheta",
2050
+ "puta",
2051
+ "puta que pariu",
2052
+ "puta que te pariu",
2053
+ "putaria",
2054
+ "puto",
2055
+ "pênis",
2056
+ "queca",
2057
+ "retardar",
2058
+ "reto",
2059
+ "rimming",
2060
+ "sacanagem",
2061
+ "saco",
2062
+ "saco de bola",
2063
+ "sangrento",
2064
+ "sapatona",
2065
+ "sexo",
2066
+ "shite",
2067
+ "skank",
2068
+ "smegma",
2069
+ "spac",
2070
+ "sujeira",
2071
+ "sádico",
2072
+ "sêmen",
2073
+ "testículo",
2074
+ "tetas",
2075
+ "titt",
2076
+ "torneira",
2077
+ "transando",
2078
+ "transar",
2079
+ "transsexual",
2080
+ "trepada",
2081
+ "vadia",
2082
+ "vadias",
2083
+ "vagabunda",
2084
+ "vagabundo",
2085
+ "vagina",
2086
+ "vai tomar no cu",
2087
+ "vai-te foder",
2088
+ "veado",
2089
+ "viagra",
2090
+ "vibrador",
2091
+ "vulva",
2092
+ "wang",
2093
+ "x avaliado",
2094
+ "xana",
2095
+ "xixi",
2096
+ "xochota",
2097
+ "xxx",
2098
+ "ânus",
2099
+ ],
2100
+ "te": english_badwords
2101
+ + [
2102
+ "గర్భస్రావం",
2103
+ "అంగ",
2104
+ "పాయువు",
2105
+ "గాడిద",
2106
+ "గాడిద-fucker",
2107
+ "asses",
2108
+ "assholes",
2109
+ "బాల్బ్యాగ్",
2110
+ "బంతుల్లో",
2111
+ "బాస్టర్డ్",
2112
+ "బెల్లెండ్",
2113
+ "మృగ",
2114
+ "బెస్టియాలిటీ",
2115
+ "బిచ్",
2116
+ "bitches",
2117
+ "బిట్చింగ్",
2118
+ "బ్లడీ",
2119
+ "blowjob",
2120
+ "బోల్లక",
2121
+ "బూబ్",
2122
+ "వక్షోజాలను",
2123
+ "ఛాతీ",
2124
+ "buceta",
2125
+ "బం",
2126
+ "బట్",
2127
+ "కార్పెట్ ముంచర్",
2128
+ "చింక్",
2129
+ "cipa",
2130
+ "స్త్రీగుహ్యాంకురము",
2131
+ "ఆత్మవిశ్వాసం",
2132
+ "కాక్-సక్కర్",
2133
+ "కాక్స్",
2134
+ "కూన్",
2135
+ "చెత్త",
2136
+ "కం",
2137
+ "cumshot",
2138
+ "క్యునిల్లింగస్",
2139
+ "కంట్",
2140
+ "తిట్టు",
2141
+ "డిక్",
2142
+ "లైంగిక సంతృప్తి కోసం స్త్రీలు ఉపయోగించే పురుషాంగము వంటి పరికరము",
2143
+ "డిల్డోస్",
2144
+ "dink",
2145
+ "కుక్క-fucker",
2146
+ "డూష్",
2147
+ "డైక్",
2148
+ "స్ఖలించు",
2149
+ "ఎజాక్యులేటెడ్",
2150
+ "ఎజాక్యులేట్స్",
2151
+ "ఎరాక్యులేటింగ్",
2152
+ "స్ఖలనం",
2153
+ "నవుకరు",
2154
+ "ఫాగ్గింగ్",
2155
+ "ఫాగాట్",
2156
+ "ఫగాట్స్",
2157
+ "fanny",
2158
+ "ఫెల్చింగ్",
2159
+ "కుడుచుట",
2160
+ "అచ్చు",
2161
+ "ఫక్",
2162
+ "ఇబ్బంది పెట్టాడు",
2163
+ "fucker",
2164
+ "ఫకర్స్",
2165
+ "ఫకింగ్",
2166
+ "ఫకింగ్స్",
2167
+ "ఫక్స్",
2168
+ "ఫడ్జ్ ప్యాకర్",
2169
+ "దేవతలా మంచిది",
2170
+ "గాడ్డామ్",
2171
+ "నరకం",
2172
+ "హోర్",
2173
+ "horny",
2174
+ "జెర్క్-ఆఫ్",
2175
+ "కాక్",
2176
+ "పెదవి",
2177
+ "కామం",
2178
+ "మనసు పడ్డట్లు చిత్రించారు",
2179
+ "masochist",
2180
+ "హస్తప్రయోగం",
2181
+ "తల్లి ఫెకర్",
2182
+ "నాజీ",
2183
+ "నిగ్గర్",
2184
+ "నిగ్గర్స్",
2185
+ "ఆర్గాసిమ్",
2186
+ "స్కలనం",
2187
+ "orgasms",
2188
+ "pecker",
2189
+ "పురుషాంగం",
2190
+ "విసర్జన",
2191
+ "pissed",
2192
+ "పిస్సర్",
2193
+ "పిస్సీస్",
2194
+ "పిస్సింగ్",
2195
+ "పిస్సాఫ్",
2196
+ "poop",
2197
+ "శృంగార",
2198
+ "పోర్నో",
2199
+ "అశ్లీల",
2200
+ "బుడతడు",
2201
+ "ప్రిక్స్",
2202
+ "ప్యూబ్",
2203
+ "pussies",
2204
+ "పుస్సీ",
2205
+ "రేప్",
2206
+ "ఉన్నప్పటికీ బలాత్కారం",
2207
+ "పురీషనాళం",
2208
+ "రిటార్డ్",
2209
+ "రిమ్మింగ్",
2210
+ "పీడన కాముకత",
2211
+ "screwing",
2212
+ "స్క్రోటమ్",
2213
+ "వీర్యం",
2214
+ "సెక్స్",
2215
+ "బొచ్చు",
2216
+ "షగ్గింగ్",
2217
+ "షీమేల్",
2218
+ "ఒంటి",
2219
+ "షైట్",
2220
+ "షిట్స్",
2221
+ "షిట్టెడ్",
2222
+ "షిట్టింగ్",
2223
+ "shitty",
2224
+ "స్కాన్క్",
2225
+ "నీతి",
2226
+ "స్లట్స్",
2227
+ "శిశ్న",
2228
+ "స్మట్",
2229
+ "స్నాచ్",
2230
+ "ఒక బిచ్ కుమారుడు ఆఫ్",
2231
+ "spac",
2232
+ "స్పంక్",
2233
+ "వృషణాలు",
2234
+ "తునక",
2235
+ "టిట్స్",
2236
+ "టిట్",
2237
+ "turd",
2238
+ "యోని",
2239
+ "వయాగ్రా",
2240
+ "జననాంగం",
2241
+ "వాంగ్",
2242
+ "వ్యాంక్",
2243
+ "వేశ్య",
2244
+ "x రేట్",
2245
+ "xxx",
2246
+ ],
2247
+ "vi": english_badwords
2248
+ + [
2249
+ "sự phá thai",
2250
+ "hậu môn",
2251
+ "mông",
2252
+ "đồ ngu",
2253
+ "lừa",
2254
+ "lỗ đít",
2255
+ "túi bóng",
2256
+ "những quả bóng",
2257
+ "đồ khốn",
2258
+ "tuyệt vời",
2259
+ "mục sư",
2260
+ "lòng tốt",
2261
+ "chó cái",
2262
+ "dính máu",
2263
+ "công việc thổi",
2264
+ "bollok",
2265
+ "boob",
2266
+ "ngực",
2267
+ "buceta",
2268
+ "ăn mày",
2269
+ "thảm muncher",
2270
+ "sứt mẻ",
2271
+ "cipa",
2272
+ "âm vật",
2273
+ "gà",
2274
+ "gà hút",
2275
+ "gà trống",
2276
+ "coon",
2277
+ "tào lao",
2278
+ "kiêm",
2279
+ "cum",
2280
+ "cunillingus",
2281
+ "lồn",
2282
+ "chỉ trích",
2283
+ "tinh ranh",
2284
+ "dương vật giả",
2285
+ "dink",
2286
+ "chó-chó",
2287
+ "duche",
2288
+ "đê",
2289
+ "xuất tinh",
2290
+ "fag",
2291
+ "đóng băng",
2292
+ "fagot",
2293
+ "đồ ăn vặt",
2294
+ "người hâm mộ",
2295
+ "nỉ",
2296
+ "thất bại",
2297
+ "mặt bích",
2298
+ "chết tiệt",
2299
+ "quái",
2300
+ "đụ",
2301
+ "ôm",
2302
+ "đóng gói fudge",
2303
+ "địa ngục",
2304
+ "có",
2305
+ "sừng",
2306
+ "giật",
2307
+ "kock",
2308
+ "môi âm",
2309
+ "ham muốn",
2310
+ "khổ dâm",
2311
+ "thủ dâm",
2312
+ "mẹ kiếp",
2313
+ "nazi",
2314
+ "người da đen",
2315
+ "người mách nước",
2316
+ "cực khoái",
2317
+ "người mổ",
2318
+ "dương vật",
2319
+ "đi tiểu",
2320
+ "bực mình",
2321
+ "đái",
2322
+ "phân",
2323
+ "khiêu dâm",
2324
+ "nội dung khiêu dâm",
2325
+ "châm",
2326
+ "chích",
2327
+ "pube",
2328
+ "pussies",
2329
+ "âm hộ",
2330
+ "hiếp dâm",
2331
+ "trực tràng",
2332
+ "chậm phát triển",
2333
+ "xé",
2334
+ "người tàn bạo",
2335
+ "vặn vít",
2336
+ "bìu",
2337
+ "tinh dịch",
2338
+ "tình dục",
2339
+ "lông",
2340
+ "xáo trộn",
2341
+ "đồng tính",
2342
+ "cứt",
2343
+ "shite",
2344
+ "ván trượt",
2345
+ "đĩ",
2346
+ "quần lót",
2347
+ "smegma",
2348
+ "xì trum",
2349
+ "con trai",
2350
+ "spac",
2351
+ "spunk",
2352
+ "tinh hoàn",
2353
+ "ăn miếng trả miếng",
2354
+ "titt",
2355
+ "cỏ",
2356
+ "âm đạo",
2357
+ "viagra",
2358
+ "âm môn",
2359
+ "wang",
2360
+ "đã ngủ",
2361
+ "con điếm",
2362
+ "x đánh giá",
2363
+ "xxx",
2364
+ ],
2365
+ "zh": english_badwords
2366
+ + [
2367
+ "13.",
2368
+ "13点",
2369
+ "㞗",
2370
+ "三级片",
2371
+ "下三烂",
2372
+ "下贱",
2373
+ "个老子的",
2374
+ "九游",
2375
+ "乳",
2376
+ "乳交",
2377
+ "乳头",
2378
+ "乳房",
2379
+ "乳波臀浪",
2380
+ "交配",
2381
+ "仆街",
2382
+ "仆街",
2383
+ "他奶奶",
2384
+ "他奶奶的",
2385
+ "他奶娘的",
2386
+ "他妈",
2387
+ "他妈ㄉ王八蛋",
2388
+ "他妈地",
2389
+ "他妈的",
2390
+ "他娘",
2391
+ "他马的",
2392
+ "你个傻比",
2393
+ "你他马的",
2394
+ "你全家",
2395
+ "你奶奶的",
2396
+ "你她马的",
2397
+ "你妈",
2398
+ "你妈的",
2399
+ "你娘",
2400
+ "你娘卡好",
2401
+ "你娘咧",
2402
+ "你它妈的",
2403
+ "你它马的",
2404
+ "你是鸡",
2405
+ "你是鸭",
2406
+ "你老味",
2407
+ "你老母",
2408
+ "你老闆",
2409
+ "你马的",
2410
+ "做爱",
2411
+ "傻比",
2412
+ "傻逼",
2413
+ "册那",
2414
+ "冚家拎",
2415
+ "冚家鏟",
2416
+ "军妓",
2417
+ "几八",
2418
+ "几叭",
2419
+ "几巴",
2420
+ "几芭",
2421
+ "刚度",
2422
+ "刚瘪三",
2423
+ "包皮",
2424
+ "十三点",
2425
+ "卖B",
2426
+ "卖比",
2427
+ "卖淫",
2428
+ "卵",
2429
+ "卵子",
2430
+ "双峰微颤",
2431
+ "口交",
2432
+ "口肯",
2433
+ "叫床",
2434
+ "吃屎",
2435
+ "后庭",
2436
+ "吹箫",
2437
+ "咸家伶",
2438
+ "咸家鏟",
2439
+ "塞你公",
2440
+ "塞你娘",
2441
+ "塞你母",
2442
+ "塞你爸",
2443
+ "塞你老师",
2444
+ "塞你老母",
2445
+ "处女",
2446
+ "外阴",
2447
+ "大卵子",
2448
+ "大卵泡",
2449
+ "大鸡巴",
2450
+ "奶",
2451
+ "奶奶的熊",
2452
+ "奶子",
2453
+ "奸",
2454
+ "奸你",
2455
+ "她妈地",
2456
+ "她妈的",
2457
+ "她马的",
2458
+ "妈B",
2459
+ "妈个B",
2460
+ "妈个比",
2461
+ "妈个老比",
2462
+ "妈妈的",
2463
+ "妈比",
2464
+ "妈的",
2465
+ "妈的B",
2466
+ "妈逼",
2467
+ "妓",
2468
+ "妓女",
2469
+ "妓院",
2470
+ "妳她妈的",
2471
+ "妳妈的",
2472
+ "妳娘的",
2473
+ "妳老母的",
2474
+ "妳马的",
2475
+ "姘头",
2476
+ "姣西",
2477
+ "姦",
2478
+ "娘个比",
2479
+ "娘的",
2480
+ "婊子",
2481
+ "婊子养的",
2482
+ "嫖娼",
2483
+ "嫖客",
2484
+ "它妈地",
2485
+ "它妈的",
2486
+ "密洞",
2487
+ "射你",
2488
+ "射精",
2489
+ "小乳头",
2490
+ "小卵子",
2491
+ "小卵泡",
2492
+ "小瘪三",
2493
+ "小肉粒",
2494
+ "小骚比",
2495
+ "小骚货",
2496
+ "小鸡巴",
2497
+ "小鸡鸡",
2498
+ "尻",
2499
+ "屁眼",
2500
+ "屁股",
2501
+ "屄",
2502
+ "屌",
2503
+ "屎忽",
2504
+ "巨乳",
2505
+ "干x娘",
2506
+ "干七八",
2507
+ "干你",
2508
+ "干你妈",
2509
+ "干你娘",
2510
+ "干你老母",
2511
+ "干你良",
2512
+ "干妳妈",
2513
+ "干妳娘",
2514
+ "干妳老母",
2515
+ "干妳马",
2516
+ "干您娘",
2517
+ "干机掰",
2518
+ "干死CS",
2519
+ "干死GM",
2520
+ "干死你",
2521
+ "干死客服",
2522
+ "幹",
2523
+ "强奸",
2524
+ "强奸你",
2525
+ "性",
2526
+ "性交",
2527
+ "性器",
2528
+ "性无能",
2529
+ "性爱",
2530
+ "情色",
2531
+ "想上你",
2532
+ "懆您妈",
2533
+ "懆您娘",
2534
+ "懒8",
2535
+ "懒八",
2536
+ "懒叫",
2537
+ "懒教",
2538
+ "成人",
2539
+ "我操你祖宗十八代",
2540
+ "扒光",
2541
+ "打炮",
2542
+ "打飞机",
2543
+ "抽插",
2544
+ "招妓",
2545
+ "插你",
2546
+ "插死你",
2547
+ "撒尿",
2548
+ "撚",
2549
+ "操你",
2550
+ "操你全家",
2551
+ "操你奶奶",
2552
+ "操你妈",
2553
+ "操你娘",
2554
+ "操你祖宗",
2555
+ "操你老妈",
2556
+ "操你老母",
2557
+ "操妳",
2558
+ "操妳全家",
2559
+ "操妳妈",
2560
+ "操妳娘",
2561
+ "操妳祖宗",
2562
+ "操机掰",
2563
+ "操比",
2564
+ "操逼",
2565
+ "放荡",
2566
+ "日他娘",
2567
+ "日你",
2568
+ "日你妈",
2569
+ "日你老娘",
2570
+ "日你老母",
2571
+ "日批",
2572
+ "月经",
2573
+ "机八",
2574
+ "机巴",
2575
+ "机机歪歪",
2576
+ "杂种",
2577
+ "柒",
2578
+ "浪叫",
2579
+ "淫",
2580
+ "淫乱",
2581
+ "淫妇",
2582
+ "淫棍",
2583
+ "淫水",
2584
+ "淫秽",
2585
+ "淫荡",
2586
+ "淫西",
2587
+ "湿透的内裤",
2588
+ "激情",
2589
+ "灨你娘",
2590
+ "烂货",
2591
+ "烂逼",
2592
+ "爛",
2593
+ "狗屁",
2594
+ "狗日",
2595
+ "狗狼养的",
2596
+ "玉杵",
2597
+ "王八蛋",
2598
+ "瓜娃子",
2599
+ "瓜婆娘",
2600
+ "瓜批",
2601
+ "瘪三",
2602
+ "白烂",
2603
+ "白痴",
2604
+ "白癡",
2605
+ "硬膠",
2606
+ "祖宗",
2607
+ "私服",
2608
+ "笨實",
2609
+ "笨蛋",
2610
+ "粉腸",
2611
+ "精子",
2612
+ "老二",
2613
+ "老味",
2614
+ "老母",
2615
+ "老瘪三",
2616
+ "老骚比",
2617
+ "老骚货",
2618
+ "肉壁",
2619
+ "肉棍子",
2620
+ "肉棒",
2621
+ "肉缝",
2622
+ "肏",
2623
+ "肛交",
2624
+ "肥西",
2625
+ "色情",
2626
+ "花柳",
2627
+ "荡妇",
2628
+ "賤",
2629
+ "贝肉",
2630
+ "贱B",
2631
+ "贱人",
2632
+ "贱货",
2633
+ "贼你妈",
2634
+ "赛你老母",
2635
+ "赛妳阿母",
2636
+ "赣您娘",
2637
+ "躝癱",
2638
+ "轮奸",
2639
+ "迷药",
2640
+ "逼",
2641
+ "逼样",
2642
+ "野鸡",
2643
+ "閪",
2644
+ "阳具",
2645
+ "阳萎",
2646
+ "阴唇",
2647
+ "阴户",
2648
+ "阴核",
2649
+ "阴毛",
2650
+ "阴茎",
2651
+ "阴道",
2652
+ "阴部",
2653
+ "陰莖",
2654
+ "雞巴",
2655
+ "靠北",
2656
+ "靠母",
2657
+ "靠爸",
2658
+ "靠背",
2659
+ "靠腰",
2660
+ "驶你公",
2661
+ "驶你娘",
2662
+ "驶你母",
2663
+ "驶你爸",
2664
+ "驶你老师",
2665
+ "驶你老母",
2666
+ "骚比",
2667
+ "骚货",
2668
+ "骚逼",
2669
+ "鬼公",
2670
+ "鳩",
2671
+ "鸡8",
2672
+ "鸡八",
2673
+ "鸡叭",
2674
+ "鸡吧",
2675
+ "鸡奸",
2676
+ "鸡巴",
2677
+ "鸡芭",
2678
+ "鸡鸡",
2679
+ "龟儿子",
2680
+ "龟头",
2681
+ ],
2682
+ }
en.arpa.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e90c9b25af01dcaa2667ed45d012d891269760fc6eccfe8dbbd161eb20e01d7d
3
+ size 4403509656
en.sp.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:262c0b0bd4ebc592e439453bc7e006d0ed12d1914e206a1fb8c7fba091f52c4d
3
+ size 1389058
filtering.py ADDED
@@ -0,0 +1,879 @@
1
+ import re
2
+
3
+ import numpy as np
4
+
5
+ import fasttext
6
+
7
+ import sentencepiece
8
+ import kenlm
9
+
10
+ import pathlib
11
+
12
+ from languages_id import langs_id
13
+ from parameters_filtering import parameters_filtering
14
+ from normalization import normalization
15
+ from stopwords import stopwords
16
+ from badwords import badwords
17
+
18
+
19
+ class LoadParameters:
20
+ @staticmethod
21
+ def load_parameters(lang_dataset_id):
22
+ if lang_dataset_id in parameters_filtering:
23
+ param = parameters_filtering[lang_dataset_id]
24
+ else:
25
+ param = parameters_filtering["default"]
26
+ return param
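The fallback in `load_parameters` can be sketched on its own as follows. This is a minimal illustration, assuming `parameters_filtering` is a plain dict with a `"default"` entry; the keys and values shown here are hypothetical and are not taken from `parameters_filtering.py`.

```python
# Hypothetical stand-in for the real parameters_filtering mapping.
parameters_filtering = {
    "default": {"length_word_max_cutoff": 50},
    "en": {"length_word_max_cutoff": 25},
}


def load_parameters(lang_dataset_id):
    # Use the language-specific entry when present, else the "default" entry.
    return parameters_filtering.get(
        lang_dataset_id, parameters_filtering["default"]
    )


print(load_parameters("en"))
print(load_parameters("unknown_language"))
```

`dict.get` with a default gives the same result as the explicit `if`/`else` in the diff, as long as the `"default"` key exists.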
27
+
28
+ @staticmethod
29
+ def load_stopwords(lang_dataset_id):
30
+ stopwords_lang_id = langs_id.loc[
31
+ langs_id["dataset_id"] == lang_dataset_id, "stopwords_id"
32
+ ].iloc[0]
33
+ if stopwords_lang_id:
34
+ stopwords_lang = set(stopwords[stopwords_lang_id])
35
+ else:
36
+ stopwords_lang = None
37
+ return stopwords_lang
38
+
39
+ @staticmethod
40
+ def load_badwords(lang_dataset_id):
41
+ badwords_lang_id = langs_id.loc[
42
+ langs_id["dataset_id"] == lang_dataset_id, "badwords_id"
43
+ ].iloc[0]
44
+ if badwords_lang_id:
45
+ badwords_lang = set(badwords[badwords_lang_id])
46
+ else:
47
+ badwords_lang = None
48
+ return badwords_lang
49
+
50
+ @staticmethod
51
+ def load_model_lang_id(lang_dataset_id, path_fasttext_model):
52
+ fasttext_lang_id = langs_id.loc[
53
+ langs_id["dataset_id"] == lang_dataset_id, "fasttext_id"
54
+ ].iloc[0]
55
+ if fasttext_lang_id:
56
+ model_lang_id = fasttext.load_model(path_fasttext_model)
57
+ else:
58
+ model_lang_id = None
59
+ return model_lang_id
60
+
61
+ @staticmethod
62
+ def load_sentencepiece_model(lang_dataset_id, path_sentencepiece_model):
63
+ sentencepiece_lang_id = langs_id.loc[
64
+ langs_id["dataset_id"] == lang_dataset_id, "sentencepiece_id"
65
+ ].iloc[0]
66
+ if sentencepiece_lang_id:
67
+ sentencepiece_model = sentencepiece.SentencePieceProcessor()
68
+ sentencepiece_model.load(path_sentencepiece_model)
69
+ else:
70
+ sentencepiece_model = None
71
+ return sentencepiece_model
72
+
73
+ @staticmethod
74
+ def load_kenlm_model(lang_dataset_id, path_kenlm_model):
75
+ kenlm_lang_id = langs_id.loc[
76
+ langs_id["dataset_id"] == lang_dataset_id, "kenlm_id"
77
+ ].iloc[0]
78
+ if kenlm_lang_id:
79
+ kenlm_model = kenlm.Model(path_kenlm_model)
80
+ else:
81
+ kenlm_model = None
82
+ return kenlm_model
83
+
84
+
85
+ class ModifyingDocuments:
86
+ @staticmethod
87
+ def remove_empty_el_from_list(list_):
88
+ return [el for el in list_ if el]
89
+
90
+ @staticmethod
91
+ def remove_non_printing_characters(document, non_printing_characters_re):
92
+ return non_printing_characters_re.sub("", document)
93
+
94
+ @staticmethod
95
+ def uniform_whitespace(
96
+ document,
97
+ whitespace=[
98
+ " ",
99
+ " ",
100
+ " ",
101
+ " ",
102
+ " ",
103
+ " ",
104
+ " ",
105
+ " ",
106
+ " ",
107
+ " ",
108
+ "",
109
+ "„",
110
+ ],
111
+ ):
112
+ """Replace every whitespace variant listed in `whitespace` with a plain space."""
113
+ whitespace = set(whitespace)
114
+ document = "".join(
115
+ [char if char not in whitespace else " " for char in document]
116
+ )
117
+ return document
118
+
119
+ @staticmethod
120
+ def replace_digits_with_zeros(document, digits_re):
121
+ return digits_re.sub("0", document)
122
+
123
+ @staticmethod
124
+ def replace_unicode_punctuation(document, unicode_punctuation):
125
+ return "".join(unicode_punctuation.get(c, c) for c in document)
126
+
127
+ @staticmethod
128
+ def normalization(
129
+ document,
130
+ remove_non_printing_characters,
131
+ strip,
132
+ lower_case,
133
+ uniform_whitespace,
134
+ replace_digits_with_zeros,
135
+ replace_unicode_punctuation,
136
+ non_printing_characters_re=normalization["non_printing_characters_re"],
137
+ digits_re=normalization["digits_re"],
138
+ unicode_punctuation=normalization["unicode_punctuation"],
139
+ ):
140
+ if remove_non_printing_characters:
141
+ document = ModifyingDocuments.remove_non_printing_characters(
142
+ document, non_printing_characters_re
143
+ )
144
+ if strip:
145
+ document = document.strip()
146
+ if not document:
147
+ return document
148
+ if lower_case:
149
+ document = document.lower()
150
+ if uniform_whitespace:
151
+ document = ModifyingDocuments.uniform_whitespace(document)
152
+ if replace_digits_with_zeros:
153
+ document = ModifyingDocuments.replace_digits_with_zeros(document, digits_re)
154
+ if replace_unicode_punctuation:
155
+ document = ModifyingDocuments.replace_unicode_punctuation(
156
+ document, unicode_punctuation
157
+ )
158
+ return document
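Two of the normalization steps above can be tried in isolation. The sketch below is only illustrative: the real `digits_re` and `unicode_punctuation` live in `normalization.py`, which is not shown in this diff, so the pattern and the mapping used here are assumptions.

```python
import re

# Assumed stand-ins for normalization["digits_re"] and
# normalization["unicode_punctuation"]; the real values are not shown here.
digits_re = re.compile(r"\d")
unicode_punctuation = {"，": ",", "。": ".", "“": '"', "”": '"'}


def replace_digits_with_zeros(document, digits_re):
    # Every digit becomes "0", so numbers keep their length but lose their value.
    return digits_re.sub("0", document)


def replace_unicode_punctuation(document, unicode_punctuation):
    # Map each character through the table, leaving unknown characters as-is.
    return "".join(unicode_punctuation.get(c, c) for c in document)


print(replace_digits_with_zeros("call 123-45", digits_re))
print(replace_unicode_punctuation("你好，世界。", unicode_punctuation))
```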
159
+
160
+ @staticmethod
161
+ def tokenization(document, sentencepiece_model, join_on_whitespace):
162
+ document_tokenized = sentencepiece_model.encode_as_pieces(document)
163
+ if join_on_whitespace:
164
+ document_tokenized = " ".join(document_tokenized)
165
+ return document_tokenized
166
+
167
+ @staticmethod
168
+ def split_on_whitespace(
169
+ document,
170
+ new_line=False,
171
+ tab=False,
172
+ ):
173
+ """This method also removes concatenated spaces."""
174
+ sep = [" "] + new_line * ["\n"] + tab * ["\t"]
175
+ sep = "|".join(sep)
176
+ split_document = re.split(sep, document)
177
+ split_document = ModifyingDocuments.remove_empty_el_from_list(split_document)
178
+ return split_document
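As a standalone sketch of the splitting logic above (re-declared outside the class for illustration, with a hypothetical sample string), the function can be exercised like this:

```python
import re


def split_on_whitespace(document, new_line=False, tab=False):
    # Multiplying a list by a bool keeps it (True) or empties it (False),
    # so the separator pattern only contains the enabled characters.
    sep = "|".join([" "] + new_line * ["\n"] + tab * ["\t"])
    # re.split yields empty strings between consecutive separators;
    # dropping them is what removes concatenated spaces.
    return [el for el in re.split(sep, document) if el]


sample = "a  b\tc\nd"
print(split_on_whitespace(sample))
print(split_on_whitespace(sample, new_line=True, tab=True))
```

With the default arguments only spaces act as separators, so tabs and newlines stay inside the returned pieces.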
179
+
180
+ @staticmethod
181
+ def strip(document, strip_characters):
182
+ """Way faster than document.strip(strip_characters)
183
+ since strip_characters is now a set instead of a str,
184
+ and it contains a lot of elements (all the emojis)."""
185
+ if not document:
186
+ return document
187
+ beg_ind = 0
188
+ end_ind = len(document)
189
+ for i in range(len(document)):
190
+ if document[i] in strip_characters:
191
+ beg_ind += 1
192
+ else:
193
+ break
194
+ for i in range(1, len(document) + 1):
195
+ if document[-i] in strip_characters:
196
+ end_ind -= 1
197
+ else:
198
+ break
199
+ document_stripped = document[beg_ind:end_ind]
200
+ return document_stripped
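The set-based stripping above can also be written with two `while` loops. The version below is an equivalent rewrite for illustration, not the code from this commit; the sample inputs are hypothetical.

```python
def strip(document, strip_characters):
    """Trim leading and trailing characters that belong to a set."""
    if not document:
        return document
    beg_ind = 0
    end_ind = len(document)
    # Advance past leading characters found in the set.
    while beg_ind < end_ind and document[beg_ind] in strip_characters:
        beg_ind += 1
    # Retreat past trailing characters found in the set.
    while end_ind > beg_ind and document[end_ind - 1] in strip_characters:
        end_ind -= 1
    return document[beg_ind:end_ind]


print(strip("!!hello??", {"!", "?"}))
print(repr(strip("!!!", {"!", "?"})))
```

Membership tests against a `set` take constant time on average, which is the reason the docstring gives for avoiding `str.strip` with a very long character string.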
201
+
202
+ @staticmethod
203
+ def get_words_from_document(
204
+ document, sentencepiece_model_tok, lower_case, strip_characters
205
+ ):
206
+ """Get words from a document. Non reversible since the document
207
+ is split on multiple characters, words are stripped of
208
+ special characters and characters are converted to lower case.
209
+ Useful to compute ratios, like the stopwords ratio."""
210
+ if sentencepiece_model_tok:
211
+ document_normalized = ModifyingDocuments.normalization(
212
+ document=document,
213
+ remove_non_printing_characters=True,
214
+ strip=True,
215
+ lower_case=True,
216
+ uniform_whitespace=True,
217
+ replace_digits_with_zeros=True,
218
+ replace_unicode_punctuation=True,
219
+ )
220
+ words = ModifyingDocuments.tokenization(
221
+ document_normalized, sentencepiece_model_tok, join_on_whitespace=False
222
+ )
223
+ else:
224
+ words = ModifyingDocuments.split_on_whitespace(
225
+ document, new_line=True, tab=True
226
+ )
227
+ if lower_case:
228
+ words = [word.lower() for word in words]
229
+ if strip_characters:
230
+ words = [ModifyingDocuments.strip(word, strip_characters) for word in words]
231
+ words = ModifyingDocuments.remove_empty_el_from_list(words)
232
+ return words
233
+
234
+ @staticmethod
235
+ def words_augmentation(words, group_size, join_char):
236
+ """Augment words, especially for Chinese (without a space between words)
237
+ and Vietnamese (with a space between syllables)."""
238
+ augmentation = [
239
+ join_char.join(words[i : i + group_size])
240
+ for i in range(len(words) - group_size + 1)
241
+ ]
242
+ return augmentation
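A short sketch of the sliding-window augmentation above, using a hypothetical list of Vietnamese syllables. Joining neighbouring tokens lets multi-token entries of a word list be matched against the document.

```python
def words_augmentation(words, group_size, join_char):
    # Slide a window of `group_size` over the tokens and join each window.
    return [
        join_char.join(words[i : i + group_size])
        for i in range(len(words) - group_size + 1)
    ]


syllables = ["xin", "chào", "bạn"]
print(words_augmentation(syllables, 2, " "))
print(words_augmentation(syllables, 3, ""))
```

When `group_size` exceeds the number of tokens, the range is empty and the function returns an empty list.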
243
+
244
+ @staticmethod
245
+ def split_on_newline_tab_whitespace(document):
246
+ """First split on "\n", then on "\t", then on " "."""
247
+ sentences = document.split("\n")
248
+ sentences = [sentence.split("\t") for sentence in sentences]
249
+ sentences = [
250
+ [
251
+ ModifyingDocuments.split_on_whitespace(subsentence)
252
+ for subsentence in sentence
253
+ ]
254
+ for sentence in sentences
255
+ ]
256
+ return sentences
257
+
258
+ @staticmethod
259
+ def merge_on_whitespace_tab_newline(sentences):
260
+ """Invert the method split_on_newline_tab_whitespace.
261
+ Removes concatenated separators."""
262
+ sentences = [
263
+ [" ".join(subsentence) for subsentence in sentence if subsentence]
264
+ for sentence in sentences
265
+ ]
266
+ sentences = ["\t".join(sentence) for sentence in sentences if sentence]
267
+ if not sentences:
268
+ return ""
269
+ document = "\n".join(sentences)
270
+ return document
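The split and merge helpers above form a round trip that drops repeated separators. The following sketch re-declares simplified versions outside the class (an assumption made for brevity) and runs them on a hypothetical document.

```python
import re


def split_on_whitespace(document):
    return [el for el in re.split(" ", document) if el]


def split_on_newline_tab_whitespace(document):
    # Three nesting levels: lines, tab-separated fields, words.
    return [
        [split_on_whitespace(sub) for sub in sentence.split("\t")]
        for sentence in document.split("\n")
    ]


def merge_on_whitespace_tab_newline(sentences):
    # Empty word lists and empty lines are skipped while joining.
    sentences = [
        [" ".join(sub) for sub in sentence if sub] for sentence in sentences
    ]
    sentences = ["\t".join(sentence) for sentence in sentences if sentence]
    return "\n".join(sentences)


doc = "one  two\tthree\n\nfour"
nested = split_on_newline_tab_whitespace(doc)
print(nested)
print(repr(merge_on_whitespace_tab_newline(nested)))
```

The merged result is not always identical to the input: the double space and the empty line in `doc` do not survive the round trip.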
271
+
272
+ @staticmethod
273
+ def should_keep_word_with_incorrect_substrings(
274
+ word, strip_characters, incorrect_word_substrings
275
+ ):
276
+ word = ModifyingDocuments.strip(word, strip_characters)
277
+ should_keep = all(
278
+ [(i_substr not in word) for i_substr in incorrect_word_substrings]
279
+ )
280
+ return should_keep
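The substring check above can be sketched as a standalone function. The substrings and strip characters below are hypothetical examples, and `str.strip` is used in place of `ModifyingDocuments.strip` to keep the sketch short.

```python
def should_keep_word(word, strip_characters, incorrect_word_substrings):
    # Strip surrounding special characters, then reject the word if any
    # flagged substring occurs anywhere inside it.
    word = word.strip("".join(strip_characters))
    return all(substr not in word for substr in incorrect_word_substrings)


flagged = ["http", "www", ".com"]  # hypothetical list
print(should_keep_word("(www.example.com)", {"(", ")"}, flagged))
print(should_keep_word("hello,", {","}, flagged))
```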
281
+
282
+ @staticmethod
283
+ def remove_words_with_incorrect_substrings(
284
+ document,
285
+ strip_characters,
286
+ incorrect_word_substrings,
287
+ ):
288
+ sentences = ModifyingDocuments.split_on_newline_tab_whitespace(document)
289
+ sentences = [
290
+ [
291
+ [
292
+ word
293
+ for word in subsentence
294
+ if ModifyingDocuments.should_keep_word_with_incorrect_substrings(
295
+ word, strip_characters, incorrect_word_substrings
296
+ )
297
+ ]
298
+ for subsentence in sentence
299
+ ]
300
+ for sentence in sentences
301
+ ]
302
+ document = ModifyingDocuments.merge_on_whitespace_tab_newline(sentences)
303
+ return document
304
+
305
+ @staticmethod
306
+ def should_keep_long_word(word, strip_characters, length_word_max_cutoff):
307
+ """Keep a word whose length is within the cutoff. A longer word
308
+ gets a second chance: it is stripped of strip_characters via
309
+ ModifyingDocuments.strip and kept if the stripped form fits. A word
310
+ made only of strip characters is dropped."""
311
+ if len(word) <= length_word_max_cutoff:
312
+ return True
313
+ word = ModifyingDocuments.strip(word, strip_characters)
314
+ if not word: # The word consisted only of strip characters
315
+ return False
316
+ if len(word) <= length_word_max_cutoff:
317
+ return True
318
+ return False
319
+
320
+ @staticmethod
+ def remove_long_words(
321
+ document,
322
+ strip_characters,
323
+ length_word_max_cutoff,
324
+ ):
325
+ sentences = ModifyingDocuments.split_on_newline_tab_whitespace(document)
326
+ sentences = [
327
+ [
328
+ [
329
+ word
330
+ for word in subsentence
331
+ if ModifyingDocuments.should_keep_long_word(
332
+ word,
333
+ strip_characters,
334
+ length_word_max_cutoff,
335
+ )
336
+ ]
337
+ for subsentence in sentence
338
+ ]
339
+ for sentence in sentences
340
+ ]
341
+ document = ModifyingDocuments.merge_on_whitespace_tab_newline(sentences)
342
+ return document
343
+
344
+ @staticmethod
345
+ def modifying_documents(
346
+ document,
347
+ cond_uniform_whitespace,
348
+ cond_replace_unicode_punctuation,
349
+ cond_remove_words_with_incorrect_substrings,
350
+ strip_characters,
351
+ incorrect_word_substrings,
352
+ cond_remove_long_words,
353
+ length_word_max_cutoff,
354
+ ):
355
+ document = ModifyingDocuments.normalization(
356
+ document=document,
357
+ remove_non_printing_characters=False,
358
+ strip=True,
359
+ lower_case=False,
360
+ uniform_whitespace=cond_uniform_whitespace,
361
+ replace_digits_with_zeros=False,
362
+ replace_unicode_punctuation=cond_replace_unicode_punctuation,
363
+ )
364
+ if cond_remove_words_with_incorrect_substrings:
365
+ document = ModifyingDocuments.remove_words_with_incorrect_substrings(
366
+ document,
367
+ strip_characters,
368
+ incorrect_word_substrings,
369
+ )
370
+ if cond_remove_long_words:
371
+ document = ModifyingDocuments.remove_long_words(
372
+ document,
373
+ strip_characters,
374
+ length_word_max_cutoff,
375
+ )
376
+ return document
377
+
378
+
379
+ class FunctionDatasetModifyingDocuments:
380
+ def __init__(self, lang_dataset_id):
381
+ self.lang_dataset_id = lang_dataset_id
382
+ self.param = LoadParameters.load_parameters(lang_dataset_id)
383
+
384
+ def __call__(self, example):
385
+ example["text"] = ModifyingDocuments.modifying_documents(
386
+ document=example["text"],
387
+ cond_uniform_whitespace=self.param["cond_uniform_whitespace"],
388
+ cond_replace_unicode_punctuation=self.param[
389
+ "cond_replace_unicode_punctuation"
390
+ ],
391
+ cond_remove_words_with_incorrect_substrings=self.param[
392
+ "cond_remove_words_with_incorrect_substrings"
393
+ ],
394
+ strip_characters=self.param["strip_characters"],
395
+ incorrect_word_substrings=self.param["incorrect_word_substrings"],
396
+ cond_remove_long_words=self.param["cond_remove_long_words"],
397
+ length_word_max_cutoff=self.param["length_word_max_cutoff"],
398
+ )
399
+ return example
400
+
401
+ def __reduce__(self):
402
+ return (self.__class__, (self.lang_dataset_id,))
403
+
404
+
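Note on the stopword and badword ratios computed in the Filtering class: both count matches among the single words plus the optional "augmented" word groups, divide by the number of single words, and cap the result at 1. The sketch below is a minimal standalone illustration of that idea, not the code from this commit. The name listed_words_ratio and the sample data are hypothetical; words_augmentation follows the list comprehension shown at the top of this hunk.

```python
def words_augmentation(words, group_size, join_char):
    """Join every run of `group_size` consecutive words into one candidate token."""
    return [
        join_char.join(words[i : i + group_size])
        for i in range(len(words) - group_size + 1)
    ]


def listed_words_ratio(words, listed, group_sizes=(), join_char=""):
    """Matches among words and augmented groups, divided by the word count, capped at 1."""
    if not words:
        return 0.0
    candidates = list(words)
    for size in group_sizes:
        candidates += words_augmentation(words, size, join_char)
    ratio = sum(token in listed for token in candidates) / len(words)
    # Augmented groups can push the count above len(words), hence the cap.
    return min(ratio, 1.0)


words = ["new", "york", "is", "big"]
listed = {"new york", "is"}  # hypothetical word list containing a two-word entry
print(listed_words_ratio(words, listed))
print(listed_words_ratio(words, listed, group_sizes=(2,), join_char=" "))
```

Augmentation matters mostly for languages where list entries span several tokens or where words are written without spaces, which is presumably why the join character and group sizes are per-language parameters.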
405
+ class Filtering:
406
+ @staticmethod
407
+ def check_number_words(
408
+ document,
409
+ sentencepiece_model_tok,
410
+ strip_characters,
411
+ number_words_min_cutoff,
412
+ number_words_max_cutoff,
413
+ ):
414
+ words = ModifyingDocuments.get_words_from_document(
415
+ document,
416
+ sentencepiece_model_tok,
417
+ lower_case=False,
418
+ strip_characters=strip_characters,
419
+ )
420
+ cond = (len(words) >= number_words_min_cutoff) and (
421
+ len(words) <= number_words_max_cutoff
422
+ )
423
+ return cond
424
+
425
+ @staticmethod
426
+ def compute_repetitions_ratio(document, repetitions_length):
427
+ def get_freq_ngrams(document, n):
428
+ ngrams = [document[i : i + n] for i in range(len(document) - n + 1)]
429
+ freq_ngrams = {}
430
+ for ngram in ngrams:
431
+ freq_ngrams[ngram] = freq_ngrams.get(ngram, 0) + 1
432
+ return freq_ngrams
433
+
434
+ freq_ngrams = get_freq_ngrams(document, repetitions_length)
435
+ if len(freq_ngrams) == 0:
436
+ return 0
437
+ freq_ngrams = list(freq_ngrams.values())
438
+ freq_ngrams = sorted(freq_ngrams, reverse=True)
439
+ num_rep_ngrams = int(np.sqrt(len(freq_ngrams)))
440
+ repetitions_ratio = sum(freq_ngrams[:num_rep_ngrams]) / sum(freq_ngrams)
441
+ return repetitions_ratio
442
+
443
+ @staticmethod
444
+ def check_repetitions_removal(
445
+ document,
446
+ repetitions_length,
447
+ repetitions_max_cutoff,
448
+ ):
449
+ repetitions_ratio = Filtering.compute_repetitions_ratio(
450
+ document, repetitions_length
451
+ )
452
+ cond = repetitions_ratio <= repetitions_max_cutoff
453
+ return cond
454
+
455
+ @staticmethod
456
+ def compute_special_characters_ratio(document, special_characters):
457
+ if not document:  # Avoid a division by zero on empty documents
+ return 0
+ special_characters_ratio = len(
458
+ [char for char in document if char in special_characters]
459
+ ) / len(document)
460
+ return special_characters_ratio
461
+
462
+ @staticmethod
463
+ def check_special_characters(
464
+ document,
465
+ special_characters,
466
+ special_characters_max_cutoff,
467
+ ):
468
+ special_characters_ratio = Filtering.compute_special_characters_ratio(
469
+ document, special_characters
470
+ )
471
+ cond = special_characters_ratio <= special_characters_max_cutoff
472
+ return cond
473
+
474
+ @staticmethod
475
+ def compute_stopwords_ratio(
476
+ document,
477
+ sentencepiece_model_tok,
478
+ strip_characters,
479
+ cond_words_augmentation,
480
+ words_augmentation_group_sizes,
481
+ words_augmentation_join_char,
482
+ stopwords,
483
+ ):
484
+ words = ModifyingDocuments.get_words_from_document(
485
+ document,
486
+ sentencepiece_model_tok,
487
+ lower_case=True,
488
+ strip_characters=strip_characters,
489
+ )
490
+ if not words:
491
+ return 0
492
+ augmentation = []
493
+ if cond_words_augmentation:
494
+ augmentation = [
495
+ ModifyingDocuments.words_augmentation(
496
+ words, group_size, words_augmentation_join_char
497
+ )
498
+ for group_size in words_augmentation_group_sizes
499
+ ]
500
+ augmentation = [word for augm in augmentation for word in augm]
501
+ stopwords_ratio = len(
502
+ [word for word in words + augmentation if word in stopwords]
503
+ ) / len(words)
504
+ if stopwords_ratio > 1.0:
505
+ stopwords_ratio = 1.0
506
+ return stopwords_ratio
507
+
508
+ @staticmethod
509
+ def check_stopwords(
510
+ document,
511
+ sentencepiece_model_tok,
512
+ strip_characters,
513
+ cond_words_augmentation,
514
+ words_augmentation_group_sizes,
515
+ words_augmentation_join_char,
516
+ stopwords,
517
+ stopwords_min_cutoff,
518
+ ):
519
+ cond = True
520
+ if stopwords:
521
+ stopwords_ratio = Filtering.compute_stopwords_ratio(
522
+ document,
523
+ sentencepiece_model_tok,
524
+ strip_characters,
525
+ cond_words_augmentation,
526
+ words_augmentation_group_sizes,
527
+ words_augmentation_join_char,
528
+ stopwords,
529
+ )
530
+ cond = stopwords_ratio >= stopwords_min_cutoff
531
+ return cond
532
+
533
+ @staticmethod
534
+ def compute_badwords_ratio(
535
+ document,
536
+ sentencepiece_model_tok,
537
+ strip_characters,
538
+ cond_words_augmentation,
539
+ words_augmentation_group_sizes,
540
+ words_augmentation_join_char,
541
+ badwords,
542
+ ):
543
+ words = ModifyingDocuments.get_words_from_document(
544
+ document,
545
+ sentencepiece_model_tok,
546
+ lower_case=True,
547
+ strip_characters=strip_characters,
548
+ )
549
+ if not words:
550
+ return 0
551
+ augmentation = []
552
+ if cond_words_augmentation:
553
+ augmentation = [
554
+ ModifyingDocuments.words_augmentation(
555
+ words, group_size, words_augmentation_join_char
556
+ )
557
+ for group_size in words_augmentation_group_sizes
558
+ ]
559
+ augmentation = [word for augm in augmentation for word in augm]
560
+ badwords_ratio = len(
561
+ [word for word in words + augmentation if word in badwords]
562
+ ) / len(words)
563
+ if badwords_ratio > 1.0:
564
+ badwords_ratio = 1.0
565
+ for word in augmentation:
566
+ if word in badwords:
567
+ print(word)
568
+ return badwords_ratio
569
+
570
+ @staticmethod
571
+ def check_badwords(
572
+ document,
573
+ sentencepiece_model_tok,
574
+ strip_characters,
575
+ cond_words_augmentation,
576
+ words_augmentation_group_sizes,
577
+ words_augmentation_join_char,
578
+ badwords,
579
+ badwords_max_cutoff,
580
+ ):
581
+ cond = True
582
+ if badwords:
583
+ badwords_ratio = Filtering.compute_badwords_ratio(
584
+ document,
585
+ sentencepiece_model_tok,
586
+ strip_characters,
587
+ cond_words_augmentation,
588
+ words_augmentation_group_sizes,
589
+ words_augmentation_join_char,
590
+ badwords,
591
+ )
592
+ cond = badwords_ratio <= badwords_max_cutoff
593
+ return cond
594
+
595
+ @staticmethod
596
+ def compute_lang_id_pred_score(document, model_lang_id):
597
+ document = document.lower().replace("\n", " ")
598
+ pred = model_lang_id.predict(document)
599
+ lang_pred_fasttext_id = pred[0][0].replace("__label__", "")
600
+ score_pred = pred[1][0]
601
+ lang_pred_dataset_id = langs_id.loc[
602
+ langs_id["fasttext_id"] == lang_pred_fasttext_id, "dataset_id"
603
+ ]
604
+ if len(lang_pred_dataset_id) > 0:
605
+ lang_pred_dataset_id = lang_pred_dataset_id.iloc[0]
606
+ else:
607
+ lang_pred_dataset_id = "unknown"
608
+ return lang_pred_dataset_id, score_pred
609
+
610
+ @staticmethod
611
+ def check_lang_id(
612
+ document,
613
+ lang_dataset_id,
614
+ model_lang_id,
615
+ lang_id_min_cutoff,
616
+ ):
617
+ cond = True
618
+ if model_lang_id:
619
+ lang_pred_dataset_id, score_pred = Filtering.compute_lang_id_pred_score(
620
+ document, model_lang_id
621
+ )
622
+ cond = (lang_pred_dataset_id == lang_dataset_id) and (
623
+ score_pred >= lang_id_min_cutoff
624
+ )
625
+ return cond
626
+
627
+ @staticmethod
628
+ def compute_perplexity_score(document, sentencepiece_model, kenlm_model):
629
+ document = ModifyingDocuments.normalization(
630
+ document=document,
631
+ remove_non_printing_characters=True,
632
+ strip=True,
633
+ lower_case=True,
634
+ uniform_whitespace=True,
635
+ replace_digits_with_zeros=True,
636
+ replace_unicode_punctuation=True,
637
+ )
638
+ document = ModifyingDocuments.tokenization(
639
+ document, sentencepiece_model, join_on_whitespace=True
640
+ )
641
+ doc_log_score, doc_length = 0, 0
642
+ for line in document.split("\n"):
643
+ log_score = kenlm_model.score(line)
644
+ length = len(line.split()) + 1
645
+ doc_log_score += log_score
646
+ doc_length += length
647
+ pp_score = 10.0 ** (-doc_log_score / doc_length)
648
+ pp_score = round(pp_score, 1)
649
+ return pp_score
650
+
651
+ @staticmethod
652
+ def check_perplexity(
653
+ document,
654
+ sentencepiece_model,
655
+ kenlm_model,
656
+ perplexity_max_cutoff,
657
+ ):
658
+ cond = True
659
+ if kenlm_model:
660
+ score = Filtering.compute_perplexity_score(
661
+ document, sentencepiece_model, kenlm_model
662
+ )
663
+ cond = score <= perplexity_max_cutoff
664
+ return cond
665
+
666
+ @staticmethod
667
+ def filtering(
668
+ document,
669
+ cond_check_number_words,
670
+ sentencepiece_model_tok,
671
+ strip_characters,
672
+ number_words_min_cutoff,
673
+ number_words_max_cutoff,
674
+ cond_check_repetitions_removal,
675
+ repetitions_length,
676
+ repetitions_max_cutoff,
677
+ cond_check_special_characters,
678
+ special_characters,
679
+ special_characters_max_cutoff,
680
+ cond_words_augmentation,
681
+ words_augmentation_group_sizes,
682
+ words_augmentation_join_char,
683
+ cond_check_stopwords,
684
+ stopwords,
685
+ stopwords_min_cutoff,
686
+ cond_check_badwords,
687
+ badwords,
688
+ badwords_max_cutoff,
689
+ cond_check_lang_id,
690
+ lang_dataset_id,
691
+ model_lang_id,
692
+ lang_id_min_cutoff,
693
+ cond_check_perplexity,
694
+ sentencepiece_model,
695
+ kenlm_model,
696
+ perplexity_max_cutoff,
697
+ ):
698
+ if cond_check_number_words:
699
+ if not Filtering.check_number_words(
700
+ document,
701
+ sentencepiece_model_tok,
702
+ strip_characters,
703
+ number_words_min_cutoff,
704
+ number_words_max_cutoff,
705
+ ):
706
+ return False
707
+ if cond_check_repetitions_removal:
708
+ if not Filtering.check_repetitions_removal(
709
+ document,
710
+ repetitions_length,
711
+ repetitions_max_cutoff,
712
+ ):
713
+ return False
714
+ if cond_check_special_characters:
715
+ if not Filtering.check_special_characters(
716
+ document,
717
+ special_characters,
718
+ special_characters_max_cutoff,
719
+ ):
720
+ return False
721
+ if cond_check_stopwords:
722
+ if not Filtering.check_stopwords(
723
+ document,
724
+ sentencepiece_model_tok,
725
+ strip_characters,
726
+ cond_words_augmentation,
727
+ words_augmentation_group_sizes,
728
+ words_augmentation_join_char,
729
+ stopwords,
730
+ stopwords_min_cutoff,
731
+ ):
732
+ return False
733
+ if cond_check_badwords:
734
+ if not Filtering.check_badwords(
735
+ document,
736
+ sentencepiece_model_tok,
737
+ strip_characters,
738
+ cond_words_augmentation,
739
+ words_augmentation_group_sizes,
740
+ words_augmentation_join_char,
741
+ badwords,
742
+ badwords_max_cutoff,
743
+ ):
744
+ return False
745
+ if cond_check_lang_id:
746
+ if not Filtering.check_lang_id(
747
+ document,
748
+ lang_dataset_id,
749
+ model_lang_id,
750
+ lang_id_min_cutoff,
751
+ ):
752
+ return False
753
+ if cond_check_perplexity:
754
+ if not Filtering.check_perplexity(
755
+ document,
756
+ sentencepiece_model,
757
+ kenlm_model,
758
+ perplexity_max_cutoff,
759
+ ):
760
+ return False
761
+ return True
762
+
763
+
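Note on compute_perplexity_score above: the per-line scores are summed as base-10 log probabilities, the token count adds one per line for the end-of-sentence marker, and the perplexity is 10 raised to the negative average. The sketch below replays that arithmetic with a stub in place of a real KenLM model. StubModel is hypothetical and exists only so the example runs without a model file; it assumes, as the original code does, that score() returns a log10 probability.

```python
class StubModel:
    """Hypothetical stand-in for a language model whose score() returns log10 P(line)."""

    def score(self, line: str) -> float:
        # Pretend every token, including the end-of-sentence marker, has probability 0.1.
        return -1.0 * (len(line.split()) + 1)


def perplexity(document: str, model) -> float:
    log_score, length = 0.0, 0
    for line in document.split("\n"):
        log_score += model.score(line)
        length += len(line.split()) + 1  # +1 for the end-of-sentence token
    return round(10.0 ** (-log_score / length), 1)


print(perplexity("a b c\nd e", StubModel()))
```

With a uniform per-token probability of 0.1 the perplexity comes out as 10, which is a quick sanity check that the sign and the base of the exponent are right.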
764
+ class FunctionDatasetFiltering:
765
+ def __init__(
766
+ self,
767
+ lang_dataset_id,
768
+ path_fasttext_model,
769
+ path_sentencepiece_model,
770
+ path_kenlm_model,
771
+ ):
772
+ self.lang_dataset_id = lang_dataset_id
773
+ self.path_fasttext_model = path_fasttext_model
774
+ self.path_sentencepiece_model = path_sentencepiece_model
775
+ self.path_kenlm_model = path_kenlm_model
776
+
777
+ self.param = LoadParameters.load_parameters(lang_dataset_id)
778
+ self.stopwords = LoadParameters.load_stopwords(lang_dataset_id)
779
+ self.badwords = LoadParameters.load_badwords(lang_dataset_id)
780
+ self.model_lang_id = LoadParameters.load_model_lang_id(
781
+ lang_dataset_id, path_fasttext_model
782
+ )
783
+ self.sentencepiece_model = LoadParameters.load_sentencepiece_model(
784
+ lang_dataset_id, path_sentencepiece_model
785
+ )
786
+ self.sentencepiece_model_tok = (
787
+ self.sentencepiece_model if self.param["tokenization"] else None
788
+ )
789
+ self.kenlm_model = LoadParameters.load_kenlm_model(
790
+ lang_dataset_id, path_kenlm_model
791
+ )
792
+
793
+ def __call__(self, example):
794
+ keep_example = Filtering.filtering(
795
+ document=example["text"],
796
+ cond_check_number_words=self.param["cond_check_number_words"],
797
+ sentencepiece_model_tok=self.sentencepiece_model_tok,
798
+ strip_characters=self.param["strip_characters"],
799
+ number_words_min_cutoff=self.param["number_words_min_cutoff"],
800
+ number_words_max_cutoff=self.param["number_words_max_cutoff"],
801
+ cond_check_repetitions_removal=self.param["check_repetitions_removal"],
802
+ repetitions_length=self.param["repetitions_length"],
803
+ repetitions_max_cutoff=self.param["repetitions_max_cutoff"],
804
+ cond_check_special_characters=self.param["cond_check_special_characters"],
805
+ special_characters=self.param["special_characters"],
806
+ special_characters_max_cutoff=self.param["special_characters_max_cutoff"],
807
+ cond_words_augmentation=self.param["cond_words_augmentation"],
808
+ words_augmentation_group_sizes=self.param["words_augmentation_group_sizes"],
809
+ words_augmentation_join_char=self.param["words_augmentation_join_char"],
810
+ cond_check_stopwords=self.param["cond_check_stopwords"],
811
+ stopwords=self.stopwords,
812
+ stopwords_min_cutoff=self.param["stopwords_min_cutoff"],
813
+ cond_check_badwords=self.param["cond_check_badwords"],
814
+ badwords=self.badwords,
815
+ badwords_max_cutoff=self.param["badwords_max_cutoff"],
816
+ cond_check_lang_id=self.param["cond_check_lang_id"],
817
+ lang_dataset_id=self.lang_dataset_id,
818
+ model_lang_id=self.model_lang_id,
819
+ lang_id_min_cutoff=self.param["lang_id_min_cutoff"],
820
+ cond_check_perplexity=self.param["cond_check_perplexity"],
821
+ sentencepiece_model=self.sentencepiece_model,
822
+ kenlm_model=self.kenlm_model,
823
+ perplexity_max_cutoff=self.param["perplexity_max_cutoff"],
824
+ )
825
+ return keep_example
826
+
827
+ def __reduce__(self):
828
+ return (
829
+ self.__class__,
830
+ (
831
+ self.lang_dataset_id,
832
+ self.path_fasttext_model,
833
+ self.path_sentencepiece_model,
834
+ self.path_kenlm_model,
835
+ ),
836
+ )
837
+
838
+
839
+ class DatasetFiltering:
840
+ def __init__(
841
+ self,
842
+ dataset,
843
+ lang_dataset_id,
844
+ path_fasttext_model,
845
+ path_sentencepiece_model,
846
+ path_kenlm_model,
847
+ num_proc,
848
+ path_dir_save_dataset,
849
+ ):
850
+ self.ds = dataset
851
+ self.lang_dataset_id = lang_dataset_id
852
+ self.path_fasttext_model = path_fasttext_model
853
+ self.path_sentencepiece_model = path_sentencepiece_model
854
+ self.path_kenlm_model = path_kenlm_model
855
+ self.num_proc = num_proc
856
+ self.path_dir_save_dataset = path_dir_save_dataset
857
+
858
+ def modifying_documents(self):
859
+ dataset_modifying_documents = FunctionDatasetModifyingDocuments(
860
+ self.lang_dataset_id
861
+ )
862
+ self.ds = self.ds.map(dataset_modifying_documents, num_proc=self.num_proc)
863
+
864
+ def filtering(self):
865
+ func_dataset_filtering = FunctionDatasetFiltering(
866
+ self.lang_dataset_id,
867
+ self.path_fasttext_model,
868
+ self.path_sentencepiece_model,
869
+ self.path_kenlm_model,
870
+ )
871
+ self.ds = self.ds.filter(func_dataset_filtering, num_proc=self.num_proc)
872
+
873
+ def save_dataset(self):
874
+ pathlib.Path(self.path_dir_save_dataset).mkdir(parents=True, exist_ok=True)
875
+ path_dir_save_dataset = pathlib.PurePath(
876
+ self.path_dir_save_dataset, self.lang_dataset_id
877
+ )
878
+ pathlib.Path(path_dir_save_dataset).mkdir(parents=True, exist_ok=True)
879
+ self.ds.save_to_disk(path_dir_save_dataset)
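Note on compute_repetitions_ratio in the file above: the score is the share of all character n-gram occurrences that is taken up by the most frequent n-grams, where "most frequent" means the top sqrt(k) of the k distinct n-grams. The following is a minimal standalone sketch of the same computation. It uses math.sqrt and collections.Counter instead of NumPy and a dict so that it runs with the standard library only; the function name and sample strings are illustrative.

```python
import math
from collections import Counter


def repetitions_ratio(document: str, n: int) -> float:
    """Share of all character n-grams covered by the most frequent ones."""
    counts = Counter(document[i : i + n] for i in range(len(document) - n + 1))
    if not counts:
        return 0.0  # document shorter than n characters
    freqs = sorted(counts.values(), reverse=True)
    top_k = int(math.sqrt(len(freqs)))
    return sum(freqs[:top_k]) / sum(freqs)


varied = "The quick brown fox jumps over the lazy dog near the river bank."
spammy = "buy now! " * 20
print(repetitions_ratio(varied, 10), repetitions_ratio(spammy, 10))
```

A text that loops over the same phrase has few distinct n-grams, each with a high count, so its ratio is clearly higher than that of ordinary prose. The repetitions_max_cutoff parameter is the threshold applied to this value.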
languages_id.py ADDED
@@ -0,0 +1,231 @@
1
+ import pandas as pd
2
+
3
+
4
+ langs_id = [
5
+ {
6
+ "lang": "Afrikaans",
7
+ "dataset_id": "af",
8
+ "stopwords_id": "af",
9
+ "badwords_id": None,
10
+ "fasttext_id": "af",
11
+ "sentencepiece_id": "af",
12
+ "kenlm_id": "af",
13
+ },
14
+ {
15
+ "lang": "Arabic",
16
+ "dataset_id": "ar",
17
+ "stopwords_id": "ar",
18
+ "badwords_id": "ar",
19
+ "fasttext_id": "ar",
20
+ "sentencepiece_id": "ar",
21
+ "kenlm_id": "ar",
22
+ },
23
+ {
24
+ "lang": "Egyptian Arabic",
25
+ "dataset_id": "arz",
26
+ "stopwords_id": None,
27
+ "badwords_id": None,
28
+ "fasttext_id": "arz",
29
+ "sentencepiece_id": None,
30
+ "kenlm_id": None,
31
+ },
32
+ {
33
+ "lang": "Assamese",
34
+ "dataset_id": "as",
35
+ "stopwords_id": None,
36
+ "badwords_id": None,
37
+ "fasttext_id": "as",
38
+ "sentencepiece_id": None,
39
+ "kenlm_id": None,
40
+ },
41
+ {
42
+ "lang": "Bengali",
43
+ "dataset_id": "bn",
44
+ "stopwords_id": "bn",
45
+ "badwords_id": None,
46
+ "fasttext_id": "bn",
47
+ "sentencepiece_id": "bn",
48
+ "kenlm_id": "bn",
49
+ },
50
+ {
51
+ "lang": "Catalan",
52
+ "dataset_id": "ca",
53
+ "stopwords_id": "ca",
54
+ "badwords_id": "ca",
55
+ "fasttext_id": "ca",
56
+ "sentencepiece_id": "ca",
57
+ "kenlm_id": "ca",
58
+ },
59
+ {
60
+ "lang": "English",
61
+ "dataset_id": "en",
62
+ "stopwords_id": "en",
63
+ "badwords_id": "en",
64
+ "fasttext_id": "en",
65
+ "sentencepiece_id": "en",
66
+ "kenlm_id": "en",
67
+ },
68
+ {
69
+ "lang": "Spanish",
70
+ "dataset_id": "es",
71
+ "stopwords_id": "es",
72
+ "badwords_id": "es",
73
+ "fasttext_id": "es",
74
+ "sentencepiece_id": "es",
75
+ "kenlm_id": "es",
76
+ },
77
+ {
78
+ "lang": "Basque",
79
+ "dataset_id": "eu",
80
+ "stopwords_id": "eu",
81
+ "badwords_id": "eu",
82
+ "fasttext_id": "eu",
83
+ "sentencepiece_id": None,
84
+ "kenlm_id": None,
85
+ },
86
+ {
87
+ "lang": "French",
88
+ "dataset_id": "fr",
89
+ "stopwords_id": "fr",
90
+ "badwords_id": "fr",
91
+ "fasttext_id": "fr",
92
+ "sentencepiece_id": "fr",
93
+ "kenlm_id": "fr",
94
+ },
95
+ {
96
+ "lang": "Gujarati",
97
+ "dataset_id": "gu",
98
+ "stopwords_id": None,
99
+ "badwords_id": None,
100
+ "fasttext_id": "gu",
101
+ "sentencepiece_id": "gu",
102
+ "kenlm_id": "gu",
103
+ },
104
+ {
105
+ "lang": "Hindi",
106
+ "dataset_id": "hi",
107
+ "stopwords_id": "hi",
108
+ "badwords_id": "hi",
109
+ "fasttext_id": "hi",
110
+ "sentencepiece_id": "hi",
111
+ "kenlm_id": "hi",
112
+ },
113
+ {
114
+ "lang": "Indonesian",
115
+ "dataset_id": "id",
116
+ "stopwords_id": "id",
117
+ "badwords_id": "id",
118
+ "fasttext_id": "id",
119
+ "sentencepiece_id": "id",
120
+ "kenlm_id": "id",
121
+ },
122
+ {
123
+ "lang": "Kannada",
124
+ "dataset_id": "kn",
125
+ "stopwords_id": None,
126
+ "badwords_id": "kn",
127
+ "fasttext_id": "kn",
128
+ "sentencepiece_id": "kn",
129
+ "kenlm_id": "kn",
130
+ },
131
+ {
132
+ "lang": "Malayalam",
133
+ "dataset_id": "ml",
134
+ "stopwords_id": None,
135
+ "badwords_id": "ml",
136
+ "fasttext_id": "ml",
137
+ "sentencepiece_id": "ml",
138
+ "kenlm_id": "ml",
139
+ },
140
+ {
141
+ "lang": "Marathi",
142
+ "dataset_id": "mr",
143
+ "stopwords_id": "mr",
144
+ "badwords_id": "mr",
145
+ "fasttext_id": "mr",
146
+ "sentencepiece_id": "mr",
147
+ "kenlm_id": "mr",
148
+ },
149
+ {
150
+ "lang": "Portuguese",
151
+ "dataset_id": "pt",
152
+ "stopwords_id": "pt",
153
+ "badwords_id": "pt",
154
+ "fasttext_id": "pt",
155
+ "sentencepiece_id": "pt",
156
+ "kenlm_id": "pt",
157
+ },
158
+ {
159
+ "lang": "Somali",
160
+ "dataset_id": "so",
161
+ "stopwords_id": "so",
162
+ "badwords_id": None,
163
+ "fasttext_id": "so",
164
+ "sentencepiece_id": None,
165
+ "kenlm_id": None,
166
+ },
167
+ {
168
+ "lang": "Swahili",
169
+ "dataset_id": "sw",
170
+ "stopwords_id": "sw",
171
+ "badwords_id": None,
172
+ "fasttext_id": "sw",
173
+ "sentencepiece_id": None,
174
+ "kenlm_id": None,
175
+ },
176
+ {
177
+ "lang": "Tamil",
178
+ "dataset_id": "ta",
179
+ "stopwords_id": None,
180
+ "badwords_id": None,
181
+ "fasttext_id": "ta",
182
+ "sentencepiece_id": None,
183
+ "kenlm_id": None,
184
+ },
185
+ {
186
+ "lang": "Telugu",
187
+ "dataset_id": "te",
188
+ "stopwords_id": None,
189
+ "badwords_id": "te",
190
+ "fasttext_id": "te",
191
+ "sentencepiece_id": None,
192
+ "kenlm_id": None,
193
+ },
194
+ {
195
+ "lang": "Urdu",
196
+ "dataset_id": "ur",
197
+ "stopwords_id": "ur",
198
+ "badwords_id": None,
199
+ "fasttext_id": "ur",
200
+ "sentencepiece_id": None,
201
+ "kenlm_id": None,
202
+ },
203
+ {
204
+ "lang": "Vietnamese",
205
+ "dataset_id": "vi",
206
+ "stopwords_id": "vi",
207
+ "badwords_id": "vi",
208
+ "fasttext_id": "vi",
209
+ "sentencepiece_id": None,
210
+ "kenlm_id": None,
211
+ },
212
+ {
213
+ "lang": "Yoruba",
214
+ "dataset_id": "yo",
215
+ "stopwords_id": "yo",
216
+ "badwords_id": None,
217
+ "fasttext_id": "yo",
218
+ "sentencepiece_id": None,
219
+ "kenlm_id": None,
220
+ },
221
+ {
222
+ "lang": "Chinese",
223
+ "dataset_id": "zh",
224
+ "stopwords_id": "zh",
225
+ "badwords_id": "zh",
226
+ "fasttext_id": "zh",
227
+ "sentencepiece_id": "zh",
228
+ "kenlm_id": "zh",
229
+ },
230
+ ]
231
+ langs_id = pd.DataFrame(langs_id)
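Note on how this table is used: compute_lang_id_pred_score maps a fastText label to a dataset_id and falls back to "unknown", and a None entry means the corresponding resource does not exist for that language. The sketch below shows the same lookups without pandas, on a two-row excerpt of the table. The helper names dataset_id_for and has_perplexity_model are hypothetical and not part of this commit.

```python
# Two rows copied from the table above; the other fields are omitted for brevity.
langs = [
    {"lang": "Afrikaans", "dataset_id": "af", "fasttext_id": "af", "kenlm_id": "af"},
    {"lang": "Egyptian Arabic", "dataset_id": "arz", "fasttext_id": "arz", "kenlm_id": None},
]


def dataset_id_for(fasttext_id: str) -> str:
    """Map a fastText language label to a dataset id, or 'unknown' if unlisted."""
    for row in langs:
        if row["fasttext_id"] == fasttext_id:
            return row["dataset_id"]
    return "unknown"


def has_perplexity_model(dataset_id: str) -> bool:
    """A language can only be perplexity-filtered if a KenLM id is listed for it."""
    return any(
        row["dataset_id"] == dataset_id and row["kenlm_id"] is not None
        for row in langs
    )


print(dataset_id_for("arz"), dataset_id_for("xx"), has_perplexity_model("arz"))
```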
lid.176.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7e69ec5451bc261cc7844e49e4792a85d7f09c06789ec800fc4a44aec362764e
3
+ size 131266198
normalization.py ADDED
@@ -0,0 +1,52 @@
1
+ import re
2
+ from typing import Dict
3
+
4
+
5
+ non_printing_characters_re = re.compile(
6
+ f"[{''.join(map(chr, list(range(0,32)) + list(range(127,160))))}]"
7
+ )
8
+
9
+ digits_re: re.Pattern = re.compile(r"\d")
10
+
11
+ unicode_punctuation: Dict[str, str] = {
12
+ "，": ",",
13
+ "。": ".",
14
+ "、": ",",
15
+ "„": '"',
16
+ "”": '"',
17
+ "“": '"',
18
+ "«": '"',
19
+ "»": '"',
20
+ "１": '"',
21
+ "」": '"',
22
+ "「": '"',
23
+ "《": '"',
24
+ "》": '"',
25
+ "´": "'",
26
+ "∶": ":",
27
+ "：": ":",
28
+ "？": "?",
29
+ "！": "!",
30
+ "（": "(",
31
+ "）": ")",
32
+ "；": ";",
33
+ "–": "-",
34
+ "—": " - ",
35
+ "．": ". ",
36
+ "～": "~",
37
+ "’": "'",
38
+ "…": "...",
39
+ "━": "-",
40
+ "〈": "<",
41
+ "〉": ">",
42
+ "【": "[",
43
+ "】": "]",
44
+ "％": "%",
45
+ "►": "-",
46
+ }
47
+
48
+ normalization = {
49
+ "non_printing_characters_re": non_printing_characters_re,
50
+ "digits_re": digits_re,
51
+ "unicode_punctuation": unicode_punctuation,
52
+ }
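Note on how these tables might be applied: the actual ModifyingDocuments.normalization method is not part of this section, so the following is only a sketch under that caveat. It maps each character through a small excerpt of unicode_punctuation and then deletes non-printing characters with the same regular expression as above. The helper names replace_unicode_punctuation and remove_non_printing are hypothetical.

```python
import re

# Illustrative excerpt; the full mapping is the unicode_punctuation dict above.
unicode_punctuation = {"«": '"', "»": '"', "…": "...", "–": "-"}

# Same character class as in normalization.py: C0 controls, DEL, and C1 controls.
non_printing_characters_re = re.compile(
    f"[{''.join(map(chr, list(range(0, 32)) + list(range(127, 160))))}]"
)


def replace_unicode_punctuation(text: str) -> str:
    """Replace each mapped character; leave every other character unchanged."""
    return "".join(unicode_punctuation.get(ch, ch) for ch in text)


def remove_non_printing(text: str) -> str:
    """Delete control characters. Note that this includes tabs and newlines."""
    return non_printing_characters_re.sub("", text)


sample = "«Hello»\x00 world… 2019–2020"
cleaned = remove_non_printing(replace_unicode_punctuation(sample))
print(cleaned)
```

Because the character class covers code points 0 to 31, removing non-printing characters also removes "\n" and "\t". That is probably why modifying_documents passes remove_non_printing_characters=False, while the perplexity computation, which works on a flattened and lower-cased copy of the document, passes True.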
requirements.txt → packages.txt RENAMED
File without changes
parameters_filtering.py ADDED
@@ -0,0 +1,852 @@
1
+ import string
2
+ import emoji
3
+
4
+
5
+ main_special_characters = string.punctuation + string.digits + string.whitespace
6
+ other_special_characters = (
7
+ "’ “— ™ – •‘œ    ˜ ‚ƒ„’“”–ー一▬…✦�­£​•€«»°·═"
8
+ "×士^˘⇓↓↑←→()§″′´¿−±∈¢ø‚„½¼¾¹²³―⁃,ˌ¸‹›ʺˈʻ¦‐⠀‰……‑≤≥‖"
9
+ "◆●■►▼▲▴∆▻¡★☆✱ːº。¯˜¥ɪ≈†上ン:∼⁄・♡✓⊕․.⋅÷1‟;،、¨ाাी्े◦˚"
10
+ "゜ʼ≖ʼ¤ッツシ℃√!【】‿∞➤~πه۩☛₨➩☻๑٪♥ıॽ《‘©﴿٬x?▷Г♫∟™ª₪®「—"
11
+ "❖」﴾》"
12
+ )
13
+ emoji = list(emoji.UNICODE_EMOJI["en"].keys())
14
+
15
+ special_characters_default = set(main_special_characters + other_special_characters)
16
+ special_characters_default.update(emoji)
17
+
18
+
19
+ parameters_filtering_default = {
20
+ "cond_uniform_whitespace": True,
21
+ "cond_replace_unicode_punctuation": False,
22
+ "cond_remove_words_with_incorrect_substrings": False,
23
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
24
+ "cond_remove_long_words": False,
25
+ "length_word_max_cutoff": 50,
26
+ "cond_check_number_words": True,
27
+ "tokenization": False,
28
+ "strip_characters": special_characters_default,
29
+ "number_words_min_cutoff": 1,
30
+ "number_words_max_cutoff": 100000,
31
+ "check_repetitions_removal": True,
32
+ "repetitions_length": 10,
33
+ "repetitions_max_cutoff": 0.106,
34
+ "cond_check_special_characters": True,
35
+ "special_characters": special_characters_default,
36
+ "special_characters_max_cutoff": 0.4,
37
+ "cond_words_augmentation": False,
38
+ "words_augmentation_group_sizes": [],
39
+ "words_augmentation_join_char": "",
40
+ "cond_check_stopwords": False,
41
+ "stopwords_min_cutoff": 0,
42
+ "cond_check_badwords": False,
43
+ "badwords_max_cutoff": 0.2,
44
+ "cond_check_lang_id": True,
45
+ "lang_id_min_cutoff": 0.70,
46
+ "cond_check_perplexity": False,
47
+ "perplexity_max_cutoff": 3000000,
48
+ }
49
+
50
+ parameters_filtering_af = {
51
+ "cond_uniform_whitespace": True,
52
+ "cond_replace_unicode_punctuation": False,
53
+ "cond_remove_words_with_incorrect_substrings": False,
54
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
55
+ "cond_remove_long_words": True,
56
+ "length_word_max_cutoff": 25,
57
+ "cond_check_number_words": True,
58
+ "tokenization": False,
59
+ "strip_characters": special_characters_default,
60
+ "number_words_min_cutoff": 1,
61
+ "number_words_max_cutoff": 100000,
62
+ "check_repetitions_removal": True,
63
+ "repetitions_length": 10,
64
+ "repetitions_max_cutoff": 0.106,
65
+ "cond_check_special_characters": True,
66
+ "special_characters": special_characters_default,
67
+ "special_characters_max_cutoff": 0.3,
68
+ "cond_words_augmentation": False,
69
+ "words_augmentation_group_sizes": [],
70
+ "words_augmentation_join_char": "",
71
+ "cond_check_stopwords": True,
72
+ "stopwords_min_cutoff": 0,
73
+ "cond_check_badwords": False,
74
+ "badwords_max_cutoff": 0.2,
75
+ "cond_check_lang_id": True,
76
+ "lang_id_min_cutoff": 0.6,
77
+ "cond_check_perplexity": True,
78
+ "perplexity_max_cutoff": 3000000,
79
+ }
80
+
81
+ parameters_filtering_ar = {
82
+ "cond_uniform_whitespace": True,
83
+ "cond_replace_unicode_punctuation": False,
84
+ "cond_remove_words_with_incorrect_substrings": False,
85
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
86
+ "cond_remove_long_words": True,
87
+ "length_word_max_cutoff": 25,
88
+ "cond_check_number_words": True,
89
+ "tokenization": False,
90
+ "strip_characters": special_characters_default,
91
+ "number_words_min_cutoff": 1,
92
+ "number_words_max_cutoff": 100000,
93
+ "check_repetitions_removal": True,
94
+ "repetitions_length": 10,
95
+ "repetitions_max_cutoff": 0.106,
96
+ "cond_check_special_characters": True,
97
+ "special_characters": special_characters_default,
98
+ "special_characters_max_cutoff": 0.45,
99
+ "cond_words_augmentation": False,
100
+ "words_augmentation_group_sizes": [],
101
+ "words_augmentation_join_char": "",
102
+ "cond_check_stopwords": True,
103
+ "stopwords_min_cutoff": 0,
104
+ "cond_check_badwords": False,
105
+ "badwords_max_cutoff": 0.2,
106
+ "cond_check_lang_id": True,
107
+ "lang_id_min_cutoff": 0.75,
108
+ "cond_check_perplexity": True,
109
+ "perplexity_max_cutoff": 1000000,
110
+ }
111
+
112
+ parameters_filtering_arz = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": True,
+     "length_word_max_cutoff": 25,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.5,
+     "cond_words_augmentation": False,
+     "words_augmentation_group_sizes": [],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": True,
+     "stopwords_min_cutoff": 0,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": False,
+     "perplexity_max_cutoff": 3000000,
+ }
+
+ parameters_filtering_as = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": True,
+     "length_word_max_cutoff": 25,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.25,
+     "cond_words_augmentation": False,
+     "words_augmentation_group_sizes": [],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": True,
+     "stopwords_min_cutoff": 0,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": False,
+     "perplexity_max_cutoff": 3000000,
+ }
+
+ parameters_filtering_bn = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": True,
+     "length_word_max_cutoff": 30,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.275,
+     "cond_words_augmentation": False,
+     "words_augmentation_group_sizes": [],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": True,
+     "stopwords_min_cutoff": 0.05,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": False,
+     "perplexity_max_cutoff": 575000,
+ }
+
+ parameters_filtering_ca = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": True,
+     "length_word_max_cutoff": 30,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.35,
+     "cond_words_augmentation": False,
+     "words_augmentation_group_sizes": [],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": True,
+     "stopwords_min_cutoff": 0,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": True,
+     "perplexity_max_cutoff": 1750000,
+ }
+
+ parameters_filtering_en = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": True,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": True,
+     "length_word_max_cutoff": 25,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 20,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.4,
+     "cond_words_augmentation": False,
+     "words_augmentation_group_sizes": [],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": True,
+     "stopwords_min_cutoff": 0.3,
+     "cond_check_badwords": True,
+     "badwords_max_cutoff": 0.045,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.80,
+     "cond_check_perplexity": True,
+     "perplexity_max_cutoff": 2500,
+ }
+
+ parameters_filtering_es = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": True,
+     "length_word_max_cutoff": 30,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.3,
+     "cond_words_augmentation": False,
+     "words_augmentation_group_sizes": [],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": True,
+     "stopwords_min_cutoff": 0.2,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": True,
+     "perplexity_max_cutoff": 2500000,
+ }
+
+ parameters_filtering_eu = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": True,
+     "length_word_max_cutoff": 35,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.3,
+     "cond_words_augmentation": False,
+     "words_augmentation_group_sizes": [],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": True,
+     "stopwords_min_cutoff": 0,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": False,
+     "perplexity_max_cutoff": 3000000,
+ }
+
+ parameters_filtering_fr = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": True,
+     "length_word_max_cutoff": 30,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.35,
+     "cond_words_augmentation": False,
+     "words_augmentation_group_sizes": [],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": True,
+     "stopwords_min_cutoff": 0.15,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": True,
+     "perplexity_max_cutoff": 3000000,
+ }
+
+ parameters_filtering_gu = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": True,
+     "length_word_max_cutoff": 30,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.3,
+     "cond_words_augmentation": False,
+     "words_augmentation_group_sizes": [],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": True,
+     "stopwords_min_cutoff": 0,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": True,
+     "perplexity_max_cutoff": 250000,
+ }
+
+ parameters_filtering_hi = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": True,
+     "length_word_max_cutoff": 25,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.35,
+     "cond_words_augmentation": False,
+     "words_augmentation_group_sizes": [],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": True,
+     "stopwords_min_cutoff": 0,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": True,
+     "perplexity_max_cutoff": 600000,
+ }
+
+ parameters_filtering_id = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": True,
+     "length_word_max_cutoff": 30,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.25,
+     "cond_words_augmentation": False,
+     "words_augmentation_group_sizes": [],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": True,
+     "stopwords_min_cutoff": 0.25,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": True,
+     "perplexity_max_cutoff": 2500000,
+ }
+
+ parameters_filtering_kn = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": True,
+     "length_word_max_cutoff": 50,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.25,
+     "cond_words_augmentation": False,
+     "words_augmentation_group_sizes": [],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": True,
+     "stopwords_min_cutoff": 0,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": True,
+     "perplexity_max_cutoff": 400000,
+ }
+
+ parameters_filtering_ml = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": True,
+     "length_word_max_cutoff": 50,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.2,
+     "cond_words_augmentation": False,
+     "words_augmentation_group_sizes": [],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": True,
+     "stopwords_min_cutoff": 0,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": True,
+     "perplexity_max_cutoff": 1600000,
+ }
+
+ parameters_filtering_mr = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": True,
+     "length_word_max_cutoff": 30,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.25,
+     "cond_words_augmentation": False,
+     "words_augmentation_group_sizes": [],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": True,
+     "stopwords_min_cutoff": 0,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": True,
+     "perplexity_max_cutoff": 425000,
+ }
+
+ parameters_filtering_pt = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": True,
+     "length_word_max_cutoff": 30,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.3,
+     "cond_words_augmentation": False,
+     "words_augmentation_group_sizes": [],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": True,
+     "stopwords_min_cutoff": 0.15,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": True,
+     "perplexity_max_cutoff": 3000000,
+ }
+
+ parameters_filtering_so = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": False,
+     "length_word_max_cutoff": 1000,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.3,
+     "cond_words_augmentation": False,
+     "words_augmentation_group_sizes": [],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": False,
+     "stopwords_min_cutoff": 0,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": False,
+     "perplexity_max_cutoff": 3000000,
+ }
+
+ parameters_filtering_sw = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": True,
+     "length_word_max_cutoff": 30,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.275,
+     "cond_words_augmentation": False,
+     "words_augmentation_group_sizes": [],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": True,
+     "stopwords_min_cutoff": 0,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": False,
+     "perplexity_max_cutoff": 3000000,
+ }
+
+ parameters_filtering_ta = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": True,
+     "length_word_max_cutoff": 50,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.25,
+     "cond_words_augmentation": False,
+     "words_augmentation_group_sizes": [],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": True,
+     "stopwords_min_cutoff": 0,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": False,
+     "perplexity_max_cutoff": 3000000,
+ }
+
+ parameters_filtering_te = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": True,
+     "length_word_max_cutoff": 35,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.25,
+     "cond_words_augmentation": False,
+     "words_augmentation_group_sizes": [],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": True,
+     "stopwords_min_cutoff": 0,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": False,
+     "perplexity_max_cutoff": 3000000,
+ }
+
+ parameters_filtering_ur = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": True,
+     "length_word_max_cutoff": 30,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.4,
+     "cond_words_augmentation": False,
+     "words_augmentation_group_sizes": [],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": True,
+     "stopwords_min_cutoff": 0,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": False,
+     "perplexity_max_cutoff": 3000000,
+ }
+
+ parameters_filtering_vi = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": True,
+     "length_word_max_cutoff": 30,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.35,
+     "cond_words_augmentation": True,
+     "words_augmentation_group_sizes": [2, 3],
+     "words_augmentation_join_char": " ",
+     "cond_check_stopwords": True,
+     "stopwords_min_cutoff": 0,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": False,
+     "perplexity_max_cutoff": 3000000,
+ }
+
+ parameters_filtering_yo = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": True,
+     "length_word_max_cutoff": 30,
+     "cond_check_number_words": True,
+     "tokenization": False,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.3,
+     "cond_words_augmentation": False,
+     "words_augmentation_group_sizes": [],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": True,
+     "stopwords_min_cutoff": 0,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": False,
+     "perplexity_max_cutoff": 3000000,
+ }
+
+ parameters_filtering_zh = {
+     "cond_uniform_whitespace": True,
+     "cond_replace_unicode_punctuation": False,
+     "cond_remove_words_with_incorrect_substrings": False,
+     "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
+     "cond_remove_long_words": False,
+     "length_word_max_cutoff": 1000,
+     "cond_check_number_words": True,
+     "tokenization": True,
+     "strip_characters": special_characters_default,
+     "number_words_min_cutoff": 1,
+     "number_words_max_cutoff": 100000,
+     "check_repetitions_removal": True,
+     "repetitions_length": 10,
+     "repetitions_max_cutoff": 0.106,
+     "cond_check_special_characters": True,
+     "special_characters": special_characters_default,
+     "special_characters_max_cutoff": 0.4,
+     "cond_words_augmentation": True,
+     "words_augmentation_group_sizes": [2, 3],
+     "words_augmentation_join_char": "",
+     "cond_check_stopwords": False,
+     "stopwords_min_cutoff": 0,
+     "cond_check_badwords": False,
+     "badwords_max_cutoff": 0.2,
+     "cond_check_lang_id": True,
+     "lang_id_min_cutoff": 0.75,
+     "cond_check_perplexity": False,
+     "perplexity_max_cutoff": 3000000,
+ }
+
+ parameters_filtering = {
+     "default": parameters_filtering_default,
+     "af": parameters_filtering_af,
+     "ar": parameters_filtering_ar,
+     "arz": parameters_filtering_arz,
+     "as": parameters_filtering_as,
+     "bn": parameters_filtering_bn,
+     "ca": parameters_filtering_ca,
+     "en": parameters_filtering_en,
+     "es": parameters_filtering_es,
+     "eu": parameters_filtering_eu,
+     "fr": parameters_filtering_fr,
+     "gu": parameters_filtering_gu,
+     "hi": parameters_filtering_hi,
+     "id": parameters_filtering_id,
+     "kn": parameters_filtering_kn,
+     "ml": parameters_filtering_ml,
+     "mr": parameters_filtering_mr,
+     "pt": parameters_filtering_pt,
+     "so": parameters_filtering_so,
+     "sw": parameters_filtering_sw,
+     "ta": parameters_filtering_ta,
+     "te": parameters_filtering_te,
+     "ur": parameters_filtering_ur,
+     "vi": parameters_filtering_vi,
+     "yo": parameters_filtering_yo,
+     "zh": parameters_filtering_zh,
+ }
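
The `parameters_filtering` registry above maps a language code to its parameter dictionary and also carries a `"default"` entry. A minimal sketch of how a caller might look up parameters with a fallback to `"default"` follows. The helper name `get_filtering_parameters` is hypothetical and does not appear in this commit, and the stand-in dictionaries keep only three keys; the `"en"` values are copied from the diff, while the `"default"` values are placeholders because the real ones are not shown in this part of the diff.

```python
# Hypothetical stand-ins for the dictionaries defined in the diff.
# "en" values are copied from parameters_filtering_en; "default" values
# are placeholders, not the real defaults.
parameters_filtering = {
    "default": {
        "lang_id_min_cutoff": 0.75,         # placeholder
        "cond_check_perplexity": False,     # placeholder
        "perplexity_max_cutoff": 3000000,   # placeholder
    },
    "en": {
        "lang_id_min_cutoff": 0.80,
        "cond_check_perplexity": True,
        "perplexity_max_cutoff": 2500,
    },
}


def get_filtering_parameters(lang_id, registry=parameters_filtering):
    """Return the parameter dict for lang_id, or the "default" entry
    when the language has no dedicated dictionary (assumed behaviour)."""
    return registry.get(lang_id, registry["default"])


params_en = get_filtering_parameters("en")
params_unknown = get_filtering_parameters("xx")  # "xx" is not a registered code
print(params_en["perplexity_max_cutoff"])
print(params_unknown is parameters_filtering["default"])
```

Whether the application actually falls back to `"default"` for unlisted languages is an assumption here; the commit only shows that the entry exists.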
stopwords.py ADDED
The diff for this file is too large to render. See raw diff