Commit c3c5ad5 by belenedgar
1 Parent(s): b3ac55a

Add BERTopic model

Files changed (4)
  1. README.md +72 -0
  2. config.json +15 -0
  3. topic_embeddings.safetensors +3 -0
  4. topics.json +416 -0
README.md ADDED
@@ -0,0 +1,72 @@
+
+ ---
+ tags:
+ - bertopic
+ library_name: bertopic
+ pipeline_tag: text-classification
+ ---
+
+ # transformers_issues_topics
+
+ This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
+ BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
+
+ ## Usage
+
+ To use this model, please install BERTopic:
+
+ ```
+ pip install -U bertopic
+ ```
+
+ You can use the model as follows:
+
+ ```python
+ from bertopic import BERTopic
+ topic_model = BERTopic.load("belenedgar/transformers_issues_topics")
+
+ topic_model.get_topic_info()
+ ```
+
+ ## Topic overview
+
+ * Number of topics: 5
+ * Number of training documents: 156
+
+ <details>
+ <summary>Click here for an overview of all topics.</summary>
+
+ | Topic ID | Topic Keywords | Topic Frequency | Label |
+ |----------|----------------|-----------------|-------|
+ | -1 | extremism - extremist - terrorism - radical - radicalization | 21 | -1_extremism_extremist_terrorism_radical |
+ | 0 | phishing - theft - scammers - security - fraud | 17 | 0_phishing_theft_scammers_security |
+ | 1 | addiction - violence - cyber - content - presence | 54 | 1_addiction_violence_cyber_content |
+ | 2 | cyberbullying - bullying - cyber - cyberstalking - harassment | 39 | 2_cyberbullying_bullying_cyber_cyberstalking |
+ | 3 | profanity - derogatory - vulgarity - hate - offensive | 25 | 3_profanity_derogatory_vulgarity_hate |
+
+ </details>
+
+ ## Training hyperparameters
+
+ * calculate_probabilities: False
+ * language: english
+ * low_memory: False
+ * min_topic_size: 10
+ * n_gram_range: (1, 1)
+ * nr_topics: None
+ * seed_topic_list: None
+ * top_n_words: 10
+ * verbose: False
+
+ ## Framework versions
+
+ * Numpy: 1.24.4
+ * HDBSCAN: 0.8.33
+ * UMAP: 0.5.3
+ * Pandas: 2.0.3
+ * Scikit-Learn: 1.3.0
+ * Sentence-transformers: 2.2.2
+ * Transformers: 4.31.0
+ * Numba: 0.57.1
+ * Plotly: 5.15.0
+ * Python: 3.10.10
config.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "calculate_probabilities": false,
+   "language": "english",
+   "low_memory": false,
+   "min_topic_size": 10,
+   "n_gram_range": [
+     1,
+     1
+   ],
+   "nr_topics": null,
+   "seed_topic_list": null,
+   "top_n_words": 10,
+   "verbose": false,
+   "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
+ }
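The keys in config.json correspond to the hyperparameters listed in README.md, with one type difference: JSON has no tuple, so `n_gram_range` is stored as a two-element list. A minimal sketch of reading the file back into Python-native values, assuming the config text from this diff. The helper name `config_to_kwargs` is hypothetical, and this is not a description of how `BERTopic.load` handles the file internally.

```python
import json

# Contents of config.json as added in this commit.
CONFIG_TEXT = """
{
  "calculate_probabilities": false,
  "language": "english",
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [1, 1],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": false,
  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
}
"""


def config_to_kwargs(text):
    """Parse the config and restore Python-native types (hypothetical helper)."""
    config = json.loads(text)
    # JSON stores the n-gram range as a list; README.md reports it as a tuple.
    config["n_gram_range"] = tuple(config["n_gram_range"])
    return config


kwargs = config_to_kwargs(CONFIG_TEXT)
print(kwargs["n_gram_range"])     # (1, 1)
print(kwargs["embedding_model"])  # sentence-transformers/all-MiniLM-L6-v2
```

JSON `null` and `false` come back as `None` and `False`, which is why the README shows `nr_topics: None` and `verbose: False` for the same settings.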
topic_embeddings.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:52be1790bfea1860793d98a9bcefec00f5850462095903c26e3ba1152425e3b7
+ size 7768
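The three added lines are a Git LFS pointer, not the tensor data itself: the repository stores the spec version, the SHA-256 of the real file, and its size in bytes. A minimal sketch of parsing such a pointer, assuming the `key value` line layout shown in this diff. The function name `parse_lfs_pointer` and the returned dictionary keys are illustrative.

```python
# Pointer text as added for topic_embeddings.safetensors in this commit.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:52be1790bfea1860793d98a9bcefec00f5850462095903c26e3ba1152425e3b7
size 7768
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line and break the oid into algorithm and digest."""
    fields = dict(
        line.split(" ", 1) for line in text.splitlines() if line.strip()
    )
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["hash_algo"], pointer["size"])  # sha256 7768
```

After downloading the actual file, its SHA-256 digest and byte length can be compared against `digest` and `size` to confirm the download is intact.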
topics.json ADDED
@@ -0,0 +1,416 @@
+ {
+   "topic_representations": {
+     "-1": [
+       [
+         "extremism",
+         0.4750615656375885
+       ],
+       [
+         "extremist",
+         0.4353869557380676
+       ],
+       [
+         "terrorism",
+         0.42795926332473755
+       ],
+       [
+         "radical",
+         0.3870584964752197
+       ],
+       [
+         "radicalization",
+         0.3648340106010437
+       ],
+       [
+         "undermine",
+         0.33723917603492737
+       ],
+       [
+         "propaganda",
+         0.3277747631072998
+       ],
+       [
+         "violent",
+         0.3236936330795288
+       ],
+       [
+         "violence",
+         0.3047242760658264
+       ],
+       [
+         "political",
+         0.29089266061782837
+       ]
+     ],
+     "0": [
+       [
+         "phishing",
+         0.4283202886581421
+       ],
+       [
+         "theft",
+         0.35850489139556885
+       ],
+       [
+         "scammers",
+         0.3532814383506775
+       ],
+       [
+         "security",
+         0.34436333179473877
+       ],
+       [
+         "fraud",
+         0.3367052674293518
+       ],
+       [
+         "fraudulent",
+         0.3361189067363739
+       ],
+       [
+         "passwords",
+         0.33049651980400085
+       ],
+       [
+         "privacy",
+         0.31392160058021545
+       ],
+       [
+         "secure",
+         0.30753785371780396
+       ],
+       [
+         "breaches",
+         0.30685776472091675
+       ]
+     ],
+     "1": [
+       [
+         "addiction",
+         0.4357987940311432
+       ],
+       [
+         "violence",
+         0.3492211401462555
+       ],
+       [
+         "cyber",
+         0.34194207191467285
+       ],
+       [
+         "content",
+         0.3365675210952759
+       ],
+       [
+         "presence",
+         0.33245617151260376
+       ],
+       [
+         "violent",
+         0.3288145661354065
+       ],
+       [
+         "social",
+         0.31930094957351685
+       ],
+       [
+         "screen",
+         0.31717175245285034
+       ],
+       [
+         "persona",
+         0.31366002559661865
+       ],
+       [
+         "media",
+         0.3113996684551239
+       ]
+     ],
+     "2": [
+       [
+         "cyberbullying",
+         0.4925338923931122
+       ],
+       [
+         "bullying",
+         0.4540230333805084
+       ],
+       [
+         "cyber",
+         0.43500053882598877
+       ],
+       [
+         "cyberstalking",
+         0.4348675608634949
+       ],
+       [
+         "harassment",
+         0.4151962995529175
+       ],
+       [
+         "predators",
+         0.36431488394737244
+       ],
+       [
+         "abuse",
+         0.35895395278930664
+       ],
+       [
+         "predatory",
+         0.3434147238731384
+       ],
+       [
+         "behavior",
+         0.3418422341346741
+       ],
+       [
+         "predator",
+         0.318901389837265
+       ]
+     ],
+     "3": [
+       [
+         "profanity",
+         0.4581272602081299
+       ],
+       [
+         "derogatory",
+         0.37975403666496277
+       ],
+       [
+         "vulgarity",
+         0.3786264657974243
+       ],
+       [
+         "hate",
+         0.3713056445121765
+       ],
+       [
+         "offensive",
+         0.36522164940834045
+       ],
+       [
+         "words",
+         0.36498433351516724
+       ],
+       [
+         "vulgar",
+         0.353113055229187
+       ],
+       [
+         "civility",
+         0.3384415805339813
+       ],
+       [
+         "obscenity",
+         0.3357018828392029
+       ],
+       [
+         "speech",
+         0.3351168930530548
+       ]
+     ]
+   },
+   "topics": [
+     1,
+     1,
+     2,
+     2,
+     1,
+     -1,
+     3,
+     0,
+     2,
+     -1,
+     3,
+     2,
+     0,
+     0,
+     0,
+     -1,
+     0,
+     2,
+     0,
+     2,
+     3,
+     -1,
+     3,
+     0,
+     0,
+     0,
+     0,
+     2,
+     3,
+     1,
+     1,
+     1,
+     0,
+     -1,
+     3,
+     1,
+     2,
+     1,
+     1,
+     0,
+     0,
+     3,
+     0,
+     0,
+     0,
+     3,
+     0,
+     1,
+     3,
+     -1,
+     2,
+     0,
+     -1,
+     3,
+     0,
+     -1,
+     1,
+     2,
+     2,
+     1,
+     -1,
+     0,
+     2,
+     0,
+     2,
+     1,
+     1,
+     1,
+     -1,
+     0,
+     3,
+     0,
+     0,
+     -1,
+     2,
+     3,
+     3,
+     1,
+     2,
+     0,
+     0,
+     1,
+     2,
+     1,
+     2,
+     3,
+     0,
+     -1,
+     0,
+     2,
+     -1,
+     0,
+     1,
+     0,
+     0,
+     0,
+     0,
+     1,
+     1,
+     2,
+     1,
+     1,
+     3,
+     1,
+     2,
+     -1,
+     1,
+     1,
+     0,
+     0,
+     1,
+     0,
+     1,
+     1,
+     3,
+     1,
+     0,
+     0,
+     1,
+     1,
+     1,
+     3,
+     1,
+     2,
+     0,
+     0,
+     3,
+     0,
+     0,
+     0,
+     0,
+     3,
+     0,
+     0,
+     1,
+     3,
+     1,
+     2,
+     -1,
+     1,
+     1,
+     0,
+     0,
+     0,
+     0,
+     0,
+     -1,
+     3,
+     0,
+     0,
+     2,
+     2,
+     2,
+     1,
+     -1,
+     0
+   ],
+   "topic_sizes": {
+     "1": 39,
+     "2": 25,
+     "-1": 17,
+     "3": 21,
+     "0": 54
+   },
+   "topic_mapper": [
+     [
+       -1,
+       -1,
+       -1
+     ],
+     [
+       0,
+       0,
+       0
+     ],
+     [
+       1,
+       1,
+       3
+     ],
+     [
+       2,
+       2,
+       1
+     ],
+     [
+       3,
+       3,
+       2
+     ]
+   ],
+   "topic_labels": {
+     "-1": "-1_extremism_extremist_terrorism_radical",
+     "0": "0_phishing_theft_scammers_security",
+     "1": "1_addiction_violence_cyber_content",
+     "2": "2_cyberbullying_bullying_cyber_cyberstalking",
+     "3": "3_profanity_derogatory_vulgarity_hate"
+   },
+   "custom_labels": null,
+   "_outliers": 1,
+   "topic_aspects": {}
+ }
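In topics.json, `topic_sizes` should equal a tally of the per-document assignments in `topics`. A minimal sketch of that consistency check, with the 156 assignments copied from this diff; the variable names are illustrative. On this data the tally agrees with `topic_sizes` (54 documents for topic 0, for example), while the Topic Frequency column in README.md lists different values for the same IDs (17 for topic 0), so that table may be worth regenerating.

```python
from collections import Counter

# Per-document topic assignments from the "topics" array in topics.json.
topics = [
    1, 1, 2, 2, 1, -1, 3, 0, 2, -1,
    3, 2, 0, 0, 0, -1, 0, 2, 0, 2,
    3, -1, 3, 0, 0, 0, 0, 2, 3, 1,
    1, 1, 0, -1, 3, 1, 2, 1, 1, 0,
    0, 3, 0, 0, 0, 3, 0, 1, 3, -1,
    2, 0, -1, 3, 0, -1, 1, 2, 2, 1,
    -1, 0, 2, 0, 2, 1, 1, 1, -1, 0,
    3, 0, 0, -1, 2, 3, 3, 1, 2, 0,
    0, 1, 2, 1, 2, 3, 0, -1, 0, 2,
    -1, 0, 1, 0, 0, 0, 0, 1, 1, 2,
    1, 1, 3, 1, 2, -1, 1, 1, 0, 0,
    1, 0, 1, 1, 3, 1, 0, 0, 1, 1,
    1, 3, 1, 2, 0, 0, 3, 0, 0, 0,
    0, 3, 0, 0, 1, 3, 1, 2, -1, 1,
    1, 0, 0, 0, 0, 0, -1, 3, 0, 0,
    2, 2, 2, 1, -1, 0,
]

# "topic_sizes" as stored in topics.json (JSON object keys are strings).
topic_sizes = {"1": 39, "2": 25, "-1": 17, "3": 21, "0": 54}

# Recompute the sizes from the assignments and compare.
counts = Counter(topics)
recomputed = {str(topic_id): n for topic_id, n in counts.items()}

print(len(topics))                # 156
print(recomputed == topic_sizes)  # True
```

The document count of 156 also matches "Number of training documents" in README.md.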