CrabInHoney committed on
Commit
d4ae66c
·
verified ·
1 Parent(s): 89e5d1d

Upload README.md

Files changed (1)
  1. README.md +97 -3
README.md CHANGED
@@ -1,3 +1,97 @@
- ---
- license: apache-2.0
- ---
+ ---
+ datasets:
+ - ealvaradob/phishing-dataset
+ language:
+ - en
+ base_model:
+ - CrabInHoney/urlbert-tiny-base-v3
+ pipeline_tag: text-classification
+ tags:
+ - url
+ - urls
+ - links
+ - classification
+ - tiny
+ - phishing
+ - urlbert
+ ---
+ This is a very small BERT model that classifies URLs as phishing or non-phishing.
+
+ It is an updated, lighter version of the previous URL classification model.
+
+ Old version: https://huggingface.co/CrabInHoney/urlbert-tiny-v2-phishing-classifier
+ ##### Comparison with the previous version of the urlbert phishing classifier:
+
+ | Version | Accuracy | Precision | Recall | F1-score |
+ | ------------ | ------------ | ------------ | ------------ | ------------ |
+ | v2 | 0.9665 | 0.9756 | 0.9522 | 0.9637 |
+ | **v3** | **0.9819** | **0.9876** | **0.9734** | **0.9805** |
+
+
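As a quick sanity check on the table above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values. The helper name `f1_score` is illustrative, and a difference in the last digit is expected because the inputs are already rounded to four decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the comparison table above.
reported = {
    "v2": {"precision": 0.9756, "recall": 0.9522, "f1": 0.9637},
    "v3": {"precision": 0.9876, "recall": 0.9734, "f1": 0.9805},
}

for version, m in reported.items():
    recomputed = f1_score(m["precision"], m["recall"])
    print(f"{version}: reported F1 = {m['f1']:.4f}, recomputed F1 = {recomputed:.4f}")
```

Both recomputed values agree with the reported F1-scores to within about 1e-4, which is consistent with rounding of the inputs.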
+ Model size
+
+ 3.69M params
+
+ Tensor type
+
+ F32
+
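For a rough sense of what these two numbers mean together: F32 stores each parameter in 4 bytes, so the weights alone take about 3.69M × 4 bytes. A minimal sketch of that arithmetic follows. It ignores tokenizer files, activations, and framework overhead, and the names `BYTES_PER_PARAM` and `weight_size_bytes` are illustrative.

```python
# Bytes used per parameter for a few common tensor types.
BYTES_PER_PARAM = {"F32": 4, "F16": 2, "BF16": 2}


def weight_size_bytes(num_params: int, dtype: str = "F32") -> int:
    """Approximate size of the raw weights, in bytes."""
    return num_params * BYTES_PER_PARAM[dtype]


num_params = 3_690_000  # "3.69M params", rounded as reported above
size = weight_size_bytes(num_params, "F32")
print(f"{size / 1e6:.2f} MB ({size / 2**20:.2f} MiB)")  # prints "14.76 MB (14.08 MiB)"
```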
+ [Dataset](https://huggingface.co/datasets/ealvaradob/phishing-dataset "Dataset")
+ (urls.json only)
+
+ Example:
+
+
+
+ from transformers import BertTokenizerFast, BertForSequenceClassification, pipeline
+ import torch
+
+ device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+ print(f"Using device: {device}")
+
+ model_name = "CrabInHoney/urlbert-tiny-v3-phishing-classifier"
+
+ tokenizer = BertTokenizerFast.from_pretrained(model_name)
+ model = BertForSequenceClassification.from_pretrained(model_name)
+ model.to(device)
+
+ classifier = pipeline(
+     "text-classification",
+     model=model,
+     tokenizer=tokenizer,
+     device=0 if torch.cuda.is_available() else -1,
+     return_all_scores=True  # return the score of every class, not just the top one
+ )
+
+ test_urls = [
+     "huggingface.co/",
+     "hu991ngface.com.ru/"
+ ]
+
+ label_mapping = {"LABEL_0": "good", "LABEL_1": "phishing"}
+
+ for url in test_urls:
+     results = classifier(url)
+     print(f"\nURL: {url}")
+     for result in results[0]:
+         label = result['label']
+         score = result['score']
+         friendly_label = label_mapping.get(label, label)
+         print(f"Class: {friendly_label}, probability: {score:.4f}")
+
+
+ Using device: cuda
+
+ URL: huggingface.co/
+
+ Class: good, probability: 0.9723
+
+ Class: phishing, probability: 0.0277
+
+ URL: hu991ngface.com.ru/
+
+ Class: good, probability: 0.0070
+
+ Class: phishing, probability: 0.9930
+
+
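In practice you often want a single decision per URL rather than two probabilities. The sketch below turns the per-class scores from the example above into a verdict using a cutoff on the `LABEL_1` (phishing) score. The `verdict` helper and the 0.5 default threshold are assumptions for illustration, not part of the model or of the `transformers` API. The sample inputs are the scores printed above, so the snippet runs without downloading the model.

```python
from typing import List


def verdict(scores: List[dict], threshold: float = 0.5) -> str:
    """Map the per-class scores for one URL to "phishing" or "good".

    `scores` has the shape of `results[0]` in the example above:
    [{"label": "LABEL_0", "score": ...}, {"label": "LABEL_1", "score": ...}]
    """
    by_label = {item["label"]: item["score"] for item in scores}
    phishing_score = by_label["LABEL_1"]  # LABEL_1 is the phishing class
    return "phishing" if phishing_score >= threshold else "good"


# Scores taken from the sample output above.
samples = {
    "huggingface.co/": [
        {"label": "LABEL_0", "score": 0.9723},
        {"label": "LABEL_1", "score": 0.0277},
    ],
    "hu991ngface.com.ru/": [
        {"label": "LABEL_0", "score": 0.0070},
        {"label": "LABEL_1", "score": 0.9930},
    ],
}

for url, scores in samples.items():
    print(f"{url} -> {verdict(scores)}")
```

Raising the threshold makes the helper flag fewer URLs, which generally trades recall for precision. A suitable value depends on how costly false alarms are in your setting.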