---
datasets:
- ealvaradob/phishing-dataset
language:
- en
base_model:
- CrabInHoney/urlbert-tiny-base-v3
pipeline_tag: text-classification
tags:
- url
- urls
- links
- classification
- tiny
- phishing
- urlbert
license: apache-2.0
---
This is a very small BERT model that classifies URLs as phishing or non-phishing.
It is an updated, lighter version of the previous URL classification model.
Old version: https://huggingface.co/CrabInHoney/urlbert-tiny-v2-phishing-classifier
Comparison with the previous version of urlbert phishing-classifier:
Version | Accuracy | Precision | Recall | F1-score |
---|---|---|---|---|
v2 | 0.9665 | 0.9756 | 0.9522 | 0.9637 |
v3 | 0.9819 | 0.9876 | 0.9734 | 0.9805 |
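The F1-score column is the harmonic mean of precision and recall, so it can be sanity-checked against the other two columns. A minimal sketch follows; the function name `f1_from` is only illustrative, and small differences in the last digit are expected because the table values are already rounded:

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values copied from the comparison table.
table = {
    "v2": {"precision": 0.9756, "recall": 0.9522, "f1": 0.9637},
    "v3": {"precision": 0.9876, "recall": 0.9734, "f1": 0.9805},
}

for version, row in table.items():
    recomputed = f1_from(row["precision"], row["recall"])
    print(f"{version}: reported F1 = {row['f1']:.4f}, recomputed = {recomputed:.4f}")
```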
Model size: 3.69M params
Tensor type: F32
Dataset: ealvaradob/phishing-dataset (urls.json only)
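As a rough guide to the footprint: each F32 value takes 4 bytes, so 3.69M parameters come to roughly 15 MB of weights. A back-of-the-envelope sketch (weights only; the checkpoint file on disk and runtime memory will differ somewhat):

```python
PARAMS = 3.69e6        # parameter count reported for this model
BYTES_PER_PARAM = 4    # F32 = 32 bits = 4 bytes

# Decimal megabytes occupied by the raw weights.
weights_mb = PARAMS * BYTES_PER_PARAM / 1e6
print(f"Approximate weight size: {weights_mb:.2f} MB")
```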
Example:
from transformers import BertTokenizerFast, BertForSequenceClassification, pipeline
import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device in use: {device}")

model_name = "CrabInHoney/urlbert-tiny-v3-phishing-classifier"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)
model.to(device)

classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,
    # Return scores for every class. Newer transformers releases deprecate
    # this flag in favor of top_k=None, which changes the result nesting.
    return_all_scores=True
)

test_urls = [
    "huggingface.co/",
    "hu991ngface.com.ru/"
]

# Map the model's raw labels to readable names.
label_mapping = {"LABEL_0": "good", "LABEL_1": "phishing"}

for url in test_urls:
    results = classifier(url)
    print(f"\nURL: {url}")
    # results[0] holds one {'label', 'score'} dict per class for this URL.
    for result in results[0]:
        label = result['label']
        score = result['score']
        friendly_label = label_mapping.get(label, label)
        print(f"Class: {friendly_label}, probability: {score:.4f}")

Output:
Device in use: cuda

URL: huggingface.co/
Class: good, probability: 0.9723
Class: phishing, probability: 0.0277

URL: hu991ngface.com.ru/
Class: good, probability: 0.0070
Class: phishing, probability: 0.9930
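If a single verdict per URL is more convenient than the full score list, the per-class scores can be reduced with a threshold. The helper below is a sketch and not part of the model: `verdict` and `threshold` are hypothetical names, and it assumes the per-URL list of `{'label', 'score'}` dicts shown in the example, with `LABEL_1` as the phishing class.

```python
from typing import Dict, List


def verdict(scores: List[Dict], threshold: float = 0.5) -> str:
    """Return 'phishing' if the LABEL_1 score reaches the threshold, else 'good'."""
    by_label = {item["label"]: item["score"] for item in scores}
    return "phishing" if by_label.get("LABEL_1", 0.0) >= threshold else "good"


# Scores taken from the sample output for hu991ngface.com.ru/.
sample = [
    {"label": "LABEL_0", "score": 0.0070},
    {"label": "LABEL_1", "score": 0.9930},
]
print(verdict(sample))
```

Raising `threshold` trades recall for precision; a suitable value depends on how costly false alarms are in a given setting.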