metadata
datasets:
  - ealvaradob/phishing-dataset
language:
  - en
base_model:
  - CrabInHoney/urlbert-tiny-base-v3
pipeline_tag: text-classification
tags:
  - url
  - urls
  - links
  - classification
  - tiny
  - phishing
  - urlbert
license: apache-2.0

This is a very small BERT model that classifies URLs as phishing or non-phishing.

It is an updated, lighter version of the earlier classification model for URL analysis.

Old version: https://huggingface.co/CrabInHoney/urlbert-tiny-v2-phishing-classifier

Comparison with the previous version of the urlbert phishing classifier:

| Version | Accuracy | Precision | Recall | F1-score |
|---------|----------|-----------|--------|----------|
| v2      | 0.9665   | 0.9756    | 0.9522 | 0.9637   |
| v3      | 0.9819   | 0.9876    | 0.9734 | 0.9805   |
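
The F1-score column can be sanity-checked against the precision and recall columns, since F1 is their harmonic mean. The sketch below recomputes it from the rounded values in the comparison above. The helper name `f1_from` is purely illustrative, and small differences in the fourth decimal place are expected because the inputs are already rounded.

```python
# Recompute F1 from the rounded precision/recall values in the comparison.
# `f1_from` is a hypothetical helper name used only for this illustration.

def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the comparison above.
reported = {
    "v2": (0.9756, 0.9522, 0.9637),
    "v3": (0.9876, 0.9734, 0.9805),
}

for version, (p, r, f1_reported) in reported.items():
    f1 = f1_from(p, r)
    # The inputs are rounded to 4 decimals, so only approximate agreement is expected.
    print(f"{version}: recomputed F1 = {f1:.4f}, reported = {f1_reported:.4f}")
```

For both versions the recomputed value agrees with the reported one to within about 1e-4.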

Model size: 3.69M params

Tensor type: F32
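
As a rough guide to what these two figures imply, the weights-only memory footprint can be estimated as parameters times bytes per parameter. This is a back-of-the-envelope sketch: it uses the rounded "3.69M" figure (the exact parameter count is not given here) and ignores activations, tokenizer files, and framework overhead.

```python
# Rough weights-only memory estimate from the listed model size and tensor type.
NUM_PARAMS = 3.69e6   # "3.69M params" as listed; a rounded figure, not an exact count
BYTES_PER_PARAM = 4   # F32 is a 32-bit float, i.e. 4 bytes per parameter

weight_bytes = NUM_PARAMS * BYTES_PER_PARAM
print(f"~{weight_bytes / 1e6:.2f} MB ({weight_bytes / 2**20:.2f} MiB) of weights")
```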

Dataset: ealvaradob/phishing-dataset (urls.json only)

Example:

from transformers import BertTokenizerFast, BertForSequenceClassification, pipeline
import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

model_name = "CrabInHoney/urlbert-tiny-v3-phishing-classifier"

tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)
model.to(device)

# return_all_scores=True returns the score of every class, not only the top one.
# Recent transformers releases deprecate it in favor of top_k=None.
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,
    return_all_scores=True
)

test_urls = [
    "huggingface.co/",
    "hu991ngface.com.ru/"
]

# Map the model's raw labels to readable class names.
label_mapping = {"LABEL_0": "good", "LABEL_1": "phishing"}

for url in test_urls:
    results = classifier(url)
    print(f"\nURL: {url}")
    # results[0] holds one {'label', 'score'} dict per class for this URL.
    for result in results[0]:
        label = result['label']
        score = result['score']
        friendly_label = label_mapping.get(label, label)
        print(f"Class: {friendly_label}, probability: {score:.4f}")

Output:

Using device: cuda

URL: huggingface.co/
Class: good, probability: 0.9723
Class: phishing, probability: 0.0277

URL: hu991ngface.com.ru/
Class: good, probability: 0.0070
Class: phishing, probability: 0.9930
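
For downstream use it is often convenient to reduce the per-class scores to a single decision. The following is a minimal sketch and not part of the original model card: `phishing_probability` and `is_phishing` are hypothetical helpers, and the 0.5 threshold is an arbitrary assumption that should be tuned on your own data. It assumes `LABEL_1` is the phishing class, as in the example's label mapping. The sample scores are copied from the output above, so the snippet runs without downloading the model.

```python
from typing import Dict, List

PHISHING_LABEL = "LABEL_1"  # assumed phishing class, per the example's label mapping


def phishing_probability(scores: List[Dict]) -> float:
    """Return the phishing-class score from one URL's per-class scores."""
    for entry in scores:
        if entry["label"] == PHISHING_LABEL:
            return entry["score"]
    raise ValueError(f"{PHISHING_LABEL} not found in the scores")


def is_phishing(scores: List[Dict], threshold: float = 0.5) -> bool:
    """Flag a URL when its phishing score reaches the (assumed) threshold."""
    return phishing_probability(scores) >= threshold


# Per-class scores copied from the example output, in the pipeline's dict format.
sample_outputs = {
    "huggingface.co/": [
        {"label": "LABEL_0", "score": 0.9723},
        {"label": "LABEL_1", "score": 0.0277},
    ],
    "hu991ngface.com.ru/": [
        {"label": "LABEL_0", "score": 0.0070},
        {"label": "LABEL_1", "score": 0.9930},
    ],
}

for url, scores in sample_outputs.items():
    verdict = "phishing" if is_phishing(scores) else "good"
    print(f"{url} -> {verdict}")
```

With a live pipeline, the list passed to these helpers would be `classifier(url)[0]` from the example. A higher threshold trades recall for fewer false alarms on legitimate URLs.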