DistilBERT Phishing Email Classifier


Model Overview

This is a fine-tuned DistilBERT model for phishing email detection. The model identifies whether an email is a phishing attempt based on its text. Phishing emails are malicious messages designed to deceive recipients into revealing sensitive information or taking unsafe actions. This model aims to assist in automatically detecting such emails to enhance cybersecurity.


Dataset

The model was fine-tuned on the Phishing Emails Dataset, available on Kaggle at https://www.kaggle.com/datasets/subhajournal/phishingemails.

  • Description: The dataset contains legitimate and phishing email samples with their corresponding labels.
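  • Size: The total number of emails is not stated here. The evaluation figures reported under Performance Metrics (12.73 seconds at 345.02 samples per second) imply a validation split of about 4,390 emails.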
  • Structure:
    • Inputs: Email content.
    • Labels: Binary classification (0 = Legitimate, 1 = Phishing).

The dataset was cleaned and preprocessed to remove noise, and tokenized using the DistilBERT tokenizer for training.
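
The card does not say which cleaning steps were applied. Purely as an illustration, noise removal before tokenization might look something like the sketch below. The function name `clean_email_text` and the specific steps (unescaping HTML entities, dropping tags, collapsing whitespace) are assumptions, not the author's pipeline.

```python
import html
import re


def clean_email_text(raw: str) -> str:
    """Hypothetical cleaning step; not taken from the original training code."""
    text = html.unescape(raw)             # "&amp;" -> "&", "&nbsp;" -> non-breaking space
    text = re.sub(r"<[^>]+>", " ", text)  # drop HTML tags, keep the text between them
    text = re.sub(r"\s+", " ", text)      # collapse whitespace runs, including newlines
    return text.strip()


# Made-up sample email body, used only to exercise the function
sample = "<p>Dear user,&nbsp;your   account\n\n is <b>locked</b></p>"
print(clean_email_text(sample))  # Dear user, your account is locked
```

Whatever cleaning is used at inference time should match what was used during training, since the model only saw text in the cleaned form.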


Training Details

  • Base Model: DistilBERT
  • Training Framework: Hugging Face Transformers.
  • Hyperparameters:
    • Epochs: 3
    • Batch Size: Not reported
    • Learning Rate: Not reported

Performance Metrics

  • Final Results:

    • Validation Loss: 0.0266
    • Validation Accuracy: 99.39%
    • Evaluation Runtime: 12.73 seconds
    • Samples per Second: 345.02
    • Steps per Second: 5.42
  • Epoch-wise Metrics:

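    | Epoch | Training Loss | Validation Loss | Validation Accuracy |
    |-------|---------------|-----------------|---------------------|
    | 1     | 0.0015        | 0.0329          | 98.91%              |
    | 2     | 0.0006        | 0.0328          | 99.18%              |
    | 3     | 0.0003        | 0.0266          | 99.39%              |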

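Validation accuracy rose from 98.91% to 99.39% and validation loss fell from 0.0329 to 0.0266 over the three epochs, which indicates that fine-tuning was effective. Training loss stayed well below validation loss throughout (0.0003 versus 0.0266 at epoch 3), so results on emails unlike those in the dataset may be weaker.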
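
The reported runtime figures also allow a rough estimate of the validation set size and the number of misclassified emails. The sketch below is back-of-the-envelope arithmetic on the rounded numbers above, so the results are approximate. The implied evaluation batch size is an inference from these figures, not a value stated by the author, and it says nothing about the training batch size.

```python
# Figures reported in the final evaluation results
eval_runtime_s = 12.73
samples_per_second = 345.02
steps_per_second = 5.42
val_accuracy = 0.9939

# count = rate * time, rounded because the inputs are themselves rounded
val_samples = round(samples_per_second * eval_runtime_s)
eval_steps = round(steps_per_second * eval_runtime_s)

# Average samples per step approximates the evaluation batch size
avg_batch = val_samples / eval_steps

# Emails misclassified at the reported accuracy
errors = round((1 - val_accuracy) * val_samples)

print(val_samples, eval_steps, round(avg_batch, 1), errors)  # 4392 69 63.7 27
```

An average of about 63.7 samples per step is consistent with an evaluation batch size of 64 with a smaller final batch. At 99.39% accuracy, roughly 27 of about 4,392 validation emails were misclassified.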


How to Use

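from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
pipe = pipeline("text-classification", model="rahulkothuri/phishing-email-disilBERT")

# Input: the email body as plain text
email_content = "Why do employees leave companies - analysis of IBM employee data"

# top_k=None returns a score for every label instead of only the top one;
# truncation=True cuts emails that exceed the model's maximum input length
output = pipe(email_content, top_k=None, truncation=True)

print(output)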
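
To turn the pipeline result into a yes/no decision, the per-label scores can be compared against a threshold. The following is a minimal sketch that works on the list-of-dicts shape returned with `top_k=None`. The helper names, the default threshold of 0.5, and the label name `LABEL_1` for the phishing class are assumptions; check `pipe.model.config.id2label` for the names this checkpoint actually uses.

```python
def phishing_probability(output, phishing_label="LABEL_1"):
    """Return the score of the phishing class from a text-classification result.

    `phishing_label` is an assumed name, not confirmed by the model card.
    """
    # Some transformers versions wrap a single input's result in an extra list
    if output and isinstance(output[0], list):
        output = output[0]
    scores = {item["label"]: item["score"] for item in output}
    return scores[phishing_label]


def is_phishing(output, threshold=0.5, phishing_label="LABEL_1"):
    """Flag the email when the phishing score reaches the threshold."""
    return phishing_probability(output, phishing_label) >= threshold


# Hand-written example in the shape returned by pipe(..., top_k=None);
# the scores are made up for illustration, not real model output.
example_output = [
    {"label": "LABEL_0", "score": 0.97},
    {"label": "LABEL_1", "score": 0.03},
]
print(is_phishing(example_output))                  # False
print(is_phishing(example_output, threshold=0.02))  # True
```

Lowering the threshold catches more phishing emails at the cost of flagging more legitimate ones, so the value is best chosen on held-out data.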

Limitations

  • Biases in the Dataset: The dataset may not cover all variations of phishing emails, potentially leading to lower accuracy on unseen types of phishing attacks.
  • Language Limitations: This model is trained on English emails and may not perform well with emails in other languages.
  • Context Understanding: The model relies solely on text and cannot account for contextual cues (e.g., links, attachments).

Ethical Considerations

  • False Positives: Legitimate emails classified as phishing could lead to inconvenience.
  • False Negatives: Failure to detect a phishing email could lead to security risks. Users are encouraged to use this model in conjunction with other security measures.
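
Accuracy alone does not show how errors split between these two cases. A minimal sketch for measuring them separately on a labeled evaluation set is shown below, using the card's convention of 0 = Legitimate and 1 = Phishing. The function name `error_rates` and the toy labels are illustrative assumptions, not results from this model.

```python
def error_rates(y_true, y_pred):
    """Compute false positive/negative rates, with 1 = phishing and 0 = legitimate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing missed
    return {
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Toy labels for illustration only; not taken from the validation set
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
print(error_rates(y_true, y_pred))
# {'false_positive_rate': 0.25, 'false_negative_rate': 0.25}
```

Tracking both rates makes it possible to decide which kind of error matters more for a given deployment.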

Citation

If you use this model, please cite the dataset:

@dataset{kaggle_phishing_emails,
  author = {Subha Journal},
  title = {Phishing Emails Dataset},
  year = {2022},
  url = {https://www.kaggle.com/datasets/subhajournal/phishingemails}
}

Model Files

  • Format: Safetensors
  • Model size: 65.8M params
  • Tensor type: F32