rahulkothuri/phishing-email-disilBERT

DistilBERT Phishing Email Classifier

Model Overview

This is a fine-tuned DistilBERT model for phishing email detection. The model identifies whether an email is a phishing attempt based on its text. Phishing emails are malicious messages designed to deceive recipients into revealing sensitive information or taking unsafe actions. This model aims to assist in automatically detecting such emails to enhance cybersecurity.

Dataset

The model was fine-tuned on the Phishing Emails Dataset, available on Kaggle: https://www.kaggle.com/datasets/subhajournal/phishingemails

Description: The dataset contains legitimate and phishing email samples with their corresponding labels.
Coverage: The dataset includes a range of phishing and non-phishing emails.
Structure:
- Inputs: Email content.
- Labels: Binary classification (0 = Legitimate, 1 = Phishing).

The dataset was cleaned and preprocessed to remove noise, and tokenized using the DistilBERT tokenizer for training.
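The card does not list the exact cleaning steps, so the following is only a minimal sketch of what such preprocessing could look like, assuming "noise" means HTML tags and irregular whitespace. The function name `clean_email` and both rules are illustrative assumptions, not the author's pipeline.

```python
import re

# Hypothetical cleaning rules; the original preprocessing is not documented.
_TAG_RE = re.compile(r"<[^>]+>")
_WS_RE = re.compile(r"\s+")


def clean_email(text: str) -> str:
    """Strip HTML tags and collapse whitespace in an email body."""
    no_tags = _TAG_RE.sub(" ", text)
    return _WS_RE.sub(" ", no_tags).strip()


print(clean_email("<p>Verify your   account\n<b>now</b></p>"))
```

Text cleaned this way would then go to the DistilBERT tokenizer, which the `pipeline` helper applies automatically at inference time.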

Training Details

Base Model: DistilBERT
Training Framework: Hugging Face Transformers.
Training Hyperparameters:
- Epochs: 3
- Batch Size: Not reported
- Learning Rate: Not reported

Performance Metrics

Final Results:
- Validation Loss: 0.0266
- Validation Accuracy: 99.39%
- Evaluation Runtime: 12.73 seconds
- Samples per Second: 345.02
- Steps per Second: 5.42
Epoch-wise Metrics:

Epoch	Training Loss	Validation Loss	Validation Accuracy
1	0.0015	0.0329	98.91%
2	0.0006	0.0328	99.18%
3	0.0003	0.0266	99.39%

Validation accuracy rose from 98.91% to 99.39% and validation loss fell from 0.0329 to 0.0266 over the three epochs, indicating that fine-tuning continued to improve the model on held-out data.
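The reported runtime and throughput figures also allow a rough estimate of the validation set size, which the card does not state directly. The sketch below is back-of-the-envelope arithmetic on the rounded numbers from Final Results; the derived values are estimates, not figures reported by the author.

```python
# Figures copied from the Final Results list.
eval_runtime_s = 12.73
samples_per_s = 345.02
steps_per_s = 5.42
val_accuracy = 0.9939

n_samples = round(eval_runtime_s * samples_per_s)       # estimated validation emails
n_steps = round(eval_runtime_s * steps_per_s)           # estimated evaluation batches
approx_batch = n_samples / n_steps                      # estimated emails per batch
approx_errors = round(n_samples * (1 - val_accuracy))   # estimated misclassified emails

print(n_samples, n_steps, round(approx_batch, 1), approx_errors)
```

Under these assumptions the validation set holds roughly 4,400 emails, of which a few dozen are misclassified at 99.39% accuracy.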

How to Use

from transformers import pipeline

# Load the fine-tuned model as a text-classification pipeline
pipe = pipeline("text-classification", model="rahulkothuri/phishing-email-disilBERT")

# Input: email text to classify
email_content = "Why do employees leave companies - analysis of IBM employee data"
# top_k=None returns a score for every label, not only the top one
output = pipe(email_content, top_k=None)

print(output)
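To turn the pipeline output into a single decision, the phishing score can be pulled out by label name. The helper below is a sketch: the name `phishing_score` is hypothetical, and the label string `"LABEL_1"` is an assumption based on the 0 = Legitimate, 1 = Phishing mapping; check `pipe.model.config.id2label` for the names this model actually emits. The mock output stands in for a real pipeline result so the example runs without downloading the model.

```python
def phishing_score(output, phishing_label: str = "LABEL_1") -> float:
    """Return the score assigned to the phishing label.

    Accepts either a flat list of {"label", "score"} dicts or the
    nested list-of-lists form that some transformers versions return.
    """
    # Unwrap [[{...}, {...}]] into [{...}, {...}] when needed.
    if output and isinstance(output[0], list):
        output = output[0]
    scores = {item["label"]: item["score"] for item in output}
    if phishing_label not in scores:
        raise KeyError(f"{phishing_label!r} not in {sorted(scores)}")
    return scores[phishing_label]


# Mock output shaped like a pipeline result; the numbers are made up.
mock_output = [
    {"label": "LABEL_0", "score": 0.97},
    {"label": "LABEL_1", "score": 0.03},
]
is_phishing = phishing_score(mock_output) >= 0.5
print(is_phishing)
```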

Limitations

Biases in the Dataset: The dataset may not cover all variations of phishing emails, potentially leading to lower accuracy on unseen types of phishing attacks.
Language Limitations: This model is trained on English emails and may not perform well with emails in other languages.
Context Understanding: The model classifies the email text only. It does not inspect link destinations, attachments, or other signals outside the text.

Ethical Considerations

False Positives: Legitimate emails classified as phishing could lead to inconvenience.
False Negatives: Failure to detect a phishing email could lead to security risks. Users are encouraged to use this model in conjunction with other security measures.
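The balance between these two error types depends on the decision threshold applied to the phishing score. The sketch below counts false positives and false negatives at a few thresholds; the scores and labels are made-up illustrative data, and `error_counts` is a hypothetical helper, not part of the model or of any library.

```python
from typing import List, Tuple


def error_counts(scores: List[float], labels: List[int], threshold: float) -> Tuple[int, int]:
    """Count (false_positives, false_negatives) at a given threshold.

    Labels follow the card's convention: 0 = legitimate, 1 = phishing.
    """
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp, fn


# Made-up phishing scores and true labels, for illustration only.
scores = [0.05, 0.20, 0.45, 0.60, 0.35, 0.65, 0.95]
labels = [0, 0, 0, 0, 1, 1, 1]

for t in (0.3, 0.5, 0.7):
    print(t, error_counts(scores, labels, t))
```

On this toy data, lowering the threshold removes missed phishing emails at the cost of flagging more legitimate ones, and raising it does the reverse. A deployment would pick the threshold on its own validation data.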

Citation

If you use this model, please cite the dataset:

@dataset{kaggle_phishing_emails,
  author = {Subha Journal},
  title = {Phishing Emails Dataset},
  year = {2022},
  url = {https://www.kaggle.com/datasets/subhajournal/phishingemails}
}
