---
license: mit
language:
- en
---

# Model Card for Pebblo Classifier

This model card describes the Pebblo Classifier, a text classification model developed by DAXA.AI. It categorizes agreement documents found within organizations and was trained on 20 distinct labels.

## Model Details

### Model Description

The Pebblo Classifier is a DistilBERT-based model, fine-tuned from distilbert-base-uncased, targeting RAG (Retrieval-Augmented Generation) applications. It classifies text into categories such as "BOARD_MEETING_AGREEMENT" and "CONSULTING_AGREEMENT", streamlining document classification.

- **Developed by:** DAXA.AI
- **Funded by:** Open Source
- **Model type:** Classification model
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** distilbert-base-uncased

### Model Sources

- **Repository:** [https://huggingface.co/daxa-ai/pebblo-classifier](https://huggingface.co/daxa-ai/pebblo-classifier)
- **Demo:** [https://huggingface.co/spaces/daxa-ai/Daxa-Classifier](https://huggingface.co/spaces/daxa-ai/Daxa-Classifier)

## Uses

### Intended Use

The model is designed for direct use in document classification and can be deployed without additional fine-tuning.

### Recommendations

Users should be aware of the model's potential biases and limitations before relying on its predictions.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
# Import necessary libraries
import joblib
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hugging Face repository that holds the model and the label encoder
REPO_NAME = "daxa-ai/pebblo-classifier"
# Name of the label encoder file in the repository
LABEL_ENCODER_FILE = "label encoder.joblib"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(REPO_NAME)
model = AutoModelForSequenceClassification.from_pretrained(REPO_NAME)

# Example text
text = "Please enter your text here."
# Tokenize, truncating inputs that exceed the model's maximum length
encoded_input = tokenizer(text, return_tensors="pt", truncation=True)

# Run inference without tracking gradients
with torch.no_grad():
    output = model(**encoded_input)

# Apply softmax to the logits
probabilities = torch.nn.functional.softmax(output.logits, dim=-1)

# Get the index of the predicted label
predicted_label = torch.argmax(probabilities, dim=-1)

# Download and cache the label encoder file
filename = hf_hub_download(repo_id=REPO_NAME, filename=LABEL_ENCODER_FILE)

# Load the label encoder
label_encoder = joblib.load(filename)

# Decode the predicted label
decoded_label = label_encoder.inverse_transform(predicted_label.numpy())
print(decoded_label)
```

## Training Details

### Training Data

The training dataset consists of 131,771 entries across 20 unique labels. The labels span various document types, with instances distributed across three text sizes (128 ± x, 256 ± x, and 512 ± x words, where x varies within 20).

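Because the training instances cluster around three word counts, one way to prepare a longer document for classification is to split it into chunks of a comparable size. The following is a minimal sketch under that assumption, not part of the original training or inference pipeline; the helper name `chunk_by_words` and the default size of 256 words are hypothetical choices made for illustration.

```python
from typing import List


def chunk_by_words(text: str, size: int = 256) -> List[str]:
    """Split text into consecutive chunks of at most `size` whitespace-separated words.

    Hypothetical helper: the chunk size mirrors one of the word-count bands
    described for the training data, but it is an assumption, not a requirement.
    """
    if size <= 0:
        raise ValueError("size must be a positive integer")
    words = text.split()
    # Step through the word list in strides of `size` and rejoin each slice.
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# Example: a 600-word document yields chunks of 256, 256, and 88 words.
document = " ".join(["word"] * 600)
chunks = chunk_by_words(document, size=256)
chunk_lengths = [len(chunk.split()) for chunk in chunks]
print(chunk_lengths)
```

Each chunk could then be passed to the tokenizer and model separately. Word counts and token counts differ, so a chunk may still be truncated by the tokenizer.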
The labels and their instance counts in the training set:

| Label                                   | Instances |
| --------------------------------------- | --------- |
| BOARD_MEETING_AGREEMENT                 | 4,225     |
| CONSULTING_AGREEMENT                    | 2,965     |
| CUSTOMER_LIST_AGREEMENT                 | 9,000     |
| DISTRIBUTION_PARTNER_AGREEMENT          | 8,339     |
| EMPLOYEE_AGREEMENT                      | 3,921     |
| ENTERPRISE_AGREEMENT                    | 3,820     |
| ENTERPRISE_LICENSE_AGREEMENT            | 9,000     |
| EXECUTIVE_SEVERANCE_AGREEMENT           | 9,000     |
| FINANCIAL_REPORT_AGREEMENT              | 8,381     |
| HARMFUL_ADVICE                          | 2,025     |
| INTERNAL_PRODUCT_ROADMAP_AGREEMENT      | 7,037     |
| LOAN_AND_SECURITY_AGREEMENT             | 9,000     |
| MEDICAL_ADVICE                          | 2,359     |
| MERGER_AGREEMENT                        | 7,706     |
| NDA_AGREEMENT                           | 2,966     |
| NORMAL_TEXT                             | 6,742     |
| PATENT_APPLICATION_FILLINGS_AGREEMENT   | 9,000     |
| PRICE_LIST_AGREEMENT                    | 9,000     |
| SETTLEMENT_AGREEMENT                    | 9,000     |
| SEXUAL_HARRASSMENT                      | 8,321     |

## Evaluation

### Testing Data & Metrics

#### Testing Data

Evaluation was performed on a dataset of 82,916 entries with a temperature range of 1-1.25 for randomness.

The labels and their instance counts in the test set:

| Label                                   | Instances |
| --------------------------------------- | --------- |
| BOARD_MEETING_AGREEMENT                 | 4,335     |
| CONSULTING_AGREEMENT                    | 1,533     |
| CUSTOMER_LIST_AGREEMENT                 | 4,995     |
| DISTRIBUTION_PARTNER_AGREEMENT          | 7,231     |
| EMPLOYEE_AGREEMENT                      | 1,433     |
| ENTERPRISE_AGREEMENT                    | 1,616     |
| ENTERPRISE_LICENSE_AGREEMENT            | 8,574     |
| EXECUTIVE_SEVERANCE_AGREEMENT           | 5,177     |
| FINANCIAL_REPORT_AGREEMENT              | 4,264     |
| HARMFUL_ADVICE                          | 474       |
| INTERNAL_PRODUCT_ROADMAP_AGREEMENT      | 4,116     |
| LOAN_AND_SECURITY_AGREEMENT             | 6,354     |
| MEDICAL_ADVICE                          | 289       |
| MERGER_AGREEMENT                        | 7,079     |
| NDA_AGREEMENT                           | 1,452     |
| NORMAL_TEXT                             | 1,808     |
| PATENT_APPLICATION_FILLINGS_AGREEMENT   | 6,177     |
| PRICE_LIST_AGREEMENT                    | 5,453     |
| SETTLEMENT_AGREEMENT                    | 5,806     |
| SEXUAL_HARRASSMENT                      | 4,750     |

#### Metrics

Support values are the per-label test set counts listed under Testing Data.

| Label                                   | precision | recall | f1-score | support |
| --------------------------------------- | --------- | ------ | -------- | ------- |
| BOARD_MEETING_AGREEMENT                 | 0.93      | 0.95   | 0.94     | 4335    |
| CONSULTING_AGREEMENT                    | 0.72      | 0.98   | 0.84     | 1533    |
| CUSTOMER_LIST_AGREEMENT                 | 0.64      | 0.82   | 0.72     | 4995    |
| DISTRIBUTION_PARTNER_AGREEMENT          | 0.83      | 0.47   | 0.61     | 7231    |
| EMPLOYEE_AGREEMENT                      | 0.78      | 0.92   | 0.85     | 1433    |
| ENTERPRISE_AGREEMENT                    | 0.29      | 0.40   | 0.34     | 1616    |
| ENTERPRISE_LICENSE_AGREEMENT            | 0.88      | 0.79   | 0.83     | 8574    |
| EXECUTIVE_SEVERANCE_AGREEMENT           | 0.92      | 0.85   | 0.89     | 5177    |
| FINANCIAL_REPORT_AGREEMENT              | 0.89      | 0.98   | 0.93     | 4264    |
| HARMFUL_ADVICE                          | 0.79      | 0.95   | 0.86     | 474     |
| INTERNAL_PRODUCT_ROADMAP_AGREEMENT      | 0.91      | 0.98   | 0.94     | 4116    |
| LOAN_AND_SECURITY_AGREEMENT             | 0.77      | 0.98   | 0.86     | 6354    |
| MEDICAL_ADVICE                          | 0.81      | 0.99   | 0.89     | 289     |
| MERGER_AGREEMENT                        | 0.89      | 0.77   | 0.83     | 7079    |
| NDA_AGREEMENT                           | 0.70      | 0.57   | 0.62     | 1452    |
| NORMAL_TEXT                             | 0.79      | 0.97   | 0.87     | 1808    |
| PATENT_APPLICATION_FILLINGS_AGREEMENT   | 0.95      | 0.99   | 0.97     | 6177    |
| PRICE_LIST_AGREEMENT                    | 0.60      | 0.75   | 0.67     | 5453    |
| SETTLEMENT_AGREEMENT                    | 0.82      | 0.54   | 0.65     | 5806    |
| SEXUAL_HARASSMENT                       | 0.97      | 0.94   | 0.95     | 4750    |
|                                         |           |        |          |         |
| accuracy                                |           |        | 0.81     | 82916   |
| macro avg                               | 0.79      | 0.83   | 0.80     | 82916   |
| weighted avg                            | 0.83      | 0.81   | 0.81     | 82916   |

#### Results

The table above reports precision, recall, and f1-score for each of the 20 labels. Overall accuracy on the test set is 0.81, which matches the support-weighted recall, as it must for single-label classification. The macro averages of precision, recall, and f1-score are 0.79, 0.83, and 0.80, and the weighted averages are 0.83, 0.81, and 0.81. Performance varies widely by label: PATENT_APPLICATION_FILLINGS_AGREEMENT reaches an f1-score of 0.97, while ENTERPRISE_AGREEMENT reaches only 0.34.
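
The macro and weighted summary rows can be recomputed from the per-label values. The sketch below is a minimal illustration, not the original evaluation script: it takes precision, recall, and f1-score from the metrics table and the per-label support from the Testing Data table, then averages them. The names `ROWS`, `macro_avg`, `weighted_avg`, and `summary` are chosen here for illustration. Because the inputs are already rounded to two decimals, the recomputed averages are approximate.

```python
# (label, precision, recall, f1-score, support)
# Scores come from the metrics table; support comes from the Testing Data table.
ROWS = [
    ("BOARD_MEETING_AGREEMENT", 0.93, 0.95, 0.94, 4335),
    ("CONSULTING_AGREEMENT", 0.72, 0.98, 0.84, 1533),
    ("CUSTOMER_LIST_AGREEMENT", 0.64, 0.82, 0.72, 4995),
    ("DISTRIBUTION_PARTNER_AGREEMENT", 0.83, 0.47, 0.61, 7231),
    ("EMPLOYEE_AGREEMENT", 0.78, 0.92, 0.85, 1433),
    ("ENTERPRISE_AGREEMENT", 0.29, 0.40, 0.34, 1616),
    ("ENTERPRISE_LICENSE_AGREEMENT", 0.88, 0.79, 0.83, 8574),
    ("EXECUTIVE_SEVERANCE_AGREEMENT", 0.92, 0.85, 0.89, 5177),
    ("FINANCIAL_REPORT_AGREEMENT", 0.89, 0.98, 0.93, 4264),
    ("HARMFUL_ADVICE", 0.79, 0.95, 0.86, 474),
    ("INTERNAL_PRODUCT_ROADMAP_AGREEMENT", 0.91, 0.98, 0.94, 4116),
    ("LOAN_AND_SECURITY_AGREEMENT", 0.77, 0.98, 0.86, 6354),
    ("MEDICAL_ADVICE", 0.81, 0.99, 0.89, 289),
    ("MERGER_AGREEMENT", 0.89, 0.77, 0.83, 7079),
    ("NDA_AGREEMENT", 0.70, 0.57, 0.62, 1452),
    ("NORMAL_TEXT", 0.79, 0.97, 0.87, 1808),
    ("PATENT_APPLICATION_FILLINGS_AGREEMENT", 0.95, 0.99, 0.97, 6177),
    ("PRICE_LIST_AGREEMENT", 0.60, 0.75, 0.67, 5453),
    ("SETTLEMENT_AGREEMENT", 0.82, 0.54, 0.65, 5806),
    ("SEXUAL_HARRASSMENT", 0.97, 0.94, 0.95, 4750),
]


def macro_avg(values):
    """Unweighted mean: every label counts equally."""
    return sum(values) / len(values)


def weighted_avg(values, weights):
    """Mean weighted by support: larger classes count more."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


supports = [row[4] for row in ROWS]
total_support = sum(supports)

# Map each metric name to its (macro average, weighted average).
summary = {}
for name, column in (("precision", 1), ("recall", 2), ("f1-score", 3)):
    values = [row[column] for row in ROWS]
    summary[name] = (macro_avg(values), weighted_avg(values, supports))

print(f"total support: {total_support}")
for name, (macro, weighted) in summary.items():
    print(f"{name:<10} macro={macro:.2f} weighted={weighted:.2f}")
```

The macro average treats all 20 labels equally, so small classes such as MEDICAL_ADVICE influence it as much as large ones. The weighted average reflects the label distribution of the test set.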