---
datasets:
- pierreguillou/DocLayNet-base
metrics:
- accuracy
base_model:
- google/vit-base-patch16-224-in21k
library_name: transformers
tags:
- vision
- document-layout-analysis
- document-classification
- vit
- doclaynet
---

# Vision Transformer (ViT) for Document Classification (DocLayNet)

This model is a fine-tuned Vision Transformer (ViT) for document layout classification, based on the DocLayNet dataset. It was trained on images from the DocLayNet document categories. The categories, with their indexes, are:

```python
{'financial_reports': 0,
 'government_tenders': 1,
 'laws_and_regulations': 2,
 'manuals': 3,
 'patents': 4,
 'scientific_articles': 5}
```

## Model description

This model is built on the `google/vit-base-patch16-224-in21k` Vision Transformer architecture and fine-tuned specifically for document layout classification. The base ViT model uses a patch size of 16x16 pixels and was pre-trained on ImageNet-21k. The fine-tuned model recognizes and classifies the different types of document layouts in the DocLayNet dataset.

## Training data

The model was trained on the DocLayNet-base dataset, which is available on the Hugging Face Hub: [pierreguillou/DocLayNet-base](https://huggingface.co/datasets/pierreguillou/DocLayNet-base)

DocLayNet is a comprehensive dataset for document layout analysis, containing various document types and their corresponding layout annotations.

## Training procedure

Trained for 10 epochs on a single GPU in about 10 minutes.

The training hyperparameters:

```python
{
    'batch_size': 64,
    'num_epochs': 20,
    'learning_rate': 1e-4,
    'weight_decay': 0.05,
    'warmup_ratio': 0.2,
    'gradient_clip': 0.1,
    'dropout_rate': 0.1,
    'label_smoothing': 0.1,
    'optimizer': 'AdamW'
}
```

## Evaluation results

The model achieved the following performance metrics on the test set:

Test Loss: 0.8622

Test Accuracy: 81.36%

## Usage

```python
from transformers import pipeline

# Load the model using the image-classification pipeline
pipe = pipeline("image-classification", model="kaixkhazaki/vit_doclaynet_base")

# Test it with an image
result = pipe("path_to_image.jpg")
print(result)
```
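
The pipeline result can be post-processed into a single category name. The following is a minimal sketch, not part of the original model card. It assumes the result is a list of dicts with `label` and `score` keys (the usual output shape of the image-classification pipeline) and that labels may arrive either as category names or in the generic `LABEL_<index>` form. The helper names `readable_label` and `top_prediction`, the `min_score` threshold, and the example scores are all hypothetical and chosen for illustration.

```python
# Category-to-index mapping copied from this model card.
LABEL2ID = {
    'financial_reports': 0,
    'government_tenders': 1,
    'laws_and_regulations': 2,
    'manuals': 3,
    'patents': 4,
    'scientific_articles': 5,
}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}


def readable_label(label):
    """Map a generic 'LABEL_<index>' string to a category name.

    Labels that are not in that form are returned unchanged.
    """
    if label.startswith("LABEL_"):
        suffix = label[len("LABEL_"):]
        if suffix.isdigit() and int(suffix) in ID2LABEL:
            return ID2LABEL[int(suffix)]
    return label


def top_prediction(result, min_score=0.5):
    """Return (category, score) for the best prediction, or None.

    `result` is assumed to be a list of {'label': str, 'score': float} dicts.
    `min_score` is an illustrative threshold, not a tuned value.
    """
    if not result:
        return None
    best = max(result, key=lambda item: item["score"])
    if best["score"] < min_score:
        return None
    return readable_label(best["label"]), best["score"]


# Hypothetical output, made up for illustration (not a real model prediction).
example = [
    {"label": "LABEL_4", "score": 0.91},
    {"label": "LABEL_5", "score": 0.06},
]
print(top_prediction(example))
```

In practice, `top_prediction(pipe("path_to_image.jpg"))` would replace the hand-written `example` list. Returning `None` below the threshold is one way to flag pages the model is unsure about, which may be worth doing given the reported test accuracy of 81.36%.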