Vision Transformer (ViT) for Document Classification (DocLayNet)

This model is a Vision Transformer (ViT) fine-tuned for document classification on the DocLayNet dataset.

It was trained on document images from the DocLayNet dataset, covering the following categories (with their label indexes):

{'financial_reports': 0,
 'government_tenders': 1,
 'laws_and_regulations': 2,
 'manuals': 3,
 'patents': 4,
 'scientific_articles': 5} 

Model description

This model is built upon the google/vit-base-patch16-224-in21k Vision Transformer architecture and fine-tuned for document classification. The base ViT model uses a patch size of 16x16 pixels and was pre-trained on ImageNet-21k. It has been fine-tuned to recognize and classify the document categories in the DocLayNet dataset.
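
As a hedged illustration (not the original training script), this is roughly how the base checkpoint can be adapted to the six document categories listed above; the variable names are placeholders:

from transformers import ViTForImageClassification, ViTImageProcessor

# Label mapping taken from the category list above
label2id = {
    'financial_reports': 0,
    'government_tenders': 1,
    'laws_and_regulations': 2,
    'manuals': 3,
    'patents': 4,
    'scientific_articles': 5,
}
id2label = {v: k for k, v in label2id.items()}

# Start from the ImageNet-21k pre-trained ViT base checkpoint and attach
# a fresh 6-class classification head for the document categories.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=len(label2id),
    id2label=id2label,
    label2id=label2id,
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")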

Training data

The model was trained on the DocLayNet-base dataset, which is available on the Hugging Face Hub: pierreguillou/DocLayNet-base

DocLayNet is a comprehensive dataset for document layout analysis, containing various document types and their corresponding layout annotations.
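
A minimal sketch of loading the dataset from the Hub (the split and column names are not documented here, so the code only inspects what is available):

from datasets import load_dataset

# Load the DocLayNet-base dataset from the Hugging Face Hub
dataset = load_dataset("pierreguillou/DocLayNet-base")

# Inspect the available splits and the fields of one training example
print(dataset)
print(dataset["train"][0].keys())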

Training procedure

Trained for 10 epochs on a single GPU in roughly 10 minutes.

The training hyperparameters:

{
    'batch_size': 64,
    'num_epochs': 20,
    'learning_rate': 1e-4,
    'weight_decay': 0.05,
    'warmup_ratio': 0.2,
    'gradient_clip': 0.1,
    'dropout_rate': 0.1,
    'label_smoothing': 0.1,
    'optimizer': 'AdamW'
}
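
As an illustrative sketch only, these values map onto transformers TrainingArguments roughly as follows (the output directory is a placeholder, and the dropout rate belongs to the model config rather than the training arguments):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vit-doclaynet",        # placeholder output directory
    per_device_train_batch_size=64,    # batch_size
    num_train_epochs=20,               # num_epochs
    learning_rate=1e-4,                # learning_rate
    weight_decay=0.05,                 # weight_decay
    warmup_ratio=0.2,                  # warmup_ratio
    max_grad_norm=0.1,                 # gradient_clip
    label_smoothing_factor=0.1,        # label_smoothing
    optim="adamw_torch",               # optimizer: AdamW
)
# dropout_rate corresponds to a model-side setting (e.g. hidden_dropout_prob
# in the ViT config), not a TrainingArguments field.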

Evaluation results

The model achieved the following performance metrics on the test set:

Test loss: 0.8622
Test accuracy: 81.36%
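
For reference, a minimal sketch of how accuracy can be computed with the evaluate library (this is an assumption about the metric computation, not necessarily the script that produced the numbers above):

import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # Convert logits to predicted class ids and compare against the labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)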

Usage

from transformers import pipeline

# Load the model using the image-classification pipeline
pipe = pipeline("image-classification", model="kaixkhazaki/vit_doclaynet_base")

# Test it with an image
result = pipe("path_to_image.jpg")
print(result)
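
Alternatively, the model can be used without the pipeline helper; the sketch below assumes the image processor is bundled with the checkpoint:

import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

processor = ViTImageProcessor.from_pretrained("kaixkhazaki/vit_doclaynet_base")
model = ViTForImageClassification.from_pretrained("kaixkhazaki/vit_doclaynet_base")

# Preprocess a document image and run a forward pass
image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring logit back to its document category
predicted_id = logits.argmax(-1).item()
print(model.config.id2label[predicted_id])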