Visually Guided Generative Text-Layout Pre-training for Document Intelligence
Abstract
Prior studies show that pre-training techniques can boost the performance of visual document understanding (VDU), which typically requires models to perceive and reason over both document texts and layouts (e.g., locations of texts and table cells). To this end, we propose visually guided generative text-layout pre-training, named ViTLP. Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence. In addition, to address the limitation of Transformers in processing long documents, we introduce a straightforward yet effective multi-segment generative pre-training scheme, enabling ViTLP to process word-intensive documents of any length. ViTLP can function as a native OCR model to localize and recognize texts in document images. Besides, ViTLP can be effectively applied to various downstream VDU tasks. Extensive experiments show that ViTLP achieves competitive performance over existing baselines on benchmark VDU tasks, including information extraction, document classification, and document question answering.
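To make the abstract's two core ideas concrete, here is a minimal sketch of (a) an interleaved text-layout target sequence and (b) a multi-segment split for long documents. This is not the official ViTLP code: the `<loc_*>` token format, the 0-1000 coordinate quantization, and the segment length/overlap values are illustrative assumptions, not details taken from the paper.

```python
# Sketch only: illustrates interleaved text-layout targets and a naive
# multi-segment split. Token names and hyperparameters are assumptions.
from typing import List, Tuple

Word = Tuple[str, Tuple[float, float, float, float]]  # (text, normalized bbox x0, y0, x1, y1)

def quantize_bbox(bbox: Tuple[float, float, float, float], bins: int = 1000) -> List[int]:
    """Map normalized coordinates in [0, 1] to discrete layout-token indices."""
    return [min(int(c * bins), bins - 1) for c in bbox]

def build_interleaved_sequence(words: List[Word]) -> List[str]:
    """Interleave each word with its quantized bounding-box tokens so a decoder
    can generate text and layout jointly, as the abstract describes."""
    seq = ["<s>"]
    for text, bbox in words:
        seq.append(text)
        seq.extend(f"<loc_{v}>" for v in quantize_bbox(bbox))
    seq.append("</s>")
    return seq

def split_into_segments(seq: List[str], max_len: int = 512, overlap: int = 32) -> List[List[str]]:
    """Naive multi-segment split: each segment keeps a small overlap with the
    previous one as context, so arbitrarily long documents can be generated."""
    segments, start = [], 0
    while start < len(seq):
        segments.append(seq[start:start + max_len])
        if start + max_len >= len(seq):
            break
        start += max_len - overlap
    return segments

if __name__ == "__main__":
    words = [("Invoice", (0.10, 0.05, 0.30, 0.09)), ("Total:", (0.10, 0.80, 0.22, 0.84))]
    print(build_interleaved_sequence(words))
```

Because the target sequence carries both words and their locations, decoding it from a document image amounts to OCR (localize and recognize), which is why the model can also serve as a native OCR system.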
Community
Can anyone explain the difference between ViTLP and Microsoft's Azure AI Document Intelligence Layout model? Which one performs better? Also, is there an open-source alternative to the Azure AI Document Intelligence Layout model?
I am the author of ViTLP. I have no idea about "Microsoft's Azure AI Document Intelligence Layout model"; I had never heard of it. For the difference, you would have to ask the Microsoft Document Intelligence team behind the model you mentioned.
As stated in the ViTLP repo https://github.com/Veason-silverbullet/ViTLP, we submitted the first version of the ViTLP paper in June 2023 at https://openreview.net/forum?id=ARtBIBAmNR. Unfortunately, that review round produced some picky and malicious reviews, which is why ViTLP was not released to the public until this year.