Document AI, or Document Intelligence, is a new research topic that refers to techniques for automatically reading, understanding, and analyzing business documents. Understanding business documents is an incredibly challenging task due to the diversity of layouts and formats, the inferior quality of scanned document images, and the complexity of template structures.

Starting in 2019, we released two benchmark datasets, TableBank and DocBank, which are used for table detection and recognition as well as page object detection in documents. Recently, we released two new benchmark datasets: ReadingBank for the reading order detection task, and XFUND for the multilingual form understanding task, which contains forms in seven languages.

In addition to the benchmark datasets, we also proposed multimodal Document Foundation Models, including the pre-trained LayoutLM model family for Document AI, which has been widely adopted by 1st- and 3rd-party products and applications in Azure AI, such as Form Recognizer. The LayoutLM/LayoutXLM model family has been applied to a wide range of Document AI applications, including table detection, page object detection, reading order detection with LayoutReader, form/receipt/invoice understanding, complex document understanding, document image classification, and document VQA, achieving state-of-the-art performance across these benchmarks.

Distinct from fixed-layout documents, markup-based documents provide another viewpoint for document representation learning through markup structures, because 2D position information and document image information cannot be used straightforwardly during pre-training. Instead, MarkupLM takes advantage of the tree-based markup structures to model the relationships among different units within the document. MarkupLM is proposed to jointly pre-train text and markup language in a single framework for markup-based VrDU tasks.

Recently, we presented our latest research for OCR, namely TrOCR, a Transformer-based OCR model with a pre-trained image Transformer and a text Transformer. TrOCR is convolution-free and can be easily adapted for multilingual text recognition as well as cloud/edge deployment.

Image Transformers have recently achieved considerable progress in natural image understanding, using either supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. We propose DiT, a self-supervised pre-trained Document Image Transformer model that uses large-scale unlabeled text images for Document AI tasks, which is essential since no supervised counterparts exist due to the lack of human-labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, and table detection, where significant improvements and new SOTA results have been achieved.
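To make the MarkupLM idea concrete: its markup-structure input pairs each text node with its path in the DOM tree (MarkupLM uses XPath-based embeddings for this). Below is a minimal sketch, using only Python's standard-library `html.parser`, of how such text/XPath pairs can be extracted from an HTML document. The class and its preprocessing details are illustrative assumptions for this post, not MarkupLM's actual pipeline.

```python
from collections import Counter
from html.parser import HTMLParser

class XPathExtractor(HTMLParser):
    """Collect (text, xpath) pairs from an HTML document.

    Sketches the intuition behind markup-structure inputs: every
    text node is tagged with its position in the DOM tree. This is
    an illustrative helper, not MarkupLM's official preprocessing.
    """

    # Tags that never receive a closing tag in HTML.
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.stack = []              # (tag, sibling-index) of open elements
        self.children = [Counter()]  # per-depth counts of child tag names
        self.pairs = []              # collected (text, xpath) results

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        self.children[-1][tag] += 1  # 1-based index among same-named siblings
        self.stack.append((tag, self.children[-1][tag]))
        self.children.append(Counter())

    def handle_endtag(self, tag):
        if tag not in self.VOID and self.stack:
            self.stack.pop()
            self.children.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            xpath = "/" + "/".join(f"{t}[{i}]" for t, i in self.stack)
            self.pairs.append((text, xpath))

parser = XPathExtractor()
parser.feed("<html><body><div><h1>Title</h1>"
            "<p>First.</p><p>Second.</p></div></body></html>")
for text, xpath in parser.pairs:
    print(text, "->", xpath)
# Title   -> /html[1]/body[1]/div[1]/h1[1]
# First.  -> /html[1]/body[1]/div[1]/p[1]
# Second. -> /html[1]/body[1]/div[1]/p[2]
```

Each token in a text node can then be associated with its XPath sequence, which is what lets a markup-based model reason about structural relationships (e.g. two `<p>` siblings under the same `<div>`) without any 2D layout or image signal.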