FinTabQA: Financial Table Question-Answering

A model for financial table question-answering using the LayoutLM architecture.

Quick start

To get started with FinTabQA, load the model and a fast tokenizer as you would any other Hugging Face Transformers model and tokenizer. Below is a minimal working example using the SynFinTabs dataset.

>>> from typing import List, Tuple
>>> from datasets import load_dataset
>>> from transformers import LayoutLMForQuestionAnswering, LayoutLMTokenizerFast
>>> import torch
>>> 
>>> synfintabs_dataset = load_dataset("ethanbradley/synfintabs")
>>> model = LayoutLMForQuestionAnswering.from_pretrained("ethanbradley/fintabqa")
>>> tokenizer = LayoutLMTokenizerFast.from_pretrained(
...     "microsoft/layoutlm-base-uncased")
>>> 
>>> def normalise_boxes(
...         boxes: List[List[int]],
...         old_image_size: Tuple[int, int],
...         new_image_size: Tuple[int, int]) -> List[List[int]]:
...     old_im_w, old_im_h = old_image_size
...     new_im_w, new_im_h = new_image_size
... 
...     return [[
...         max(min(int(x1 / old_im_w * new_im_w), new_im_w), 0),
...         max(min(int(y1 / old_im_h * new_im_h), new_im_h), 0),
...         max(min(int(x2 / old_im_w * new_im_w), new_im_w), 0),
...         max(min(int(y2 / old_im_h * new_im_h), new_im_h), 0)
...     ] for (x1, y1, x2, y2) in boxes]
>>> 
>>> item = synfintabs_dataset['test'][0]
>>> question_dict = next(question for question in item['questions']
...     if question['id'] == item['question_id'])
>>> encoding = tokenizer(
...     question_dict['question'].split(),
...     item['ocr_results']['words'],
...     max_length=512,
...     padding="max_length",
...     truncation="only_second",
...     is_split_into_words=True,
...     return_token_type_ids=True,
...     return_tensors="pt")
>>> 
>>> word_boxes = normalise_boxes(
...     item['ocr_results']['bboxes'],
...     item['image'].crop(item['bbox']).size,
...     (1000, 1000))
>>> token_boxes = []
>>> 
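>>> # Assign one box per token: table tokens take their word's box, [SEP]
>>> # takes [1000] * 4, and every other token (question, [CLS], padding)
>>> # takes [0] * 4.
>>> for i, s, w in zip(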
...         encoding['input_ids'][0],
...         encoding.sequence_ids(0),
...         encoding.word_ids(0)):
...     if s == 1:
...         token_boxes.append(word_boxes[w])
...     elif i == tokenizer.sep_token_id:
...         token_boxes.append([1000] * 4)
...     else:
...         token_boxes.append([0] * 4)
>>> 
>>> encoding['bbox'] = torch.tensor([token_boxes])
>>> outputs = model(**encoding)
>>> start = encoding.word_ids(0)[outputs['start_logits'].argmax(-1)]
>>> end = encoding.word_ids(0)[outputs['end_logits'].argmax(-1)]
>>> 
>>> print(f"Target: {question_dict['answer']}")
Target: 6,980
>>> 
>>> print(f"Prediction: {' '.join(item['ocr_results']['words'][start : end])}")
Prediction: 6,980
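
The box-alignment step above, and the choice of answer span, can be checked in isolation without downloading the model or dataset. The sketch below is illustrative only: `align_boxes_to_tokens` and `best_span` are hypothetical helper names, not part of FinTabQA or Transformers, and the token ids and logits are toy values. `best_span` is an assumed alternative to taking the two argmaxes independently; it only considers token pairs inside the table sequence with the end at or after the start, which avoids spans that land on the question or on special tokens.

```python
from typing import List, Optional, Sequence, Tuple


def align_boxes_to_tokens(
        input_ids: Sequence[int],
        sequence_ids: Sequence[Optional[int]],
        word_ids: Sequence[Optional[int]],
        word_boxes: List[List[int]],
        sep_token_id: int) -> List[List[int]]:
    """Hypothetical helper mirroring the loop in the quick start.

    Table tokens (sequence id 1) take the box of the word they came from,
    [SEP] tokens take [1000] * 4, and all remaining tokens take [0] * 4.
    """
    token_boxes = []
    for token_id, seq_id, word_id in zip(input_ids, sequence_ids, word_ids):
        if seq_id == 1:
            token_boxes.append(list(word_boxes[word_id]))
        elif token_id == sep_token_id:
            token_boxes.append([1000] * 4)
        else:
            token_boxes.append([0] * 4)
    return token_boxes


def best_span(
        start_logits: Sequence[float],
        end_logits: Sequence[float],
        sequence_ids: Sequence[Optional[int]]) -> Tuple[int, int]:
    """Hypothetical helper: pick the highest-scoring (start, end) token pair.

    Only tokens in the table sequence are candidates, and end >= start.
    The returned values are token indices, not word indices.
    """
    context = [i for i, s in enumerate(sequence_ids) if s == 1]
    if not context:
        raise ValueError("no table tokens in the encoding")
    return max(
        ((s, e) for s in context for e in context if e >= s),
        key=lambda span: start_logits[span[0]] + end_logits[span[1]])


# Toy encoding (assumed values): [CLS] q [SEP] t t [SEP] [PAD]
input_ids = [101, 2054, 102, 1020, 1010, 102, 0]
sequence_ids = [None, 0, None, 1, 1, None, None]
word_ids = [None, 0, None, 0, 0, None, None]
word_boxes = [[10, 20, 30, 40]]  # one table word split into two tokens

token_boxes = align_boxes_to_tokens(
    input_ids, sequence_ids, word_ids, word_boxes, sep_token_id=102)

# Independent argmaxes would both land on index 0 ([CLS]) here.
start_logits = [5.0, 0.1, 0.2, 1.0, 0.5, 0.0, 0.0]
end_logits = [4.0, 0.0, 0.1, 0.2, 2.0, 0.0, 0.0]
span = best_span(start_logits, end_logits, sequence_ids)
```

With a real encoding, the token indices returned by `best_span` would still need to be mapped to words through `encoding.word_ids(0)`, as the quick start does.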

Citation

If you use this model, please cite both the article using the citation below and the model itself.

@misc{bradley2024synfintabs,
      title         = {Syn{F}in{T}abs: A Dataset of Synthetic Financial Tables for Information and Table Extraction},
      author        = {Bradley, Ethan and Roman, Muhammad and Rafferty, Karen and Devereux, Barry},
      year          = {2024},
      eprint        = {2412.04262},
      archivePrefix = {arXiv},
      primaryClass  = {cs.LG},
      url           = {https://arxiv.org/abs/2412.04262}
}