OCR Grounding + "A Bounding Box is Worth One Token"

#24
by Truc95 - opened

Hello,

SmolVLM is a great candidate for Intelligent Document Processing (OCR -> structured output) at scale. However, a good discussion on the LocalLLaMA subreddit has raised a few interesting points to be careful of for this use-case:
https://www.reddit.com/r/LocalLLaMA/comments/1hjfirl/comment/m36dovv/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
In short:
• Using LLMs for OCR is not recommended due to significant limitations compared to traditional OCR methods.
• Traditional OCR accurately fits each character without altering the original text, while LLM-based OCR uses high-dimensional embeddings leading to information loss and inaccuracies.
• LLM-based OCR struggles with real human inputs by introducing errors, correcting typos, and losing original layout. It performs better on simple AI-generated text.
• These models cannot detect their own inaccuracies, making it difficult to identify and rectify issues.
• Some tools attempt to mitigate this by combining classic OCR with LLM embeddings, but overall relying solely on LLMs for OCR leads to unreliable results.
• LMMs primarily rely on semantic understanding rather than direct character recognition, performing poorly on non-semantic text combinations.
• They struggle with handwritten text, Chinese text, and other languages beyond English.
• The limited input resolution of most LMMs constrains their ability to capture fine details in images.

In order to reduce these risks, OCR grounding could be used as a first step, with the recognized text and layout then forwarded to SmolVLM.
In addition, an interesting way forward has been brought up in https://arxiv.org/abs/2407.01976 - "A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding" - https://github.com/LayTextLLM/LayTextLLM - but it doesn't seem to have received much attention in the community.
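To illustrate the interleaving idea, here is a minimal sketch. Note the assumptions: the OCR spans and page dimensions are made-up example data, and the `<box_...>` placeholder string is a toy stand-in - in the actual LayTextLLM paper each bounding box is mapped to a *single learned embedding* by a layout projector, not to a text token. The sketch only shows the prompt-construction pattern and why one placeholder per box is cheaper than emitting four coordinate numbers as text.

```python
def quantize_bbox(bbox, page_w, page_h, bins=100):
    """Normalize pixel coords (x0, y0, x1, y1) into [0, bins) integer bins,
    so layout is resolution-independent."""
    x0, y0, x1, y1 = bbox
    q = lambda v, size: min(bins - 1, int(v / size * bins))
    return (q(x0, page_w), q(y0, page_h), q(x1, page_w), q(y1, page_h))

def interleave(ocr_spans, page_w, page_h):
    """Build an interleaved layout+text prompt: each OCR span is preceded
    by ONE layout placeholder instead of four coordinate numbers spelled
    out as text (which would cost many more input tokens)."""
    parts = []
    for text, bbox in ocr_spans:
        parts.append("<box_%d_%d_%d_%d>" % quantize_bbox(bbox, page_w, page_h))
        parts.append(text)
    return " ".join(parts)

# Hypothetical output from a traditional OCR engine: (text, (x0, y0, x1, y1))
spans = [
    ("Invoice", (40, 30, 180, 60)),
    ("Total: $42.00", (40, 700, 220, 730)),
]
print(interleave(spans, page_w=600, page_h=800))
# -> <box_6_3_30_7> Invoice <box_6_87_36_91> Total: $42.00
```

A prompt built this way could then be fed to the language model alongside (or instead of) the raw image, letting a classical OCR engine handle character fidelity while the model handles structuring.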

I would like to get the team's opinion on these two additional methods to improve SmolVLM's accuracy and keep hallucination to a minimum. One problem I can see is that OCR grounding + bounding boxes would consume a lot of input tokens.
Is there anything else worth exploring for pure documents -> structured data?

Cheers,
