arxiv:2501.05122

Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model

Published on Jan 9 · Submitted by Gregor on Jan 10

Abstract

Most Large Vision-Language Models (LVLMs) to date are trained predominantly on English data, which makes them struggle to understand non-English input and fail to generate output in the desired target language. Existing efforts mitigate these issues by adding multilingual training data, but do so in a largely ad-hoc manner, lacking insight into how different training mixes tip the scale for different groups of languages. In this work, we present a comprehensive investigation into the training strategies for massively multilingual LVLMs. First, we conduct a series of multi-stage experiments spanning 13 downstream vision-language tasks and 43 languages, systematically examining: (1) the number of training languages that can be included without degrading English performance and (2) optimal language distributions of pre-training as well as (3) instruction-tuning data. Further, we (4) investigate how to improve multilingual text-in-image understanding, and introduce a new benchmark for the task. Surprisingly, our analysis reveals that one can (i) include as many as 100 training languages simultaneously (ii) with as little as 25-50% of non-English data, to greatly improve multilingual performance while retaining strong English performance. We further find that (iii) including non-English OCR data in pre-training and instruction-tuning is paramount for improving multilingual text-in-image understanding. Finally, we put all our findings together and train Centurio, a 100-language LVLM, offering state-of-the-art performance in an evaluation covering 14 tasks and 56 languages.
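
To make the notion of a training-language distribution concrete, here is a toy Python sketch of how a mix with a fixed English share could be constructed. The uniform split over non-English languages is an illustrative assumption only; the paper compares several concrete distributions rather than prescribing this one.

# Toy sketch: a multilingual sampling distribution with a fixed English
# share. The uniform non-English split is an illustrative assumption.
def language_mix(languages, english_fraction=0.5):
    """Return per-language sampling probabilities for a training mix.

    English keeps `english_fraction` of the data; the remainder is
    split uniformly over all other languages.
    """
    non_english = [lang for lang in languages if lang != "en"]
    share = (1.0 - english_fraction) / len(non_english)
    mix = {lang: share for lang in non_english}
    mix["en"] = english_fraction
    return mix

# Example: 100 training languages with a 50% English share, the regime
# the abstract reports as retaining strong English performance.
langs = ["en"] + [f"lang{i:02d}" for i in range(99)]
mix = language_mix(langs, english_fraction=0.5)
print(mix["en"], round(mix["lang00"], 5))  # 0.5 0.00505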

Community

Paper author · Paper submitter

We are presenting "Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model".

Multilingual large vision-language models have to be trained on multilingual data, but what should this data look like? How many languages should it cover? How much of it should be non-English? And what about multilingual text in images?

In this work, we first extensively explore the design space of multilingual training data and then apply those lessons to train Centurio, two state-of-the-art multilingual LVLMs based on Aya-Expanse and Qwen 2.5.

For a summary of our results, see here. Our model checkpoints can be found in this HuggingFace collection.
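
As a quick start, below is a minimal sketch of how one of the released checkpoints might be loaded with the transformers library. The repo id, the prompt template, and the use of the generic Auto classes with trust_remote_code=True are assumptions, not taken from the model cards; check the HuggingFace collection for the exact, supported usage.

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Hypothetical repo id; see the collection for the real checkpoint names.
model_id = "WueNLP/centurio-qwen"

# Custom multimodal architectures on the Hub usually ship their own
# modeling code, hence trust_remote_code=True.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()

image = Image.open("example.jpg")
# Non-English prompt ("Describe the image." in German); the <image>
# placeholder convention is an assumption borrowed from common LVLMs.
prompt = "<image>\nBeschreibe das Bild."

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])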

