arxiv:2406.08478

What If We Recaption Billions of Web Images with LLaMA-3?

Published on Jun 12, 2024 · Submitted by akhaliq on Jun 13, 2024

Abstract

Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and open-sourced LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries. Our project page is https://www.haqtu.me/Recap-Datacomp-1B/

Community

Dear authors,

Thank you for your excellent work and the detailed analysis of the re-captioned datasets. I particularly appreciated your insights on the recaptioning pipeline and the training process for CLIP. I noticed that your work closely relates to the VeCLIP paper, which might be of interest to you. You can find our paper: https://arxiv.org/abs/2310.07699 and code: https://github.pie.apple.com/aiml-oss/ml-veclip. Thanks!
