VLM_S2H Collection
Model checkpoints (9 items) for "Generalizing from SIMPLE to HARD Visual Reasoning"
This model follows the adapter-based VLM architecture of LLaVA and Eagle. It uses meta-llama/Meta-Llama-3-8B-Instruct as the base LLM, with CLIP-448 (based on CLIP-336) and ConvNeXt as the visual encoders.
We trained Eagle-X2-Llama3-8B on 160k examples of Mix supervision for the Consecutive Table Readout task.
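To make the architecture description concrete, below is a minimal, illustrative PyTorch sketch of an adapter-based dual-encoder VLM in the LLaVA/Eagle style. It is not the released training code: the encoder stubs, module names, and dimensions are assumptions chosen only to show how two visual encoders can be fused by channel concatenation and projected into the LLM embedding space through an MLP adapter.

```python
# Illustrative sketch (not the released code): an adapter-based VLM in the
# LLaVA/Eagle style. Two visual encoders produce patch-token features, the
# features are concatenated along the channel dimension, and a small MLP
# adapter projects them into the LLM embedding space so the visual tokens
# can be prepended to the text tokens.
import torch
import torch.nn as nn


class DualEncoderAdapterVLM(nn.Module):
    def __init__(self, clip_encoder: nn.Module, convnext_encoder: nn.Module,
                 clip_dim: int, convnext_dim: int, llm_hidden: int):
        super().__init__()
        self.clip_encoder = clip_encoder          # stand-in for the CLIP branch
        self.convnext_encoder = convnext_encoder  # stand-in for the ConvNeXt branch
        # Two-layer MLP adapter mapping fused visual features to LLM hidden size.
        self.adapter = nn.Sequential(
            nn.Linear(clip_dim + convnext_dim, llm_hidden),
            nn.GELU(),
            nn.Linear(llm_hidden, llm_hidden),
        )

    def visual_tokens(self, images: torch.Tensor) -> torch.Tensor:
        # Both encoders are assumed to return (batch, num_patches, dim)
        # with the same number of patch tokens.
        clip_feats = self.clip_encoder(images)
        convnext_feats = self.convnext_encoder(images)
        fused = torch.cat([clip_feats, convnext_feats], dim=-1)  # channel concat
        return self.adapter(fused)  # (batch, num_patches, llm_hidden)

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend projected visual tokens to the text token embeddings;
        # the combined sequence is what the base LLM would consume.
        vis = self.visual_tokens(images)
        return torch.cat([vis, text_embeds], dim=1)


if __name__ == "__main__":
    # Dummy encoders that turn images into fake patch tokens, just to check
    # shapes; real encoders would be CLIP / ConvNeXt vision backbones.
    class DummyEncoder(nn.Module):
        def __init__(self, dim: int, num_patches: int = 16):
            super().__init__()
            self.proj = nn.Linear(3 * 32 * 32, dim * num_patches)
            self.dim, self.num_patches = dim, num_patches

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            flat = x.flatten(1)
            return self.proj(flat).view(x.shape[0], self.num_patches, self.dim)

    model = DualEncoderAdapterVLM(DummyEncoder(64), DummyEncoder(48),
                                  clip_dim=64, convnext_dim=48, llm_hidden=128)
    images = torch.randn(2, 3, 32, 32)
    text_embeds = torch.randn(2, 10, 128)
    print(model(images, text_embeds).shape)  # torch.Size([2, 26, 128])
```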
Paper: Generalizing from SIMPLE to HARD Visual Reasoning
@misc{park2025generalizingsimplehardvisual,
  title={Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?},
  author={Simon Park and Abhishek Panigrahi and Yun Cheng and Dingli Yu and Anirudh Goyal and Sanjeev Arora},
  year={2025},
  eprint={2501.02669},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2501.02669},
}
Simon Park, Princeton University
Abhishek Panigrahi, Princeton University
Yun Cheng, Princeton University
{juhyunp, ap34, yc6206} 'at' princeton 'dot' edu
Base model: meta-llama/Meta-Llama-3-8B-Instruct
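As a starting point, here is a minimal sketch of fetching one of the released checkpoints with huggingface_hub. The repository id below is a placeholder, not a name confirmed by this card; substitute the actual checkpoint repository from the collection page.

```python
from huggingface_hub import snapshot_download

# Placeholder repository id -- replace with the actual checkpoint name
# listed in the VLM_S2H collection (9 checkpoints are released).
repo_id = "VLM_S2H/EagleX2-ConsecutiveTableReadout-MIX"

local_path = snapshot_download(repo_id=repo_id, local_dir="checkpoints/vlm_s2h")
print("Checkpoint files downloaded to:", local_path)
```

Loading the downloaded weights for inference goes through the Eagle/LLaVA-style codebase accompanying the paper rather than a plain `transformers` pipeline, and is not shown here.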