# Model Card for Eagle-X2-Llama3-8B-VisualAnalogy-AlignMixPlus-120k

This model follows the adapter-based VLM architecture of LLaVA and Eagle. It uses meta-llama/Meta-Llama-3-8B-Instruct as the base LLM, with CLIP-448 (based on CLIP-336) and ConvNeXt as the visual encoders. The full model has 9.21B parameters and is stored in BF16 safetensors.
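Below is a minimal inference sketch, assuming the checkpoint ships custom modeling code that transformers can load with `trust_remote_code=True`; the processor behavior, prompt format, and generation settings are illustrative assumptions rather than the verified interface, so consult the Eagle codebase for the canonical loading path.

```python
# Hedged loading sketch: class/processor resolution via trust_remote_code is
# an assumption here, not a confirmed API surface for this checkpoint.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "PrincetonPLI/Eagle-X2-Llama3-8B-VisualAnalogy-AlignMixPlus-120k"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the checkpoint is stored in BF16
    trust_remote_code=True,
).eval()

# Hypothetical example input: a Visual Analogy puzzle image plus a question.
image = Image.open("analogy_example.png")
prompt = "Which option completes the visual analogy?"

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```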

## Training Details

We trained Eagle-X2-Llama3-8B on 120k examples of Align-Mix+ supervision for the Visual Analogy task.

## Citation

Paper: [Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?](https://arxiv.org/abs/2501.02669)

```bibtex
@misc{park2025generalizingsimplehardvisual,
      title={Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?},
      author={Simon Park and Abhishek Panigrahi and Yun Cheng and Dingli Yu and Anirudh Goyal and Sanjeev Arora},
      year={2025},
      eprint={2501.02669},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.02669},
}
```

## Contact

- Simon Park, Princeton University
- Abhishek Panigrahi, Princeton University
- Yun Cheng, Princeton University

Email: {juhyunp, ap34, yc6206} 'at' princeton 'dot' edu
