|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
base_model: |
|
- Qwen/Qwen2.5-1.5B-Instruct |
|
- laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup |
|
pipeline_tag: question-answering |
|
metrics: |
|
- accuracy |
|
library_name: transformers |
|
--- |
|
|
|
[Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions](https://arxiv.org/abs/2412.08737) |
|
|
|
# Model Card for Euclid-convnext-xxlarge (Version on 12/05/2024) |
|
|
|
A multimodal large language model specifically trained for strong low-level geometric perception.
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
Euclid is trained on 1.6M synthetic geometry images with high-fidelity question-answer pairs using a curriculum learning approach. |
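
For intuition, here is a minimal sketch of curriculum-style data scheduling, where easier task types unlock before harder ones. The stage names and unlock thresholds below are illustrative assumptions, not the paper's actual recipe:

```python
# Illustrative only: unlock harder geometry tasks as training progresses.
# Stage names and thresholds are hypothetical, not Euclid's exact schedule.
CURRICULUM = [
    ("point_on_line", 0.0),    # available from the start
    ("point_on_circle", 0.2),  # unlocked after 20% of training
    ("angle_classification", 0.4),
    ("length_comparison", 0.6),
]

def available_tasks(progress: float) -> list[str]:
    """Return the task types unlocked at a given fraction of training."""
    return [task for task, unlock_at in CURRICULUM if progress >= unlock_at]

print(available_tasks(0.5))  # ['point_on_line', 'point_on_circle', 'angle_classification']
```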
|
|
|
It combines a ConvNeXt visual encoder with a Qwen-2.5 language model, connected through a 2-layer MLP multimodal connector. |
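
As a rough illustration of this architecture, the connector projects visual-encoder features into the language model's embedding space. The PyTorch sketch below is a generic 2-layer MLP projector; the feature widths (3072 for ConvNeXt-XXLarge, 1536 for Qwen2.5-1.5B) and the GELU activation are assumptions, not confirmed hyperparameters:

```python
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Generic 2-layer MLP projector from vision features to LM embeddings.

    Dimensions are assumptions (ConvNeXt-XXLarge final feature width and
    Qwen2.5-1.5B hidden size), not values confirmed by this model card.
    """

    def __init__(self, vision_dim: int = 3072, lm_dim: int = 1536):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, num_patches, vision_dim)
        return self.proj(visual_features)  # (batch, num_patches, lm_dim)
```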
|
|
|
|
|
### Model Sources |
|
|
|
- **Repository:** https://github.com/euclid-multimodal/Euclid |
|
- **Paper:** https://arxiv.org/abs/2412.08737 |
|
- **Demo:** https://euclid-multimodal.github.io/ |
|
|
|
## Uses |
|
|
|
The model is trained for precise low-level geometric perception and can perform:
|
- Point-on-line detection |
|
- Point-on-circle detection |
|
- Angle classification |
|
- Length comparison |
|
- Geometric annotation understanding |
|
|
|
Please refer to our [repo](https://github.com/euclid-multimodal/Euclid) for the full input format.
|
|
|
### Limitations and Applications |
|
|
|
Our model is not designed to handle: |
|
- Comprehensive image understanding tasks |
|
- Advanced cognitive reasoning beyond geometric analysis |
|
|
|
However, the model demonstrates strength in low-level visual perception. |
|
|
|
This capability makes it potentially valuable as a base model for specialized downstream fine-tuning, including:
|
|
|
- Robotic vision and automation systems |
|
- Medical imaging and diagnostic support |
|
- Industrial quality assurance and inspection |
|
- Geometric education and visualization tools |
|
|
|
### Example Usage |
|
|
|
Clone our Euclid [repo](https://github.com/euclid-multimodal/Euclid) first, set up the environment, then run: |
|
|
|
```bash
|
pip install -U "huggingface_hub[cli]" |
|
huggingface-cli download --cache-dir $MODEL_PATH EuclidAI/Euclid-convnext-xxlarge |
|
python euclid/eval/run_euclid_geo.py --model_path $MODEL_PATH --device cuda |
|
``` |
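
If you prefer a programmatic download over the CLI, the same weights can be fetched with `huggingface_hub` (the cache directory below is an example path mirroring `$MODEL_PATH` above):

```python
from huggingface_hub import snapshot_download

# Download the model weights; cache_dir is an example path ($MODEL_PATH above).
model_path = snapshot_download(
    repo_id="EuclidAI/Euclid-convnext-xxlarge",
    cache_dir="./euclid_weights",
)
print(model_path)  # local directory containing the downloaded snapshot
```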
|
|
|
|
|
|
|
## Evaluation Results |
|
|
|
Performance on Geoperception benchmark tasks (Overall is the unweighted mean over the seven tasks):
|
|
|
| Model | POL | POC | ALC | LHC | PEP | PRA | EQL | Overall | |
|
|-------|-----|-----|-----|-----|-----|-----|-----|----------| |
|
| Random Baseline | 0.43 | 2.63 | 59.92 | 51.36 | 0.25 | 0.00 | 0.02 | 16.37 | |
|
| Pixtral-12B | 22.85 | 53.21 | 47.33 | 51.43 | 22.53 | 37.11 | **58.45** | 41.84 | |
|
| Gemini-1.5-Pro | 24.42 | **69.80** | 57.96 | 79.05 | **39.60** | **77.59** | 52.27 | 57.24 | |
|
| EUCLID-ConvNeXt-Large | 80.54 | 57.76 | 86.37 | 88.24 | 42.23 | 64.94 | 34.45 | 64.93 | |
|
| EUCLID-ConvNeXt-XXLarge | **82.98** | 61.45 | **90.56** | **90.82** | **46.96** | 70.52 | 31.94 | **67.89** | |
|
|
|
|
|
## Citation |
|
|
|
If you find Euclid useful for your research and applications, please cite using this BibTeX: |
|
```bibtex |
|
@article{zhang2024euclid, |
|
title={Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions}, |
|
author={Zhang, Jiarui and Liu, Ollie and Yu, Tianyu and Hu, Jinyi and Neiswanger, Willie}, |
|
journal={arXiv preprint arXiv:2412.08737}, |
|
year={2024} |
|
}
```