|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
base_model: |
|
- Qwen/Qwen2.5-1.5B-Instruct |
|
- laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup |
|
pipeline_tag: question-answering |
|
metrics: |
|
- accuracy |
|
library_name: transformers |
|
--- |
|
|
|
[Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions](https://arxiv.org/abs/2412.08737) |
|
|
|
# Model Card for Euclid-convnext-xxlarge (Version on 12/05/2024) |
|
|
|
A multimodal large language model specifically trained for strong low-level geometric perception.
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
Euclid is trained on 1.6M synthetic geometry images with high-fidelity question-answer pairs using a curriculum learning approach. |
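
For intuition, here is a minimal sketch of curriculum-style data scheduling, where easier task types unlock before harder ones. The stage names and unlock thresholds below are illustrative assumptions, not the paper's actual recipe:

```python
# Illustrative only: unlock harder geometry tasks as training progresses.
# Stage names and thresholds are hypothetical, not Euclid's exact schedule.
CURRICULUM = [
    ("point_on_line", 0.0),    # available from the start
    ("point_on_circle", 0.2),  # unlocked after 20% of training
    ("angle_classification", 0.4),
    ("length_comparison", 0.6),
]

def available_tasks(progress: float) -> list[str]:
    """Return the task types unlocked at a given fraction of training."""
    return [task for task, unlock_at in CURRICULUM if progress >= unlock_at]

print(available_tasks(0.5))  # ['point_on_line', 'point_on_circle', 'angle_classification']
```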
|
|
|
It combines a ConvNeXt visual encoder with a Qwen-2.5 language model, connected through a 2-layer MLP multimodal connector. |
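
As a rough illustration of this architecture, the connector projects visual-encoder features into the language model's embedding space. The PyTorch sketch below is a generic 2-layer MLP projector; the feature widths (3072 for ConvNeXt-XXLarge, 1536 for Qwen2.5-1.5B) and the GELU activation are assumptions, not confirmed hyperparameters:

```python
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Generic 2-layer MLP projector from vision features to LM embeddings.

    Dimensions are assumptions (ConvNeXt-XXLarge final feature width and
    Qwen2.5-1.5B hidden size), not values confirmed by this model card.
    """

    def __init__(self, vision_dim: int = 3072, lm_dim: int = 1536):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, num_patches, vision_dim)
        return self.proj(visual_features)  # (batch, num_patches, lm_dim)
```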
|
|
|
|
|
### Model Sources |
|
|
|
- **Repository:** https://github.com/euclid-multimodal/Euclid |
|
- **Paper:** https://arxiv.org/abs/2412.08737 |
|
- **Demo:** https://euclid-multimodal.github.io/ |
|
|
|
## Uses |
|
|
|
The model is trained for precise low-level geometric perception and can perform:
|
- Point-on-line detection |
|
- Point-on-circle detection |
|
- Angle classification |
|
- Length comparison |
|
- Geometric annotation understanding |
|
|
|
Please refer to our [repo](https://github.com/euclid-multimodal/Euclid) for the full input format.
|
|
|
### Limitations and Applications |
|
|
|
Our model is not designed to handle: |
|
- Comprehensive image understanding tasks |
|
- Advanced cognitive reasoning beyond geometric analysis |
|
|
|
However, the model demonstrates strength in low-level visual perception. |
|
|
|
This capability makes it potentially valuable as a base model for specialized downstream fine-tuning, including:
|
|
|
- Robotic vision and automation systems |
|
- Medical imaging and diagnostic support |
|
- Industrial quality assurance and inspection |
|
- Geometric education and visualization tools |
|
|
|
### Example Usage |
|
|
|
Clone our Euclid [repo](https://github.com/euclid-multimodal/Euclid) first, set up the environment, then run: |
|
|
|
```bash
|
pip install -U "huggingface_hub[cli]" |
|
huggingface-cli download --cache-dir $MODEL_PATH EuclidAI/Euclid-convnext-xxlarge |
|
python euclid/eval/run_euclid_geo.py --model_path $MODEL_PATH --device cuda |
|
``` |
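
If you prefer a programmatic download over the CLI, the same weights can be fetched with `huggingface_hub` (the cache directory below is an example path mirroring `$MODEL_PATH` above):

```python
from huggingface_hub import snapshot_download

# Download the model weights; cache_dir is an example path ($MODEL_PATH above).
model_path = snapshot_download(
    repo_id="EuclidAI/Euclid-convnext-xxlarge",
    cache_dir="./euclid_weights",
)
print(model_path)  # local directory containing the downloaded snapshot
```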
|
|
|
|
|
|
|
## Evaluation Results |
|
|
|
Performance on Geoperception benchmark tasks (Overall is the unweighted mean over the seven tasks):
|
|
|
| Model | POL | POC | ALC | LHC | PEP | PRA | EQL | Overall | |
|
|-------|-----|-----|-----|-----|-----|-----|-----|----------| |
|
| Random Baseline | 0.43 | 2.63 | 59.92 | 51.36 | 0.25 | 0.00 | 0.02 | 16.37 | |
|
| Pixtral-12B | 22.85 | 53.21 | 47.33 | 51.43 | 22.53 | 37.11 | **58.45** | 41.84 | |
|
| Gemini-1.5-Pro | 24.42 | **69.80** | 57.96 | 79.05 | **39.60** | **77.59** | 52.27 | 57.24 | |
|
| EUCLID-ConvNeXt-Large | 80.54 | 57.76 | 86.37 | 88.24 | 42.23 | 64.94 | 34.45 | 64.93 | |
|
| EUCLID-ConvNeXt-XXLarge | **82.98** | 61.45 | **90.56** | **90.82** | **46.96** | 70.52 | 31.94 | **67.89** | |
|
|
|
|
|
## Citation |
|
|
|
If you find Euclid useful for your research and applications, please cite using this BibTeX: |
|
```bibtex |
|
@article{zhang2024euclid, |
|
title={Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions}, |
|
author={Zhang, Jiarui and Liu, Ollie and Yu, Tianyu and Hu, Jinyi and Neiswanger, Willie}, |
|
journal={arXiv preprint arXiv:2412.08737}, |
|
year={2024} |
|
}
```