osunlp/UGround-V1-2B
Image-Text-to-Text
UGround: Universal GUI Visual Grounding for GUI Agents
Note Based on Qwen2-VL-2B-Instruct
Note Based on Qwen2-VL-7B-Instruct
Note Based on Qwen2-VL-72B-Instruct
Note The initial model. Based on the modified LLaVA architecture (CLIP + Vicuna-7B) described in the paper.
Note A low-cost, scalable, and effective data synthesis pipeline for GUI visual grounding; UGround, a SOTA GUI visual grounding model; SeeAct-V, a purely vision-only (modular) GUI agent framework; and the first demonstration of SOTA performance by vision-only GUI agents.
Note Paused. A new one will be opened for the Qwen2-VL-based UGround.
Note Paused while we work out how to accelerate inference. A new one will be opened for UGround-V1.1.