maxiw's Collections
Research on GUI Models
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs • arXiv:2404.05719 • 82 upvotes
ShowUI: One Vision-Language-Action Model for GUI Visual Agent • arXiv:2411.17465 • 76 upvotes
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments • arXiv:2404.07972 • 46 upvotes
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents • arXiv:2410.23218 • 46 upvotes
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale • arXiv:2409.08264 • 43 upvotes
ScreenAI: A Vision-Language Model for UI and Infographics Understanding • arXiv:2402.04615 • 39 upvotes
CogAgent: A Visual Language Model for GUI Agents • arXiv:2312.08914 • 29 upvotes
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents • arXiv:2410.05243 • 17 upvotes
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents • arXiv:2401.10935 • 4 upvotes
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding • arXiv:2210.03347 • 3 upvotes
Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms • arXiv:2410.18967 • 1 upvote
ScreenAgent: A Vision Language Model-driven Computer Control Agent • arXiv:2402.07945
From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces • arXiv:2306.00245
GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents • arXiv:2406.10819
ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots • arXiv:2209.08199