-
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
Paper • 2412.04454 • Published • 57 -
Tree Search for Language Model Agents
Paper • 2407.01476 • Published -
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
Paper • 2401.10935 • Published • 4 -
OmniParser for Pure Vision Based GUI Agent
Paper • 2408.00203 • Published • 24
Collections
Discover the best community collections!
Collections including paper arxiv:2409.08264
-
Learning Language Games through Interaction
Paper • 1606.02447 • Published -
Naturalizing a Programming Language via Interactive Learning
Paper • 1704.06956 • Published -
Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration
Paper • 1802.08802 • Published -
Mapping Natural Language Commands to Web Elements
Paper • 1808.09132 • Published
-
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 40 -
WebArena: A Realistic Web Environment for Building Autonomous Agents
Paper • 2307.13854 • Published • 24 -
Mind2Web: Towards a Generalist Agent for the Web
Paper • 2306.06070 • Published • 19 -
Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation
Paper • 2410.13232 • Published • 41
-
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
Paper • 2409.08264 • Published • 44 -
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
Paper • 2409.15278 • Published • 24 -
Agent S: An Open Agentic Framework that Uses Computers Like a Human
Paper • 2410.08164 • Published • 24
-
Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection
Paper • 2409.08513 • Published • 12 -
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
Paper • 2409.08264 • Published • 44 -
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Paper • 2409.12191 • Published • 76 -
LLMs + Persona-Plug = Personalized LLMs
Paper • 2409.11901 • Published • 32
-
Automated Design of Agentic Systems
Paper • 2408.08435 • Published • 39 -
Self-Refine: Iterative Refinement with Self-Feedback
Paper • 2303.17651 • Published • 2 -
Automating Thought of Search: A Journey Towards Soundness and Completeness
Paper • 2408.11326 • Published • 1 -
Building Math Agents with Multi-Turn Iterative Preference Learning
Paper • 2409.02392 • Published • 15
-
Law of Vision Representation in MLLMs
Paper • 2408.16357 • Published • 92 -
CogVLM2: Visual Language Models for Image and Video Understanding
Paper • 2408.16500 • Published • 56 -
Learning to Move Like Professional Counter-Strike Players
Paper • 2408.13934 • Published • 23 -
Building and better understanding vision-language models: insights and future directions
Paper • 2408.12637 • Published • 124
-
Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation
Paper • 2408.11812 • Published • 5 -
WebArena: A Realistic Web Environment for Building Autonomous Agents
Paper • 2307.13854 • Published • 24 -
Agent Workflow Memory
Paper • 2409.07429 • Published • 29 -
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
Paper • 2409.08264 • Published • 44
-
Automated Design of Agentic Systems
Paper • 2408.08435 • Published • 39 -
On the limits of agency in agent-based models
Paper • 2409.10568 • Published • 13 -
On the Diagram of Thought
Paper • 2409.10038 • Published • 13 -
DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?
Paper • 2409.07703 • Published • 67
-
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Paper • 2404.05719 • Published • 83 -
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper • 2411.17465 • Published • 77 -
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Paper • 2404.07972 • Published • 46 -
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Paper • 2410.23218 • Published • 46