UI-TARS-7B-SFT

UI-TARS-2B-SFT  |  UI-TARS-2B-gguf  |  UI-TARS-7B-SFT  |  UI-TARS-7B-DPO  |  UI-TARS-7B-gguf  |  UI-TARS-72B-SFT  |  UI-TARS-72B-DPO

Introduction

UI-TARS is a next-generation native GUI agent model designed to interact seamlessly with graphical user interfaces (GUIs) using human-like perception, reasoning, and action capabilities. Unlike traditional modular frameworks, UI-TARS integrates all key components—perception, reasoning, grounding, and memory—within a single vision-language model (VLM), enabling end-to-end task automation without predefined workflows or manual rules.
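The end-to-end design described above can be pictured as a simple observe, predict, act loop in which one model call per step replaces separate perception, planning, and grounding modules. The following is a minimal sketch under that reading, not the UI-TARS implementation: `Action`, `AgentState`, `run_episode`, and the stub screen, model, and executor are hypothetical names introduced for illustration, and the 15-step default is an arbitrary budget.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    """One GUI action. Illustrative fields only, not the UI-TARS action schema."""
    kind: str          # e.g. "click", "type", "finished"
    x: int = 0
    y: int = 0
    text: str = ""


@dataclass
class AgentState:
    """Task instruction plus the action history that serves as memory."""
    instruction: str
    history: List[Action] = field(default_factory=list)


def run_episode(
    instruction: str,
    capture_screen: Callable[[], bytes],
    predict: Callable[[AgentState, bytes], Action],
    execute: Callable[[Action], None],
    max_steps: int = 15,
) -> List[Action]:
    """Screenshot -> one model call -> one action, until done or out of steps."""
    state = AgentState(instruction)
    for _ in range(max_steps):
        screenshot = capture_screen()
        action = predict(state, screenshot)  # single model call per step
        state.history.append(action)
        if action.kind == "finished":
            break
        execute(action)
    return state.history


# Stubs (hypothetical) so the sketch runs without a model or a real screen.
def fake_screen() -> bytes:
    return b"fake-screenshot-bytes"


def fake_model(state: AgentState, screenshot: bytes) -> Action:
    script = [
        Action("click", x=120, y=48),
        Action("type", text="hello"),
        Action("finished"),
    ]
    return script[min(len(state.history), len(script) - 1)]


executed: List[Action] = []
history = run_episode("Search for hello", fake_screen, fake_model, executed.append)
print([a.kind for a in history])
```

In a real setup the `predict` callable would wrap the vision-language model and `execute` would drive the operating system or browser; both are left abstract here on purpose.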

Performance

Perception Capability Evaluation

| Model | VisualWebBench | WebSRC | SQAshort |
|---|---|---|---|
| Qwen2-VL-7B | 73.3 | 81.8 | 84.9 |
| Qwen-VL-Max | 74.1 | 91.1 | 78.6 |
| Gemini-1.5-Pro | 75.4 | 88.9 | 82.2 |
| UIX-Qwen2-7B | 75.9 | 82.9 | 78.8 |
| Claude-3.5-Sonnet | 78.2 | 90.4 | 83.1 |
| GPT-4o | 78.5 | 87.7 | 82.3 |
| UI-TARS-2B | 72.9 | 89.2 | 86.4 |
| UI-TARS-7B | 79.7 | 93.6 | 87.7 |
| UI-TARS-72B | 82.8 | 89.3 | 88.6 |

Grounding Capability Evaluation

  • ScreenSpot Pro
| Agent Model | Dev-Text | Dev-Icon | Dev-Avg | Creative-Text | Creative-Icon | Creative-Avg | CAD-Text | CAD-Icon | CAD-Avg | Scientific-Text | Scientific-Icon | Scientific-Avg | Office-Text | Office-Icon | Office-Avg | OS-Text | OS-Icon | OS-Avg | Avg-Text | Avg-Icon | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| QwenVL-7B | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.7 | 0.0 | 0.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.1 |
| GPT-4o | 1.3 | 0.0 | 0.7 | 1.0 | 0.0 | 0.6 | 2.0 | 0.0 | 1.5 | 2.1 | 0.0 | 1.2 | 1.1 | 0.0 | 0.9 | 0.0 | 0.0 | 0.0 | 1.3 | 0.0 | 0.8 |
| SeeClick | 0.6 | 0.0 | 0.3 | 1.0 | 0.0 | 0.6 | 2.5 | 0.0 | 1.9 | 3.5 | 0.0 | 2.0 | 1.1 | 0.0 | 0.9 | 2.8 | 0.0 | 1.5 | 1.8 | 0.0 | 1.1 |
| Qwen2-VL-7B | 2.6 | 0.0 | 1.3 | 1.5 | 0.0 | 0.9 | 0.5 | 0.0 | 0.4 | 6.3 | 0.0 | 3.5 | 3.4 | 1.9 | 3.0 | 0.9 | 0.0 | 0.5 | 2.5 | 0.2 | 1.6 |
| OS-Atlas-4B | 7.1 | 0.0 | 3.7 | 3.0 | 1.4 | 2.3 | 2.0 | 0.0 | 1.5 | 9.0 | 5.5 | 7.5 | 5.1 | 3.8 | 4.8 | 5.6 | 0.0 | 3.1 | 5.0 | 1.7 | 3.7 |
| ShowUI-2B | 16.9 | 1.4 | 9.4 | 9.1 | 0.0 | 5.3 | 2.5 | 0.0 | 1.9 | 13.2 | 7.3 | 10.6 | 15.3 | 7.5 | 13.5 | 10.3 | 2.2 | 6.6 | 10.8 | 2.6 | 7.7 |
| CogAgent-18B | 14.9 | 0.7 | 8.0 | 9.6 | 0.0 | 5.6 | 7.1 | 3.1 | 6.1 | 22.2 | 1.8 | 13.4 | 13.0 | 0.0 | 10.0 | 5.6 | 0.0 | 3.1 | 12.0 | 0.8 | 7.7 |
| Aria-UI | 16.2 | 0.0 | 8.4 | 23.7 | 2.1 | 14.7 | 7.6 | 1.6 | 6.1 | 27.1 | 6.4 | 18.1 | 20.3 | 1.9 | 16.1 | 4.7 | 0.0 | 2.6 | 17.1 | 2.0 | 11.3 |
| UGround-7B | 26.6 | 2.1 | 14.7 | 27.3 | 2.8 | 17.0 | 14.2 | 1.6 | 11.1 | 31.9 | 2.7 | 19.3 | 31.6 | 11.3 | 27.0 | 17.8 | 0.0 | 9.7 | 25.0 | 2.8 | 16.5 |
| Claude Computer Use | 22.0 | 3.9 | 12.6 | 25.9 | 3.4 | 16.8 | 14.5 | 3.7 | 11.9 | 33.9 | 15.8 | 25.8 | 30.1 | 16.3 | 26.9 | 11.0 | 4.5 | 8.1 | 23.4 | 7.1 | 17.1 |
| OS-Atlas-7B | 33.1 | 1.4 | 17.7 | 28.8 | 2.8 | 17.9 | 12.2 | 4.7 | 10.3 | 37.5 | 7.3 | 24.4 | 33.9 | 5.7 | 27.4 | 27.1 | 4.5 | 16.8 | 28.1 | 4.0 | 18.9 |
| UGround-V1-7B | - | - | 35.5 | - | - | 27.8 | - | - | 13.5 | - | - | 38.8 | - | - | 48.8 | - | - | 26.1 | - | - | 31.1 |
| UI-TARS-2B | 47.4 | 4.1 | 26.4 | 42.9 | 6.3 | 27.6 | 17.8 | 4.7 | 14.6 | 56.9 | 17.3 | 39.8 | 50.3 | 17.0 | 42.6 | 21.5 | 5.6 | 14.3 | 39.6 | 8.4 | 27.7 |
| UI-TARS-7B | 58.4 | 12.4 | 36.1 | 50.0 | 9.1 | 32.8 | 20.8 | 9.4 | 18.0 | 63.9 | 31.8 | 50.0 | 63.3 | 20.8 | 53.5 | 30.8 | 16.9 | 24.5 | 47.8 | 16.2 | 35.7 |
| UI-TARS-72B | 63.0 | 17.3 | 40.8 | 57.1 | 15.4 | 39.6 | 18.8 | 12.5 | 17.2 | 64.6 | 20.9 | 45.7 | 63.3 | 26.4 | 54.8 | 42.1 | 15.7 | 30.1 | 50.9 | 17.5 | 38.1 |
  • ScreenSpot
| Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
|---|---|---|---|---|---|---|---|
| Agent Framework | | | | | | | |
| GPT-4 (SeeClick) | 76.6 | 55.5 | 68.0 | 28.6 | 40.9 | 23.3 | 48.8 |
| GPT-4 (OmniParser) | 93.9 | 57.0 | 91.3 | 63.6 | 81.3 | 51.0 | 73.0 |
| GPT-4 (UGround-7B) | 90.1 | 70.3 | 87.1 | 55.7 | 85.7 | 64.6 | 75.6 |
| GPT-4o (SeeClick) | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.3 |
| GPT-4o (UGround-7B) | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | 81.4 |
| Agent Model | | | | | | | |
| GPT-4 | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
| GPT-4o | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
| CogAgent | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | 47.4 |
| SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
| Qwen2-VL | 75.5 | 60.7 | 76.3 | 54.3 | 35.2 | 25.7 | 55.3 |
| UGround-7B | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
| Aguvis-G-7B | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.8 |
| OS-Atlas-7B | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 82.5 |
| Claude Computer Use | - | - | - | - | - | - | 83.0 |
| Gemini 2.0 (Project Mariner) | - | - | - | - | - | - | 84.0 |
| Aguvis-7B | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 84.4 |
| Aguvis-72B | 94.5 | 85.2 | 95.4 | 77.9 | 91.3 | 85.9 | 89.2 |
| Our Model | | | | | | | |
| UI-TARS-2B | 93.0 | 75.5 | 90.7 | 68.6 | 84.3 | 74.8 | 82.3 |
| UI-TARS-7B | 94.5 | 85.2 | 95.9 | 85.7 | 90.0 | 83.5 | 89.5 |
| UI-TARS-72B | 94.9 | 82.5 | 89.7 | 88.6 | 88.7 | 85.0 | 88.4 |
  • ScreenSpot v2
| Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
|---|---|---|---|---|---|---|---|
| Agent Framework | | | | | | | |
| GPT-4o (SeeClick) | 85.2 | 58.8 | 79.9 | 37.1 | 72.7 | 30.1 | 63.6 |
| GPT-4o (OS-Atlas-4B) | 95.5 | 75.8 | 79.4 | 49.3 | 90.2 | 66.5 | 79.1 |
| GPT-4o (OS-Atlas-7B) | 96.2 | 83.4 | 89.7 | 69.3 | 94.0 | 79.8 | 87.1 |
| Agent Model | | | | | | | |
| SeeClick | 78.4 | 50.7 | 70.1 | 29.3 | 55.2 | 32.5 | 55.1 |
| OS-Atlas-4B | 87.2 | 59.7 | 72.7 | 46.4 | 85.9 | 63.1 | 71.9 |
| OS-Atlas-7B | 95.2 | 75.8 | 90.7 | 63.6 | 90.6 | 77.3 | 84.1 |
| Our Model | | | | | | | |
| UI-TARS-2B | 95.2 | 79.1 | 90.7 | 68.6 | 87.2 | 78.3 | 84.7 |
| UI-TARS-7B | 96.9 | 89.1 | 95.4 | 85.0 | 93.6 | 85.2 | 91.6 |
| UI-TARS-72B | 94.8 | 86.3 | 91.2 | 87.9 | 91.5 | 87.7 | 90.3 |
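Grounding benchmarks in the ScreenSpot family are commonly scored by checking whether the predicted click point falls inside the target element's bounding box. The sketch below illustrates that scoring rule under this assumption. The function names, the `(left, top, right, bottom)` box format, and the sample coordinates are hypothetical and are not taken from any benchmark's official evaluation code.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom), assumed format


def point_in_box(point: Point, box: Box) -> bool:
    """True if the click point lies inside the box (edges count as inside)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions: List[Point], targets: List[Box]) -> float:
    """Percentage of predicted points that land inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, targets))
    return 100.0 * hits / len(predictions)


# Made-up sample: two of the three predicted clicks hit their targets.
sample_predictions = [(10, 10), (50, 50), (200, 5)]
sample_targets = [(0, 0, 20, 20), (60, 60, 80, 80), (190, 0, 210, 10)]
accuracy = grounding_accuracy(sample_predictions, sample_targets)
print(f"{accuracy:.1f}")
```

A score computed this way is a per-sample hit rate, so an overall average depends on how many samples each split contains.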

Offline Agent Capability Evaluation

  • Multimodal Mind2Web
| Method | Cross-Task Ele.Acc | Cross-Task Op.F1 | Cross-Task Step SR | Cross-Website Ele.Acc | Cross-Website Op.F1 | Cross-Website Step SR | Cross-Domain Ele.Acc | Cross-Domain Op.F1 | Cross-Domain Step SR |
|---|---|---|---|---|---|---|---|---|---|
| Agent Framework | | | | | | | | | |
| GPT-4o (SeeClick) | 32.1 | - | - | 33.1 | - | - | 33.5 | - | - |
| GPT-4o (UGround) | 47.7 | - | - | 46.0 | - | - | 46.6 | - | - |
| GPT-4o (Aria-UI) | 57.6 | - | - | 57.7 | - | - | 61.4 | - | - |
| GPT-4V (OmniParser) | 42.4 | 87.6 | 39.4 | 41.0 | 84.8 | 36.5 | 45.5 | 85.7 | 42.0 |
| Agent Model | | | | | | | | | |
| GPT-4o | 5.7 | 77.2 | 4.3 | 5.7 | 79.0 | 3.9 | 5.5 | 86.4 | 4.5 |
| GPT-4 (SOM) | 29.6 | - | 20.3 | 20.1 | - | 13.9 | 27.0 | - | 23.7 |
| GPT-3.5 (Text-only) | 19.4 | 59.2 | 16.8 | 14.9 | 56.5 | 14.1 | 25.2 | 57.9 | 24.1 |
| GPT-4 (Text-only) | 40.8 | 63.1 | 32.3 | 30.2 | 61.0 | 27.0 | 35.4 | 61.9 | 29.7 |
| Claude | 62.7 | 84.7 | 53.5 | 59.5 | 79.6 | 47.7 | 64.5 | 85.4 | 56.4 |
| Aguvis-7B | 64.2 | 89.8 | 60.4 | 60.7 | 88.1 | 54.6 | 60.4 | 89.2 | 56.6 |
| CogAgent | - | - | 62.3 | - | - | 54.0 | - | - | 59.4 |
| Aguvis-72B | 69.5 | 90.8 | 64.0 | 62.6 | 88.6 | 56.5 | 63.5 | 88.5 | 58.2 |
| Our Model | | | | | | | | | |
| UI-TARS-2B | 62.3 | 90.0 | 56.3 | 58.5 | 87.2 | 50.8 | 58.8 | 89.6 | 52.3 |
| UI-TARS-7B | 73.1 | 92.2 | 67.1 | 68.2 | 90.9 | 61.7 | 66.6 | 90.9 | 60.5 |
| UI-TARS-72B | 74.7 | 92.5 | 68.6 | 72.4 | 91.2 | 63.5 | 68.9 | 91.8 | 62.1 |
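For readers unfamiliar with the column names: in Mind2Web-style evaluation, as we understand the benchmark's definitions, Ele.Acc checks whether the chosen element is correct, Op.F1 is a token-level F1 between the predicted and reference operation, and Step SR counts a step as successful only when both the element and the operation are correct. The sketch below illustrates the last two under that assumption. `token_f1` and `step_success` are hypothetical helper names, and the operation strings are made up for the example.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Token-level F1 between two whitespace-tokenized operation strings."""
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def step_success(element_correct: bool, predicted_op: str, reference_op: str) -> bool:
    """A step succeeds only if the element is right and the operation matches fully."""
    return element_correct and token_f1(predicted_op, reference_op) == 1.0


# Made-up example: the prediction misses one token of the reference operation.
partial = token_f1("TYPE new york", "TYPE new york city")
print(round(partial, 3))
print(step_success(True, "CLICK", "CLICK"))
print(step_success(True, "TYPE new york", "TYPE new york city"))
```

Because a step needs both parts to be right, Step SR can never exceed Ele.Acc for the same model and split, which matches the pattern in the table.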
  • Android Control and GUI Odyssey
| Agent Models | AndroidControl-Low Type | AndroidControl-Low Grounding | AndroidControl-Low SR | AndroidControl-High Type | AndroidControl-High Grounding | AndroidControl-High SR | GUIOdyssey Type | GUIOdyssey Grounding | GUIOdyssey SR |
|---|---|---|---|---|---|---|---|---|---|
| Claude | 74.3 | 0.0 | 19.4 | 63.7 | 0.0 | 12.5 | 60.9 | 0.0 | 3.1 |
| GPT-4o | 74.3 | 0.0 | 19.4 | 66.3 | 0.0 | 20.8 | 34.3 | 0.0 | 3.3 |
| SeeClick | 93.0 | 73.4 | 75.0 | 82.9 | 62.9 | 59.1 | 71.0 | 52.4 | 53.9 |
| InternVL-2-4B | 90.9 | 84.1 | 80.1 | 84.1 | 72.7 | 66.7 | 82.1 | 55.5 | 51.5 |
| Qwen2-VL-7B | 91.9 | 86.5 | 82.6 | 83.8 | 77.7 | 69.7 | 83.5 | 65.9 | 60.2 |
| Aria-UI | - | 87.7 | 67.3 | - | 43.2 | 10.2 | - | 86.8 | 36.5 |
| OS-Atlas-4B | 91.9 | 83.8 | 80.6 | 84.7 | 73.8 | 67.5 | 83.5 | 61.4 | 56.4 |
| OS-Atlas-7B | 93.6 | 88.0 | 85.2 | 85.2 | 78.5 | 71.2 | 84.5 | 67.8 | 62.0 |
| Aguvis-7B | - | - | 80.5 | - | - | 61.5 | - | - | - |
| Aguvis-72B | - | - | 84.4 | - | - | 66.4 | - | - | - |
| UI-TARS-2B | 98.1 | 87.3 | 89.3 | 81.2 | 78.4 | 68.9 | 93.9 | 86.8 | 83.4 |
| UI-TARS-7B | 98.0 | 89.3 | 90.8 | 83.7 | 80.5 | 72.5 | 94.6 | 90.1 | 87.0 |
| UI-TARS-72B | 98.1 | 89.9 | 91.3 | 85.2 | 81.5 | 74.7 | 95.4 | 91.4 | 88.6 |

Online Agent Capability Evaluation

| Method | OSWorld (Online) | AndroidWorld (Online) |
|---|---|---|
| Agent Framework | | |
| GPT-4o (UGround) | - | 32.8 |
| GPT-4o (Aria-UI) | 15.2 | 44.8 |
| GPT-4o (Aguvis-7B) | 14.8 | 37.1 |
| GPT-4o (Aguvis-72B) | 17.0 | - |
| GPT-4o (OS-Atlas-7B) | 14.6 | - |
| Agent Model | | |
| GPT-4o | 5.0 | 34.5 (SoM) |
| Gemini-Pro-1.5 | 5.4 | 22.8 (SoM) |
| Aguvis-72B | 10.3 | 26.1 |
| Claude Computer-Use | 14.9 (15 steps) | 27.9 |
| Claude Computer-Use | 22.0 (50 steps) | - |
| Our Model | | |
| UI-TARS-7B-SFT | 17.7 (15 steps) | 33.0 |
| UI-TARS-7B-DPO | 18.7 (15 steps) | - |
| UI-TARS-72B-SFT | 18.8 (15 steps) | 46.6 |
| UI-TARS-72B-DPO | 22.7 (15 steps) | - |
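The OSWorld column lets the SFT and DPO checkpoints be compared directly at the same 15-step budget. The short sketch below does that arithmetic using only the four UI-TARS scores listed above. The dictionary layout and the `dpo_gain` helper are hypothetical names chosen for the example.

```python
# OSWorld scores (15 steps) copied from the UI-TARS rows of the table above.
osworld_15_steps = {
    "UI-TARS-7B": {"SFT": 17.7, "DPO": 18.7},
    "UI-TARS-72B": {"SFT": 18.8, "DPO": 22.7},
}


def dpo_gain(scores: dict) -> float:
    """Absolute difference in points between the DPO and SFT checkpoints."""
    return round(scores["DPO"] - scores["SFT"], 1)


gains = {model: dpo_gain(scores) for model, scores in osworld_15_steps.items()}
for model, gain in gains.items():
    print(f"{model}: +{gain} points from DPO")
```

On these numbers the DPO checkpoint scores 1.0 point higher than SFT at 7B and 3.9 points higher at 72B.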

Citation

If you find our paper and model useful in your research, please cite our work:

@article{uitars2025,
  author    = {Yujia Qin and Yining Ye and Junjie Fang and Haoming Wang and Shihao Liang and Shizuo Tian and Junda Zhang and Jiahao Li and Yunxin Li and Shijue Huang and Wanjun Zhong and Kuanye Li and Jiale Yang and Yu Miao and Woyu Lin and Longxiang Liu and Xu Jiang and Qianli Ma and Jingyu Li and Xiaojun Xiao and Kai Cai and Chuang Li and Yaowei Zheng and Chaolin Jin and Chen Li and Xiao Zhou and Minchao Wang and Haoli Chen and Zhaojian Li and Haihua Yang and Haifeng Liu and Feng Lin and Tao Peng and Xin Liu and Guang Shi},
  title     = {UI-TARS: An End-to-End Framework for Autonomous GUI Agents with System-2 Reasoning and Iterative Reflection Tuning},
  url       = {https://github.com/bytedance/UI-TARS},
  year      = {2025}
}