---
base_model: OpenGVLab/InternVL2-4B
pipeline_tag: image-text-to-text
---

This repository contains the model presented in the paper [OS-ATLAS: A Foundation Action Model for Generalist GUI Agents](https://huggingface.co/papers/2410.23218).

<div align="center">

[\[🏠Homepage\]](https://osatlas.github.io) [\[💻Code\]](https://github.com/OS-Copilot/OS-Atlas) [\[🚀Quick Start\]](#quick-start) [\[📝Paper\]](https://arxiv.org/abs/2410.23218) [\[🤗Models\]](https://huggingface.co/collections/OS-Copilot/os-atlas-67246e44003a1dfcc5d0d045) [\[🤗ScreenSpot-v2\]](https://huggingface.co/datasets/OS-Copilot/ScreenSpot-v2)

</div>

![os-atlas](https://github.com/user-attachments/assets/cf2ee020-5e15-4087-9a7e-75cc43662494)

## Quick Start
OS-Atlas-Base-4B is a GUI grounding model fine-tuned from [InternVL2-4B](https://huggingface.co/OpenGVLab/InternVL2-4B).

**Notes:** Our models accept images of any size as input. The model outputs are normalized to relative coordinates within a 0-1000 range (either a center point or a bounding box defined by its top-left and bottom-right coordinates). For visualization, please remember to convert these relative coordinates back to the original image dimensions (see the sketch below).
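
As a minimal, unofficial sketch of that conversion: the helper below rescales a parsed point `[x, y]` or box `[x1, y1, x2, y2]` from the 0-1000 relative range to pixel coordinates. It assumes you have already extracted the numbers from the model's text reply; the function name and example values are illustrative only.

```python
def to_pixels(coords, image_width, image_height):
    """Rescale 0-1000 relative coordinates to absolute pixel coordinates.

    `coords` is either [x, y] (a center point) or [x1, y1, x2, y2]
    (a bounding box), as parsed from the model's reply.
    """
    scaled = []
    for i, value in enumerate(coords):
        # even indices are x-coordinates, odd indices are y-coordinates
        size = image_width if i % 2 == 0 else image_height
        scaled.append(round(value / 1000 * size))
    return scaled


# Example: a hypothetical box predicted on a 1920x1080 screenshot.
print(to_pixels([488, 394, 642, 421], 1920, 1080))  # -> [937, 426, 1233, 455]
```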

### Inference Example
First, install the `transformers` library:
```
pip install transformers
```
For additional dependencies (e.g., `torch` and `torchvision`, which the example below relies on), please refer to the [InternVL2 documentation](https://internvl.readthedocs.io/en/latest/get_started/installation.html).

Inference code example:
```python
import numpy as np
import requests
import torch
import torchvision.transforms as T
from io import BytesIO
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate the candidate tiling grids (columns x rows)
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image into image_size x image_size tiles
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images


def load_image(image_file, input_size=448, max_num=12):
    # accept either a local file path or an http(s) URL
    if image_file.startswith('http'):
        image = Image.open(BytesIO(requests.get(image_file).content)).convert('RGB')
    else:
        image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values


# To load the model across multiple GPUs, please refer to the `Multiple GPUs`
# section of the InternVL2 documentation.
path = 'OS-Copilot/OS-Atlas-Base-4B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`; `?raw=true` is appended so that GitHub
# serves the raw image bytes instead of an HTML page (or pass a local screenshot path)
pixel_values = load_image('https://github.com/OS-Copilot/OS-Atlas/blob/main/exmaples/images/web_dfacd48d-d2c2-492f-b94c-41e6a34ea99f.png?raw=true', max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

question = "In the screenshot of this web page, please give me the coordinates of the element I want to click on according to my instructions(with point).\n\"'Champions League' link\""
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
```
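
The reply is plain text that contains the predicted coordinates. The exact textual format is not specified here, so the following is only a hedged sketch for visualization: it pulls numbers out of the response with a regular expression, interprets two values as a center point and four as a bounding box (both in the 0-1000 relative range), and draws the prediction on the original screenshot with Pillow. The helper name, the regex-based parsing, and the output path are illustrative assumptions, not part of the official OS-Atlas tooling.

```python
import re

from PIL import Image, ImageDraw


def visualize_prediction(response_text, image_path, output_path='prediction.png'):
    """Illustrative helper: parse 0-1000 relative coordinates from the model's
    text reply and draw them on the original screenshot."""
    image = Image.open(image_path).convert('RGB')
    width, height = image.size
    numbers = [float(n) for n in re.findall(r'\d+\.?\d*', response_text)]

    draw = ImageDraw.Draw(image)
    if len(numbers) >= 4:
        # treat the first four numbers as a bounding box
        x1, y1, x2, y2 = (v / 1000 * s for v, s in
                          zip(numbers[:4], (width, height, width, height)))
        draw.rectangle([x1, y1, x2, y2], outline='red', width=3)
    elif len(numbers) >= 2:
        # treat the first two numbers as a center point
        x, y = numbers[0] / 1000 * width, numbers[1] / 1000 * height
        draw.ellipse([x - 5, y - 5, x + 5, y + 5], fill='red')
    image.save(output_path)
    return numbers


# e.g. visualize_prediction(response, 'screenshot.png')
```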

## Citation
If you find this repository helpful, feel free to cite our paper:
```bibtex
@article{wu2024atlas,
  title={OS-ATLAS: A Foundation Action Model for Generalist GUI Agents},
  author={Wu, Zhiyong and Wu, Zhenyu and Xu, Fangzhi and Wang, Yian and Sun, Qiushi and Jia, Chengyou and Cheng, Kanzhi and Ding, Zichen and Chen, Liheng and Liang, Paul Pu and others},
  journal={arXiv preprint arXiv:2410.23218},
  year={2024}
}
```