---
license: mit
base_model: microsoft/Florence-2-base
---

<a id="readme-top"></a>

[![arXiv][paper-shield]][paper-url]
[![MIT License][license-shield]][license-url]

<!-- PROJECT LOGO -->
<br />
<div align="center">
  <h3 align="center">TinyClick: Single-Turn Agent for Empowering GUI Automation</h3>
  <p align="center">
    Code for running the model from the paper "TinyClick: Single-Turn Agent for Empowering GUI Automation"
  </p>
</div>


<!-- ABOUT THE PROJECT -->
## About The Project

We present a single-turn agent for graphical user interface (GUI) interaction tasks, built on the Vision-Language Model Florence-2-Base. The agent's main goal is to click on the desired UI element, given a screenshot and a user command. It demonstrates strong performance on Screenspot and OmniAct while maintaining a compact size of 0.27B parameters and minimal latency.


<!-- USAGE EXAMPLES -->
## Usage
To set up the environment for running the code, please refer to the [GitHub repository](https://github.com/SamsungLabs/TinyClick). All necessary libraries and dependencies are listed in its `requirements.txt` file.

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import requests
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the processor and model from the Hugging Face Hub.
processor = AutoProcessor.from_pretrained(
    "Samsung/TinyClick", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "Samsung/TinyClick",
    trust_remote_code=True,
).to(device)

# Fetch the sample screenshot.
url = "https://huggingface.co/Samsung/TinyClick/resolve/main/sample.png"
img = Image.open(requests.get(url, stream=True).raw)

command = "click on accept and continue button"
image_size = img.size

# Build the prompt from the user command.
input_text = ("What to do to execute the command? " + command.strip()).lower()

inputs = processor(
    images=img,
    text=input_text,
    return_tensors="pt",
    do_resize=True,
).to(device)  # move the input tensors to the same device as the model

outputs = model.generate(**inputs)
# Keep special tokens: the generated location tokens are special tokens,
# and the postprocessing step parses them from the decoded text.
generated_texts = processor.batch_decode(outputs, skip_special_tokens=False)
```

For the postprocessing function, see our GitHub repository: https://github.com/SamsungLabs/TinyClick
```python
from tinyclick_utils import postprocess

result = postprocess(generated_texts[0], image_size)
```
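
If you only need the click point and prefer not to pull in the repository helper, here is a minimal decoding sketch. It assumes the model emits Florence-2-style `<loc_k>` location tokens on a 0–999 grid, with x before y; `decode_click_point` is a hypothetical name, and the authoritative logic is the `postprocess` function in the GitHub repository.

```python
import re


def decode_click_point(generated_text: str, image_size: tuple):
    """Hypothetical minimal decoder: maps <loc_k> tokens (assumed 0-999 grid,
    x-then-y order) to pixel coordinates. The canonical implementation is
    `postprocess` in the TinyClick GitHub repository."""
    width, height = image_size
    # Collect all <loc_k> tokens in order of appearance.
    bins = [int(b) for b in re.findall(r"<loc_(\d+)>", generated_text)]
    if len(bins) < 2:
        return None  # no location tokens found in the generation
    # Scale the first (x, y) pair from the token grid to pixel coordinates.
    x = round(bins[0] / 999 * width)
    y = round(bins[1] / 999 * height)
    return {"action": "click", "click_point": (x, y)}


result = decode_click_point(generated_texts[0], image_size)
```

The returned point can then be dispatched to whatever input backend drives your GUI.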

<!-- CITATION -->
## Citation

```
@misc{pawlowski2024tinyclicksingleturnagentempowering,
    title={TinyClick: Single-Turn Agent for Empowering GUI Automation}, 
    author={Pawel Pawlowski and Krystian Zawistowski and Wojciech Lapacz and Marcin Skorupa and Adam Wiacek and Sebastien Postansque and Jakub Hoscilowicz},
    year={2024},
    eprint={2410.11871},
    archivePrefix={arXiv},
    primaryClass={cs.HC},
    url={https://arxiv.org/abs/2410.11871}, 
}
```


<!-- LICENSE -->
## License

This project is distributed under the MIT License. See `LICENSE` for more information.

<p align="right">(<a href="#readme-top">back to top</a>)</p>


<!-- MARKDOWN LINKS & IMAGES -->
[paper-shield]: https://img.shields.io/badge/2024-arXiv-red
[paper-url]: https://arxiv.org/abs/2410.11871
[license-shield]: https://img.shields.io/badge/License-MIT-yellow.svg
[license-url]: https://opensource.org/licenses/MIT