|
--- |
|
license: mit |
|
base_model: microsoft/Florence-2-base |
|
--- |
|
|
|
<a id="readme-top"></a> |
|
|
|
[![arXiv][paper-shield]][paper-url] |
|
[![MIT License][license-shield]][license-url] |
|
|
|
<!-- PROJECT LOGO --> |
|
<br /> |
|
<div align="center"> |
|
|
<h3 align="center">TinyClick: Single-Turn Agent for Empowering GUI Automation</h3> |
|
<p align="center"> |
|
Code for running the model from the paper: TinyClick: Single-Turn Agent for Empowering GUI Automation
|
</p> |
|
</div> |
|
|
|
|
|
<!-- ABOUT THE PROJECT --> |
|
## About The Project |
|
|
|
We present a single-turn agent for graphical user interface (GUI) interaction tasks, built on the Vision-Language Model Florence-2-Base. The agent's main goal is to click on the desired UI element based on a screenshot and a user command. It demonstrates strong performance on ScreenSpot and OmniACT while maintaining a compact size of 0.27B parameters and minimal latency.
|
|
|
|
|
<!-- USAGE EXAMPLES --> |
|
## Usage |
|
To set up the environment for running the code, please refer to the [GitHub repository](https://github.com/SamsungLabs/TinyClick). All necessary libraries and dependencies are listed in the `requirements.txt` file.
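As a quick sanity check before running the example, you can confirm that the core dependencies are importable and whether a GPU is visible (illustrative only; `requirements.txt` in the repository remains the authoritative dependency list):

```python
# Minimal environment check; requirements.txt in the TinyClick repository
# is the authoritative dependency list.
import torch
import transformers

print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```

The example below then runs the model end to end on a sample screenshot: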
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import requests
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Both the processor and the model ship custom code, hence trust_remote_code.
processor = AutoProcessor.from_pretrained(
    "Samsung/TinyClick", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "Samsung/TinyClick",
    trust_remote_code=True,
).to(device)

# Download the sample screenshot bundled with the model repository.
url = "https://huggingface.co/Samsung/TinyClick/resolve/main/sample.png"
img = Image.open(requests.get(url, stream=True).raw)

command = "click on accept and continue button"
image_size = img.size  # needed later to map the prediction back to pixels

# Build the prompt the model was trained on: a fixed question plus the
# lower-cased user command.
input_text = ("What to do to execute the command? " + command.strip()).lower()

inputs = processor(
    images=img,
    text=input_text,
    return_tensors="pt",
    do_resize=True,
).to(device)  # move the input tensors to the same device as the model

outputs = model.generate(**inputs)
# Keep special tokens: the location tokens in the output are parsed by the
# postprocessing step below.
generated_texts = processor.batch_decode(outputs, skip_special_tokens=False)
```
|
|
|
For the postprocessing function, which maps the generated location tokens back to pixel coordinates on the original screenshot, see our GitHub repository: https://github.com/SamsungLabs/TinyClick

```python
from tinyclick_utils import postprocess

# Parse the raw generated text into a structured action, using the original
# image size to recover pixel coordinates.
result = postprocess(generated_texts[0], image_size)
```
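As a hypothetical follow-up, the predicted action could be dispatched to the desktop, for example with `pyautogui`. The `click_point` key and pixel-coordinate format below are assumptions for illustration; check `tinyclick_utils` in the repository for the exact structure of `result`.

```python
# Illustrative only: assumes `result` exposes the predicted click point as
# pixel coordinates under a "click_point" key (hypothetical; verify against
# tinyclick_utils). pyautogui is not a TinyClick dependency.
import pyautogui

x, y = result["click_point"]
pyautogui.click(x, y)
```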
|
|
|
<!-- CITATION --> |
|
## Citation |
|
|
|
``` |
|
@misc{pawlowski2024tinyclicksingleturnagentempowering, |
|
title={TinyClick: Single-Turn Agent for Empowering GUI Automation}, |
|
author={Pawel Pawlowski and Krystian Zawistowski and Wojciech Lapacz and Marcin Skorupa and Adam Wiacek and Sebastien Postansque and Jakub Hoscilowicz}, |
|
year={2024}, |
|
eprint={2410.11871}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.HC}, |
|
url={https://arxiv.org/abs/2410.11871}, |
|
} |
|
``` |
|
|
|
|
|
<!-- LICENSE --> |
|
## License |
|
|
|
Distributed under the MIT License. See `LICENSE` for more information.
|
|
|
<p align="right">(<a href="#readme-top">back to top</a>)</p> |
|
|
|
|
|
<!-- MARKDOWN LINKS & IMAGES --> |
|
[paper-shield]: https://img.shields.io/badge/2024-arXiv-red |
|
[paper-url]: https://arxiv.org/abs/2410.11871 |
|
[license-shield]: https://img.shields.io/badge/License-MIT-yellow.svg |
|
[license-url]: https://opensource.org/licenses/MIT |
|
|