---
license: mit
base_model: microsoft/Florence-2-base
---
<a id="readme-top"></a>
[![arXiv][paper-shield]][paper-url]
[![MIT License][license-shield]][license-url]
<!-- PROJECT LOGO -->
<br />
<div align="center">
<!-- <a href="https://github.com/othneildrew/Best-README-Template">
<img src="images/logo.png" alt="Logo" width="80" height="80">
</a> -->
<h3 align="center">TinyClick: Single-Turn Agent for Empowering GUI Automation</h3>
<p align="center">
Code for running the model from the paper "TinyClick: Single-Turn Agent for Empowering GUI Automation".
</p>
</div>
<!-- ABOUT THE PROJECT -->
## About The Project
We present a single-turn agent for graphical user interface (GUI) interaction tasks, built on the Florence-2-Base vision-language model. The agent's main goal is to click on the desired UI element, given a screenshot and a user command. It demonstrates strong performance on Screenspot and OmniAct while maintaining a compact size of 0.27B parameters and minimal latency.
<!-- USAGE EXAMPLES -->
## Usage
To set up the environment for running the code, please refer to the [GitHub repository](https://github.com/SamsungLabs/TinyClick). All necessary libraries and dependencies are listed in the `requirements.txt` file.
```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import requests
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the processor and model (trust_remote_code is required by the checkpoint).
processor = AutoProcessor.from_pretrained(
    "Samsung/TinyClick", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "Samsung/TinyClick",
    trust_remote_code=True,
).to(device)

# Download the sample screenshot.
url = "https://huggingface.co/Samsung/TinyClick/resolve/main/sample.png"
img = Image.open(requests.get(url, stream=True).raw)

command = "click on accept and continue button"
image_size = img.size  # (width, height), passed to postprocessing later

# The prompt is a fixed prefix followed by the command, all lowercased.
input_text = ("What to do to execute the command? " + command.strip()).lower()

inputs = processor(
    images=img,
    text=input_text,
    return_tensors="pt",
    do_resize=True,
).to(device)  # keep the inputs on the same device as the model

outputs = model.generate(**inputs)
# Special tokens are kept so that postprocessing sees the raw model output.
generated_texts = processor.batch_decode(outputs, skip_special_tokens=False)
```
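When running several commands against the same model, the prompt construction shown above can be wrapped in a small helper. This is only a convenience sketch: `build_prompt` and `PROMPT_PREFIX` are hypothetical names and not part of the TinyClick repository. The prefix and the lowercasing mirror the usage example.

```python
PROMPT_PREFIX = "What to do to execute the command? "


def build_prompt(command: str) -> str:
    """Hypothetical helper: build the model prompt for one user command.

    Mirrors the usage example: strip surrounding whitespace from the
    command, prepend the fixed prefix, and lowercase the whole string.
    """
    return (PROMPT_PREFIX + command.strip()).lower()


if __name__ == "__main__":
    print(build_prompt("  Click on Accept and Continue button "))
```

The returned string can be passed as `text=` to the processor in place of `input_text`.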
The postprocessing function is available in our GitHub repository: https://github.com/SamsungLabs/TinyClick
```python
from tinyclick_utils import postprocess
result = postprocess(generated_texts[0], image_size)
```
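To give a rough idea of what such postprocessing involves, here is a minimal sketch. It is not the repository's `postprocess` implementation. It assumes that the generated text contains Florence-2-style location tokens of the form `<loc_N>`, where `N` indexes one of 1000 bins along each image axis, and that tokens come in `(x, y)` pairs. The helper name `loc_tokens_to_pixels`, the `bins` parameter, and the use of bin centers are assumptions made for illustration.

```python
import re


def loc_tokens_to_pixels(text, image_size, bins=1000):
    """Hypothetical helper: convert <loc_N> token pairs to pixel (x, y) points.

    Assumes each token indexes one of `bins` equal-width bins along an axis
    and that tokens alternate between x and y coordinates.
    """
    width, height = image_size
    values = [int(v) for v in re.findall(r"<loc_(\d+)>", text)]
    points = []
    # Consume the values pairwise as (x_bin, y_bin); a trailing odd value is ignored.
    for x_bin, y_bin in zip(values[0::2], values[1::2]):
        # Map each bin index to the center of its bin, in pixels.
        x = (x_bin + 0.5) / bins * width
        y = (y_bin + 0.5) / bins * height
        points.append((x, y))
    return points


if __name__ == "__main__":
    # Made-up model output, used only to exercise the helper.
    print(loc_tokens_to_pixels("click <loc_500><loc_250>", (1000, 800)))
```

With the variables from the usage example, `loc_tokens_to_pixels(generated_texts[0], image_size)` would return a list of candidate points under these assumptions. For actual use, prefer the repository's `postprocess`, which reflects the model's real output format.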
<!-- CITATION -->
## Citation
```
@misc{pawlowski2024tinyclicksingleturnagentempowering,
title={TinyClick: Single-Turn Agent for Empowering GUI Automation},
author={Pawel Pawlowski and Krystian Zawistowski and Wojciech Lapacz and Marcin Skorupa and Adam Wiacek and Sebastien Postansque and Jakub Hoscilowicz},
year={2024},
eprint={2410.11871},
archivePrefix={arXiv},
primaryClass={cs.HC},
url={https://arxiv.org/abs/2410.11871},
}
```
<!-- LICENSE -->
## License
This project is distributed under the MIT License. See `LICENSE` for more information.
<p align="right">(<a href="#readme-top">back to top</a>)</p>
<!-- MARKDOWN LINKS & IMAGES -->
[paper-shield]: https://img.shields.io/badge/2024-arXiv-red
[paper-url]: https://arxiv.org/abs/2410.11871
[license-shield]: https://img.shields.io/badge/License-MIT-yellow.svg
[license-url]: https://opensource.org/licenses/MIT