license: mit
base_model: microsoft/Florence-2-base
TinyClick: Single-Turn Agent for Empowering GUI Automation
Code for running the model from the paper "TinyClick: Single-Turn Agent for Empowering GUI Automation".
About The Project
We present a single-turn agent for graphical user interface (GUI) interaction tasks, built on the Florence-2-Base vision-language model. The agent's main goal is to click the desired UI element, given a screenshot and a user command. It demonstrates strong performance on ScreenSpot and OmniACT while maintaining a compact size of 0.27B parameters and minimal latency.
Usage
To set up the environment for running the code, please refer to the GitHub repository. All necessary libraries and dependencies are listed in the requirements.txt file.
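from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import requests
import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the processor and the model; both rely on custom code from the Hub.
processor = AutoProcessor.from_pretrained(
    "Samsung/TinyClick", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "Samsung/TinyClick",
    trust_remote_code=True,
).to(device)

# Download the sample screenshot and make sure it is in RGB mode.
url = "https://huggingface.co/Samsung/TinyClick/resolve/main/sample.png"
img = Image.open(requests.get(url, stream=True).raw).convert("RGB")

command = "click on accept and continue button"
image_size = img.size  # (width, height), needed later for postprocessing

# Build the lowercase prompt: a fixed prefix followed by the user command.
input_text = ("What to do to execute the command? " + command.strip()).lower()

# Move the inputs to the same device as the model before generating.
inputs = processor(
    images=img,
    text=input_text,
    return_tensors="pt",
    do_resize=True,
).to(device)

outputs = model.generate(**inputs)
# Keep special tokens: the postprocessing step parses them.
generated_texts = processor.batch_decode(outputs, skip_special_tokens=False)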
The postprocessing function is available in our GitHub repository: https://github.com/SamsungLabs/TinyClick
from tinyclick_utils import postprocess
result = postprocess(generated_texts[0], image_size)
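The postprocess function in the repository is the reference implementation. As a rough illustration only of what such a step might involve, the sketch below assumes that the generated text contains a pair of Florence-2 style location tokens of the form `<loc_X><loc_Y>`, with each coordinate quantized into 1000 bins per axis. The helper name `loc_tokens_to_point`, the regular expression, and the plain bin-to-pixel scaling are all hypothetical and may differ from what tinyclick_utils actually does.

```python
import re
from typing import Optional, Tuple

# Assumption: coordinates are quantized into 1000 bins per axis,
# as in Florence-2 style <loc_N> tokens. Not taken from tinyclick_utils.
NUM_BINS = 1000
LOC_PAIR = re.compile(r"<loc_(\d+)><loc_(\d+)>")


def loc_tokens_to_point(
    text: str, image_size: Tuple[int, int]
) -> Optional[Tuple[float, float]]:
    """Hypothetical helper: map the first <loc_X><loc_Y> pair to pixels.

    Returns None when the text contains no location pair.
    """
    match = LOC_PAIR.search(text)
    if match is None:
        return None
    width, height = image_size  # PIL's Image.size is (width, height)
    x_bin, y_bin = int(match.group(1)), int(match.group(2))
    # Scale each bin index back to the original image resolution.
    return (x_bin / NUM_BINS * width, y_bin / NUM_BINS * height)


# Example with a made-up model output and a 1000x800 screenshot.
print(loc_tokens_to_point("click<loc_500><loc_250>", (1000, 800)))
# → (500.0, 200.0)
```

Passing the original `image_size` rather than the resized one matters here: the bins are relative to the image, so scaling by the screenshot's real width and height gives a point that can be clicked directly on screen.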
Citation
@misc{pawlowski2024tinyclicksingleturnagentempowering,
title={TinyClick: Single-Turn Agent for Empowering GUI Automation},
author={Pawel Pawlowski and Krystian Zawistowski and Wojciech Lapacz and Marcin Skorupa and Adam Wiacek and Sebastien Postansque and Jakub Hoscilowicz},
year={2024},
eprint={2410.11871},
archivePrefix={arXiv},
primaryClass={cs.HC},
url={https://arxiv.org/abs/2410.11871},
}
License
This project is released under the MIT license. See LICENSE for more information.