arxiv:2307.15818

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Published on Jul 28, 2023
· Submitted by akhaliq on Aug 1, 2023

Abstract

We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens. We refer to such category of models as vision-language-action models (VLA) and instantiate an example of such a model, which we call RT-2. Our extensive evaluation (6k evaluation trials) shows that our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training. This includes significantly improved generalization to novel objects, the ability to interpret commands not present in the robot training data (such as placing an object onto a particular number or icon), and the ability to perform rudimentary reasoning in response to user commands (such as picking up the smallest or largest object, or the one closest to another object). We further show that incorporating chain of thought reasoning allows RT-2 to perform multi-stage semantic reasoning, for example figuring out which object to pick up for use as an improvised hammer (a rock), or which type of drink is best suited for someone who is tired (an energy drink).

Community

Paper author

For more videos and an interactive demo, see: https://robotics-transformer2.github.io/

Introduces RT-2 (Robotics Transformer 2), a family of vision-language-action (VLA) models: large VLMs (LVLMs and LLMs) are co-fine-tuned on Internet-scale vision-language tasks and on robotic trajectory data to obtain emergent semantic reasoning for robotics; the paper also experiments with chain-of-thought (CoT) reasoning on VLAs. Representing robot actions as language tokens allows joint training and knowledge transfer from the LVLMs: the model outputs action tokens, so vision, language, and action are handled by one model. Two VLMs serve as the base: PaLI-X (multilingual VLM, 5B and 55B) and PaLM-E (embodied multimodal LLM, 12B).

Action encoding follows RT-1: each of 7 action dimensions (end-effector X, Y, and Z position; X, Y, and Z rotation; and gripper extension) is discretized into 256 bins, and the resulting integers are written as a single string of space-separated integers, which serves as the target for VLM fine-tuning. For PaLI-X, action bins map directly to tokens (integers up to 1000 each have a unique token); for PaLM-E, the 256 least frequently used tokens are overwritten to serve as the integer action tokens. During decoding, the output is constrained by sampling only valid action tokens; real constraints (motion and end-effector pose) apply when the action is executed on the robot.

RT-2-PaLI-X-55B runs on a multi-TPU setup (GCP) that is networked to the robot, which allows a 1-3 Hz control loop; the 5B model can run at 5 Hz. Co-fine-tuning uses the training data of PaLI-X and PaLM-E (original data, everything from Google Everyday Robots) plus robot demonstrations from RT-1 (each trajectory annotated with a natural language instruction and egocentric images).

Compared against RT-1, VC-1 (artificial visual cortex), R3M (representation for robot manipulation), and MOO (a different VLM architecture: semantic map with RT-1). On seen tasks (present in the data, but with slightly different factors such as robot position and time of day), RT-2 is comparable to RT-1; RT-2 is best on unseen objects, backgrounds, and environments.
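The action encoding and the decoding constraint described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the per-dimension ranges `ACTION_LOW` and `ACTION_HIGH`, the uniform bin spacing, and all function names are hypothetical, and `constrained_argmax` is a greedy stand-in for sampling restricted to valid action tokens.

```python
from typing import List, Sequence

NUM_BINS = 256  # bins per action dimension, as stated in the summary

# Hypothetical normalized ranges for the 7 action dimensions
# (the real robot limits are not given in the summary).
ACTION_LOW = [-1.0] * 7
ACTION_HIGH = [1.0] * 7


def discretize(action: Sequence[float],
               low: Sequence[float] = ACTION_LOW,
               high: Sequence[float] = ACTION_HIGH,
               num_bins: int = NUM_BINS) -> List[int]:
    """Map each continuous value to an integer bin in [0, num_bins - 1]."""
    bins = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)       # clamp out-of-range values
        fraction = (clipped - lo) / (hi - lo)   # position within the range, 0..1
        bins.append(min(int(fraction * num_bins), num_bins - 1))
    return bins


def to_target_string(bins: Sequence[int]) -> str:
    """Join the bin indices into the space-separated fine-tuning target."""
    return " ".join(str(b) for b in bins)


def from_target_string(text: str,
                       low: Sequence[float] = ACTION_LOW,
                       high: Sequence[float] = ACTION_HIGH,
                       num_bins: int = NUM_BINS) -> List[float]:
    """Decode a predicted string back to continuous values (bin centers)."""
    bins = [int(token) for token in text.split()]
    return [lo + (b + 0.5) * (hi - lo) / num_bins
            for b, lo, hi in zip(bins, low, high)]


def constrained_argmax(logits: Sequence[float],
                       valid_token_ids: Sequence[int]) -> int:
    """Pick the best-scoring token among the valid action tokens only."""
    return max(valid_token_ids, key=lambda token_id: logits[token_id])
```

Decoding to bin centers keeps the round-trip error within half a bin width per dimension; whether the actual system decodes to centers or edges is not stated in the summary.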
For the open-source Language-Table benchmark, actions (e.g., movement over Cartesian coordinates on a table) are discretized and stored as text; RT-2 performs best compared to BC-Z (zero-shot generalization with imitation learning), RT-1, and LAVA (interactive language). The paper evaluates emergent abilities in object reasoning, semantic/symbol understanding, and human recognition; RT-2-PaLI-X-55B has better generalization performance (success rate) than RT-2-PaLM-E-12B, and both are much better than RT-1 and VC-1.

Chain of thought (CoT) is incorporated by training on "instruction, plan, and action" instead of just "instruction and action" (a data augmentation scheme); the predicted plan lets you inspect the model's reasoning.

The appendix includes contributions, datasets, baseline implementation details, VLM information (candidates for RT-2), training (co-fine-tuning) details, evaluation settings and results (qualitative results, failure cases, quantitative experimental results), emergent evaluations, and CoT results. From Google DeepMind.
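The "instruction, plan, and action" augmentation could look something like the sketch below. The field labels (`Instruction:`, `Plan:`, `Action:`), the function names, and the example strings are assumptions made for illustration; the summary does not specify the exact text format.

```python
from typing import List, Optional, Sequence, Tuple


def format_example(instruction: str,
                   action_bins: Sequence[int],
                   plan: Optional[str] = None) -> str:
    """Build one training string; the plan step is inserted only if given."""
    parts = [f"Instruction: {instruction}"]
    if plan is not None:
        parts.append(f"Plan: {plan}")  # the added chain-of-thought step
    parts.append("Action: " + " ".join(str(b) for b in action_bins))
    return " ".join(parts)


def parse_prediction(text: str) -> Tuple[Optional[str], List[int]]:
    """Split a model output into its (inspectable) plan and action bins."""
    plan_part, _, action_part = text.partition("Action:")
    plan = plan_part.replace("Plan:", "", 1).strip() or None
    bins = [int(token) for token in action_part.split()]
    return plan, bins


# Hypothetical example: the same demonstration with and without a plan.
plain = format_example("bring me something to drink", [1, 128, 91])
with_plan = format_example("bring me something to drink", [1, 128, 91],
                           plan="pick up the water bottle")
```

Keeping the plan as plain text before the action tokens is what makes the reasoning inspectable: `parse_prediction` returns the plan string alongside the decoded bins.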

Links: Website (demo), blog, arxiv, HuggingFace Papers, PapersWithCode, Unofficial/community GitHub
