	---
	license: mit
	library_name: transformers
	tags:
	- robotics
	- vla
	- diffusion
	- multimodal
	- pretraining
	language:
	- en
	pipeline_tag: robotics
	---
	# CogACT-Large

	CogACT is an advanced Vision-Language-Action (VLA) architecture derived from a vision-language model (VLM). Unlike previous works that directly repurpose a VLM for action prediction through simple action quantization, we propose a componentized VLA architecture with a specialized action module conditioned on the VLM output. CogACT-Large employs a [DiT-L](https://github.com/facebookresearch/DiT) model as the action module.

	Our [code](https://github.com/microsoft/CogACT) and [pretrained model weights](https://huggingface.co/CogACT) are licensed under the MIT license.

	Please refer to our [project page](https://cogact.github.io/) and [paper](https://arxiv.org/abs/2411.19650) for more details.


	## Model Summary

	- Developed by: The CogACT team, consisting of researchers from [Microsoft Research Asia](https://www.microsoft.com/en-us/research/lab/microsoft-research-asia/).
	- Model type: Vision-Language-Action (language, image => robot actions)
	- Language(s) (NLP): en
	- License: MIT
	- Model components:
	  + Vision Backbone: DINOv2 ViT-L/14 and SigLIP ViT-So400M/14
	  + Language Model: Llama-2
	  + Action Model: DiT-Large
	- Pretraining Dataset: A subset of [Open X-Embodiment](https://robotics-transformer-x.github.io/)
	- Repository: [https://github.com/microsoft/CogACT](https://github.com/microsoft/CogACT)
	- Paper: [CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation](https://arxiv.org/abs/2411.19650)
	- Project Page: [https://cogact.github.io/](https://cogact.github.io/)

	## Uses
	CogACT takes a language instruction and a single-view RGB image as input and predicts the next 16 normalized robot actions, each a 7-DoF end-effector delta of the form ``x, y, z, roll, pitch, yaw, gripper``. These actions must be unnormalized before use and can optionally be integrated with our ``Adaptive Action Ensemble``. Both unnormalization and the ensemble depend on the dataset statistics.
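To make the unnormalization step concrete, here is a minimal sketch. It assumes the normalized actions lie in `[-1, 1]` and that the dataset statistics provide per-dimension lower and upper bounds. The function name `unnormalize_actions`, the `mask` argument, and the numeric bounds are hypothetical and are not taken from the CogACT repository. In practice, `predict_action` handles this step when given an `unnorm_key`.

```python
import numpy as np

def unnormalize_actions(normalized, low, high, mask=None):
    """Map actions from [-1, 1] back to [low, high], per dimension.

    Dimensions where `mask` is False (for example a binary gripper)
    are passed through unchanged. All names here are illustrative.
    """
    normalized = np.clip(np.asarray(normalized, dtype=np.float64), -1.0, 1.0)
    low = np.asarray(low, dtype=np.float64)
    high = np.asarray(high, dtype=np.float64)
    scaled = 0.5 * (normalized + 1.0) * (high - low) + low
    if mask is None:
        return scaled
    return np.where(np.asarray(mask, dtype=bool), scaled, normalized)

# Hypothetical statistics for a 7-DoF action (x, y, z, roll, pitch, yaw, gripper)
low = np.array([-0.05] * 3 + [-0.25] * 3 + [0.0])
high = np.array([0.05] * 3 + [0.25] * 3 + [1.0])
mask = np.array([True] * 6 + [False])  # leave the gripper dimension as-is

chunk = np.zeros((16, 7))  # stand-in for 16 normalized predictions
chunk[:, 6] = 1.0          # gripper command
actions = unnormalize_actions(chunk, low, high, mask)
print(actions.shape)
```

Because the bounds come from the dataset, the same normalized prediction maps to different physical deltas for different robots, which is why the correct statistics must be selected at inference time.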

	CogACT models can be used zero-shot to control robots in setups seen in the [Open-X](https://robotics-transformer-x.github.io/) pretraining mixture. They can also be fine-tuned for new tasks and robot setups with an extremely small number of demonstrations. See [our repository](https://github.com/microsoft/CogACT) for more information.

	Here is a simple example for inference.

	```python
	# Clone our repo and install its dependencies first
	# (minimal dependencies: `torch`, `transformers`, `timm`, `tokenizers`, ...)

	from PIL import Image
	from vla import load_vla
	import torch

	model = load_vla(
	    'CogACT/CogACT-Large',
	    load_for_training=False,
	    action_model_type='DiT-L',
	    future_action_window_size=15,  # 15 future steps plus the current one gives 16 actions
	)
	# Requires about 30 GB of memory in fp32.
	# (Optional) load the VLM in bf16 to reduce memory:
	# model.vlm = model.vlm.to(torch.bfloat16)

	model.to('cuda:0').eval()

	# Replace the placeholder path with your own RGB image
	image: Image.Image = Image.open('path/to/your_image.png').convert('RGB')
	prompt = "move sponge near apple"  # input your prompt

	# Predict actions (7-DoF; unnormalized for RT-1 Google robot data, i.e. fractal20220817_data)
	actions, _ = model.predict_action(
	    image,
	    prompt,
	    unnorm_key='fractal20220817_data',  # the unnorm_key of your dataset
	    cfg_scale=1.5,      # cfg from 1.5 to 7 also performs well
	    use_ddim=True,      # use DDIM sampling
	    num_ddim_steps=10,  # number of steps for DDIM sampling
	)

	# Result: 7-DoF actions for 16 steps, with shape [16, 7]
	```
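Each call above returns 16 actions per observation, so consecutive calls produce overlapping predictions for the same timestep. The sketch below shows one plausible way to blend them: a weighted average in which older predictions count more when they agree (by cosine similarity) with the newest one. It illustrates the general idea only and is not the repository's ``Adaptive Action Ensemble`` implementation. The class name, the `history` and `alpha` parameters, and the exponential weighting are all assumptions.

```python
from collections import deque
import numpy as np

class SimilarityWeightedEnsemble:
    """Blend overlapping action-chunk predictions for the current timestep.

    Hypothetical helper: weights are exp(alpha * cosine similarity) between
    each stored prediction and the newest prediction for the same timestep.
    """

    def __init__(self, history=2, alpha=0.1):
        self.history = history  # number of past chunks to keep
        self.alpha = alpha      # alpha = 0 reduces to a plain average
        self.chunks = deque(maxlen=history + 1)

    def step(self, chunk):
        """chunk: [T, D] predictions for the current and future timesteps."""
        chunk = np.asarray(chunk, dtype=np.float64)
        # Age stored chunks by one step so row 0 always refers to "now".
        aged = [c[1:] for c in self.chunks if len(c) > 1]
        self.chunks = deque(aged, maxlen=self.history + 1)
        self.chunks.append(chunk)

        candidates = np.stack([c[0] for c in self.chunks])  # [K, D]
        current = chunk[0]
        denom = np.linalg.norm(candidates, axis=1) * np.linalg.norm(current) + 1e-8
        cosine = (candidates @ current) / denom
        weights = np.exp(self.alpha * cosine)
        weights /= weights.sum()
        return weights @ candidates  # [D] action to execute now

rng = np.random.default_rng(0)
ensemble = SimilarityWeightedEnsemble(history=2, alpha=0.1)
for _ in range(3):
    chunk = rng.normal(size=(16, 7))  # stand-in for model.predict_action output
    action = ensemble.step(chunk)
print(action.shape)
```

A larger `alpha` makes the result follow the newest prediction more closely when older predictions disagree with it, while `alpha = 0` gives every stored prediction equal weight.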

	## Citation

	```bibtex
	@article{li2024cogact,
	  title={CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation},
	  author={Li, Qixiu and Liang, Yaobo and Wang, Zeyu and Luo, Lin and Chen, Xi and Liao, Mozheng and Wei, Fangyun and Deng, Yu and Xu, Sicheng and Zhang, Yizhong and others},
	  journal={arXiv preprint arXiv:2411.19650},
	  year={2024}
	}
	```