CogACT committed on
Commit c8e85e0 · verified · 1 Parent(s): 8d7fb74

Update Citation in README.md

Files changed (1)
  1. README.md +91 -90
README.md CHANGED
@@ -1,91 +1,92 @@
- ---
- license: mit
- library_name: transformers
- tags:
- - robotics
- - vla
- - diffusion
- - multimodal
- - pretraining
- language:
- - en
- pipeline_tag: robotics
- ---
- # CogACT-Large
-
- CogACT is a new advanced VLA architecture derived from VLM. Unlike previous works that directly repurpose VLM for action prediction by simple action quantization, we propose a componentized VLA architecture that has a specialized action module conditioned on VLM output. CogACT-Large employs a [DiT-L](https://github.com/facebookresearch/DiT) model as the action module.
-
- All our [code](https://github.com/microsoft/CogACT), [pretrained model weights](https://huggingface.co/CogACT), are licensed under the MIT license.
-
- Please refer to our [project page](https://cogact.github.io/) and [paper](https://cogact.github.io/CogACT_paper.pdf) for more details.
-
-
- ## Model Summary
-
- - **Developed by:** The CogACT consisting of researchers from [Microsoft Research Asia](https://www.microsoft.com/en-us/research/lab/microsoft-research-asia/).
- - **Model type:** Vision-Language-Action (language, image => robot actions)
- - **Language(s) (NLP):** en
- - **License:** MIT
- - **Model components:**
- + **Vision Backbone**: DINOv2 ViT-L/14 and SigLIP ViT-So400M/14
- + **Language Model**: Llama-2
- + **Action Model**: DiT-Large
- - **Pretraining Dataset:** A subset of [Open X-Embodiment](https://robotics-transformer-x.github.io/)
- - **Repository:** [https://github.com/microsoft/CogACT](https://github.com/microsoft/CogACT)
- - **Paper:** [CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation](https://cogact.github.io/CogACT_paper.pdf)
- - **Project Page:** [https://cogact.github.io/](https://cogact.github.io/)
-
- ## Uses
- CogACT takes a language instruction and a single view RGB image as input and predicts the next 16 normalized robot actions (consisting of the 7-DoF end effector deltas
- of the form ``x, y, z, roll, pitch, yaw, gripper``). These actions should be unnormalized and integrated by our ``Adaptive Action Ensemble``(Optional). Unnormalization and ensemble depend on the dataset statistics.
-
- CogACT models can be used zero-shot to control robots for setups seen in the [Open-X](https://robotics-transformer-x.github.io/) pretraining mixture. They can also be fine-tuned for new tasks and robot setups with an extremely small amount of demonstrations. See [our repository](https://github.com/microsoft/CogACT) for more information.
-
- Here is a simple example for inference.
-
- ```python
- # Please clone and install dependencies in our repo
- # Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, ...)
-
- from PIL import Image
- from vla import load_vla
- import torch
-
- model = load_vla(
- 'CogACT/CogACT-Large',
- load_for_training=False,
- action_model_type='DiT-L',
- future_action_window_size=15,
- )
- # about 30G Memory in fp32;
-
- # (Optional) use "model.vlm = model.vlm.to(torch.bfloat16)" to load vlm in bf16
-
- model.to('cuda:0').eval()
-
- image: Image.Image = <input_your_image>
- prompt = "move sponge near apple" # input your prompt
-
- # Predict Action (7-DoF; un-normalize for RT-1 google robot data, i.e. fractal20220817_data)
- actions, _ = model.predict_action(
- image,
- prompt,
- unnorm_key='fractal20220817_data', # input your unnorm_key of dataset
- cfg_scale = 1.5, # cfg from 1.5 to 7 also performs well
- use_ddim = True, # use DDIM sampling
- num_ddim_steps = 10, # number of steps for DDIM sampling
- )
-
- # results in 7-DoF actions of 16 steps with shape [16, 7]
- ```
-
- ## Citation
-
- ```bibtex
- @article{li2024cogact,
- title={CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation},
- author={Qixiu, Li and Yaobo, Liang and Zeyu, Wang and Lin, Luo and Xi, Chen and Mozheng, Liao and Fangyun, Wei and Yu, Deng and Sicheng, Xu and Yizhong, Zhang and Xiaofan, Wang and Bei, Liu and Jianlong, Fu and Jianmin, Bao and Dong, Chen and Yuanchun, Shi and Jiaolong, Yang and Baining, Guo},
- journal={arXiv preprint},
- year={2024},
- }
  ```
+ ---
+ license: mit
+ library_name: transformers
+ tags:
+ - robotics
+ - vla
+ - diffusion
+ - multimodal
+ - pretraining
+ language:
+ - en
+ pipeline_tag: robotics
+ ---
+ # CogACT-Large
+
+ CogACT is a new advanced VLA architecture derived from VLM. Unlike previous works that directly repurpose VLM for action prediction by simple action quantization, we propose a componentized VLA architecture that has a specialized action module conditioned on VLM output. CogACT-Large employs a [DiT-L](https://github.com/facebookresearch/DiT) model as the action module.
+
+ All our [code](https://github.com/microsoft/CogACT), [pretrained model weights](https://huggingface.co/CogACT), are licensed under the MIT license.
+
+ Please refer to our [project page](https://cogact.github.io/) and [paper](https://cogact.github.io/CogACT_paper.pdf) for more details.
+
+
+ ## Model Summary
+
+ - **Developed by:** The CogACT consisting of researchers from [Microsoft Research Asia](https://www.microsoft.com/en-us/research/lab/microsoft-research-asia/).
+ - **Model type:** Vision-Language-Action (language, image => robot actions)
+ - **Language(s) (NLP):** en
+ - **License:** MIT
+ - **Model components:**
+ + **Vision Backbone**: DINOv2 ViT-L/14 and SigLIP ViT-So400M/14
+ + **Language Model**: Llama-2
+ + **Action Model**: DiT-Large
+ - **Pretraining Dataset:** A subset of [Open X-Embodiment](https://robotics-transformer-x.github.io/)
+ - **Repository:** [https://github.com/microsoft/CogACT](https://github.com/microsoft/CogACT)
+ - **Paper:** [CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation](https://cogact.github.io/CogACT_paper.pdf)
+ - **Project Page:** [https://cogact.github.io/](https://cogact.github.io/)
+
+ ## Uses
+ CogACT takes a language instruction and a single view RGB image as input and predicts the next 16 normalized robot actions (consisting of the 7-DoF end effector deltas
+ of the form ``x, y, z, roll, pitch, yaw, gripper``). These actions should be unnormalized and integrated by our ``Adaptive Action Ensemble``(Optional). Unnormalization and ensemble depend on the dataset statistics.
+
+ CogACT models can be used zero-shot to control robots for setups seen in the [Open-X](https://robotics-transformer-x.github.io/) pretraining mixture. They can also be fine-tuned for new tasks and robot setups with an extremely small amount of demonstrations. See [our repository](https://github.com/microsoft/CogACT) for more information.
+
+ Here is a simple example for inference.
+
+ ```python
+ # Please clone and install dependencies in our repo
+ # Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, ...)
+
+ from PIL import Image
+ from vla import load_vla
+ import torch
+
+ model = load_vla(
+ 'CogACT/CogACT-Large',
+ load_for_training=False,
+ action_model_type='DiT-L',
+ future_action_window_size=15,
+ )
+ # about 30G Memory in fp32;
+
+ # (Optional) use "model.vlm = model.vlm.to(torch.bfloat16)" to load vlm in bf16
+
+ model.to('cuda:0').eval()
+
+ image: Image.Image = <input_your_image>
+ prompt = "move sponge near apple" # input your prompt
+
+ # Predict Action (7-DoF; un-normalize for RT-1 google robot data, i.e. fractal20220817_data)
+ actions, _ = model.predict_action(
+ image,
+ prompt,
+ unnorm_key='fractal20220817_data', # input your unnorm_key of dataset
+ cfg_scale = 1.5, # cfg from 1.5 to 7 also performs well
+ use_ddim = True, # use DDIM sampling
+ num_ddim_steps = 10, # number of steps for DDIM sampling
+ )
+
+ # results in 7-DoF actions of 16 steps with shape [16, 7]
+ ```
+
+ ## Citation
+
+ ```bibtex
+ @article{li2024cogact,
+ title={CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation},
+ author={Li, Qixiu and Liang, Yaobo and Wang, Zeyu and Luo, Lin and Chen, Xi and Liao, Mozheng and Wei, Fangyun and Deng, Yu and Xu, Sicheng and Zhang, Yizhong and others},
+ journal={arXiv preprint arXiv:2411.19650},
+ year={2024}
+ }
+ }
  ```