|
--- |
|
license: cc-by-nc-sa-4.0 |
|
pipeline_tag: image-to-video |
|
tags: |
|
- turing |
|
- autonomous driving |
|
- video generation |
|
- world model |
|
--- |
|
|
|
# Terra |
|
|
|
<div id="scroll-container" style="display: flex; overflow-x: auto; gap: 10px; scroll-behavior: smooth; padding: 10px; border: 1px solid #ddd;"> |
|
<video width="512" controls> |
|
<source src="https://huggingface.co/turing-motors/Terra/resolve/main/assets/videos/row_1.mp4" type="video/mp4"> |
|
</video> |
|
<video width="512" controls> |
|
<source src="https://huggingface.co/turing-motors/Terra/resolve/main/assets/videos/row_2.mp4" type="video/mp4"> |
|
</video> |
|
<video width="512" controls> |
|
<source src="https://huggingface.co/turing-motors/Terra/resolve/main/assets/videos/row_3.mp4" type="video/mp4"> |
|
</video> |
|
<video width="512" controls> |
|
<source src="https://huggingface.co/turing-motors/Terra/resolve/main/assets/videos/row_4.mp4" type="video/mp4"> |
|
</video> |
|
<video width="512" controls> |
|
<source src="https://huggingface.co/turing-motors/Terra/resolve/main/assets/videos/row_5.mp4" type="video/mp4"> |
|
</video> |
|
<video width="512" controls> |
|
<source src="https://huggingface.co/turing-motors/Terra/resolve/main/assets/videos/row_6.mp4" type="video/mp4"> |
|
</video> |
|
<video width="512" controls> |
|
<source src="https://huggingface.co/turing-motors/Terra/resolve/main/assets/videos/row_7.mp4" type="video/mp4"> |
|
</video> |
|
</div> |
|
|
|
**Terra** is a world model designed for autonomous driving and serves as a baseline model in the [ACT-Bench](https://github.com/turingmotors/ACT-Bench) framework.
|
Terra generates video continuations conditioned on a short video clip of roughly three frames and a trajectory instruction.
|
A key feature of Terra is its **high adherence to trajectory instructions**, enabling accurate and reliable action-conditioned video generation. |
|
|
|
We have developed two versions of the Terra model to date. The `v1` model, as detailed in the paper, exhibits a bias towards generating videos that veer to the right. To address this issue, we introduced the `v2` model, incorporating slight architectural modifications to mitigate this tendency and produce more balanced outputs. The performance of each model is outlined below. |
|
|
|
| | Vista | Terra (v1) | Terra (v2) |
|---|---|---|---|
| Accuracy (↑) | 0.307 | 0.441 | **0.632** |
| ADE (↓) | 4.50 | 3.98 | **3.86** |
| FDE (↓) | 8.66 | 8.21 | **8.05** |
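
Here, Accuracy reflects how faithfully the generated video follows the instructed action, while ADE and FDE are the average and final displacement errors between the trajectory estimated from the generated video and the instructed reference trajectory. As a reference, a common formulation of these displacement errors (our notation, not necessarily the paper's exact definition) is:

$$
\mathrm{ADE} = \frac{1}{T}\sum_{t=1}^{T}\lVert \hat{p}_t - p_t \rVert_2,
\qquad
\mathrm{FDE} = \lVert \hat{p}_T - p_T \rVert_2
$$

where $p_t$ is the instructed waypoint at step $t$ and $\hat{p}_t$ is the corresponding position estimated from the generated video.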
|
|
|
## Related Links |
|
|
|
For more technical details and discussions, please refer to: |
|
- **Paper:** https://arxiv.org/abs/2412.05337 |
|
- **Code:** https://github.com/turingmotors/ACT-Bench |
|
|
|
## How to use |
|
|
|
We have verified execution on a machine equipped with a single NVIDIA H100 80GB GPU. That said, the model should also run on any machine with an NVIDIA GPU that has 16 GB or more of VRAM.
|
|
|
Terra consists of an Image Tokenizer, an Autoregressive Transformer, and a Video Refiner. Because setting up the Video Refiner is involved, we have not included its implementation in this Hugging Face repository. Instead, **the implementation and setup instructions for the Video Refiner are provided in the [ACT-Bench repository](https://github.com/turingmotors/ACT-Bench)**. Here, we provide an example of generating video continuations using the Image Tokenizer and the Autoregressive Transformer, conditioned on image frames and a template trajectory. The resulting video quality may look suboptimal because each frame is decoded individually; to improve the visual quality, you can use the Video Refiner.
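
As a rough mental model of this pipeline, here is a minimal, self-contained sketch of the tokenize → autoregressive rollout → decode flow. All class names, tensor shapes, vocabulary size, and the trajectory-conditioning scheme below are illustrative assumptions, not the repository's actual API; see [`inference.py`](./inference.py) for the real implementation.

```python
# Conceptual sketch only -- NOT Terra's actual API. Shapes, vocabulary size,
# and the conditioning scheme are assumptions made for illustration.
import torch


class DummyImageTokenizer(torch.nn.Module):
    """Stand-in for the image tokenizer: maps frames to discrete token ids."""

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, 3, H, W) -> (T, tokens_per_frame)
        return torch.randint(0, 1024, (frames.shape[0], 64))


class DummyARTransformer(torch.nn.Module):
    """Stand-in for the autoregressive transformer over frame tokens."""

    def forward(self, tokens: torch.Tensor, trajectory: torch.Tensor) -> torch.Tensor:
        # Predict the next frame's token ids from past tokens + the action.
        return torch.randint(0, 1024, (1, tokens.shape[1]))


def rollout(tokenizer, transformer, frames, trajectory, horizon=22):
    tokens = tokenizer(frames)            # tokenize the ~3 conditioning frames
    for _ in range(horizon):              # autoregressive continuation
        next_frame_tokens = transformer(tokens, trajectory)
        tokens = torch.cat([tokens, next_frame_tokens], dim=0)
    return tokens                         # decoded back to pixels frame by frame


frames = torch.rand(3, 3, 288, 512)       # three conditioning frames (assumed size)
trajectory = torch.rand(10, 2)            # assumed (x, y) waypoints
video_tokens = rollout(DummyImageTokenizer(), DummyARTransformer(), frames, trajectory)
print(video_tokens.shape)                 # torch.Size([25, 64])
```

Because each generated frame's tokens are decoded back to pixels independently in this setup, the raw output can look flickery, which is exactly what the Video Refiner is meant to clean up.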
|
|
|
### Install Packages |
|
|
|
We use [uv](https://docs.astral.sh/uv/) to manage Python packages. If you don't have uv installed in your environment, please see its documentation for installation instructions.
|
|
|
```shell |
|
$ git clone https://huggingface.co/turing-motors/Terra
$ cd Terra
$ uv sync
|
``` |
|
|
|
### Action-Conditioned Video Generation without Video Refiner |
|
|
|
```shell |
|
$ uv run python inference.py
|
``` |
|
|
|
This command generates a video using three image frames located in [`assets/conditioning_frames`](./assets/conditioning_frames/) and the `curving_to_left/curving_to_left_moderate` trajectory defined in the trajectory template file [`assets/template_trajectory.json`](./assets/template_trajectory.json). |
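
If you want to try a different trajectory, a small script like the one below can enumerate the names defined in the template file. It assumes only that `assets/template_trajectory.json` is a (possibly nested) JSON object whose key paths form trajectory names such as `curving_to_left/curving_to_left_moderate`; the format of the values is not assumed.

```python
# List the trajectory names available in the template file (assumed layout:
# nested JSON object whose key paths name trajectories).
import json

with open("assets/template_trajectory.json") as f:
    templates = json.load(f)


def walk(node, prefix=""):
    """Print slash-joined key paths for nested dict entries."""
    if isinstance(node, dict):
        for key, value in node.items():
            walk(value, f"{prefix}/{key}" if prefix else key)
    else:
        print(prefix)  # leaf: a usable trajectory name


walk(templates)
```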
|
|
|
You can find more details by referring to the [`inference.py`](./inference.py) script. |
|
|
|
## Citation |
|
|
|
```bibtex |
|
@misc{arai2024actbench, |
|
title={ACT-Bench: Towards Action Controllable World Models for Autonomous Driving}, |
|
author={Hidehisa Arai and Keishi Ishihara and Tsubasa Takahashi and Yu Yamaguchi}, |
|
year={2024}, |
|
eprint={2412.05337}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV}, |
|
url={https://arxiv.org/abs/2412.05337}, |
|
} |
|
``` |