---
license: mit
datasets:
- yunzhong-hou/DroneMotion-99k
tags:
- AIGC
- videography
- drone
- drone video
- UAV
---

# DVGFormer: Learning Camera Movement Control from Real-World Drone Videos


<!-- <a href="https://arxiv.org/abs/2412.09620"><img src="https://img.shields.io/badge/Paper-arXiv-red?style=for-the-badge" height=22.5></a>
<a href="https://dvgformer.github.io/"><img src="https://img.shields.io/badge/Project-Page-blue?style=for-the-badge" height=22.5></a> -->
[**Paper**](https://arxiv.org/abs/2412.09620) | [**Project Page**](https://dvgformer.github.io/) | [**GitHub**](https://github.com/hou-yz/dvgformer) | [**Data**](https://huggingface.co/datasets/yunzhong-hou/DroneMotion-99k)

Official implementation of our paper: 
<br>**Learning Camera Movement Control from Real-World Drone Videos**<br>
[**Yunzhong Hou**](https://hou-yz.github.io/), [**Liang Zheng**](https://zheng-lab-anu.github.io/), [**Philip Torr**](https://eng.ox.ac.uk/people/philip-torr/)<br>

*"To record as is, not to create from scratch."*
![Video preview](https://github.com/hou-yz/dvgformer/raw/main/assets/videos/teaser.gif)

Abstract: *This study seeks to automate camera movement control for filming existing subjects into attractive videos, in contrast to creating non-existent content by directly generating pixels. We select drone videos as our test case due to their rich and challenging motion patterns, distinctive viewing angles, and precise controls. Existing AI videography methods struggle with limited appearance diversity in simulation training, the high cost of recording expert operations, and the difficulty of designing heuristic-based goals that cover all scenarios. To avoid these issues, we propose a scalable method that involves collecting real-world training data to improve diversity, extracting camera trajectories automatically to minimize annotation costs, and training an effective architecture that does not rely on heuristics. Specifically, we collect 99k high-quality trajectories by running 3D reconstruction on online videos, connecting camera poses from consecutive frames to formulate 3D camera paths, and using a Kalman filter to identify and remove low-quality data. Moreover, we introduce DVGFormer, an auto-regressive transformer that leverages the camera path and images from all past frames to predict the camera movement in the next frame. We evaluate our system across 38 synthetic natural scenes and 7 real city 3D scans. We show that our system effectively learns to perform challenging camera movements such as navigating through obstacles, maintaining low altitude to increase perceived speed, and orbiting towers and buildings, which are very useful for recording high-quality videos.*
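
As a rough illustration of the trajectory-filtering step mentioned in the abstract, the sketch below runs a constant-velocity Kalman filter over a reconstructed camera path and flags trajectories whose prediction residuals (innovations) are large. This is a minimal sketch under assumed conventions (per-frame 3D camera centers, unit frame interval, illustrative noise parameters and threshold), not the actual pipeline used to build DroneMotion-99k.

```python
# Minimal sketch: a constant-velocity Kalman filter over 3D camera centers.
# Large innovations suggest jittery or broken reconstructions.
import numpy as np

def trajectory_innovations(positions, process_var=1e-2, obs_var=1e-1):
    """positions: (T, 3) array of per-frame camera centers."""
    dim = positions.shape[1]
    # State: [position, velocity]; constant-velocity transition model (dt = 1).
    F = np.eye(2 * dim)
    F[:dim, dim:] = np.eye(dim)
    H = np.hstack([np.eye(dim), np.zeros((dim, dim))])
    Q = process_var * np.eye(2 * dim)
    R = obs_var * np.eye(dim)

    x = np.concatenate([positions[0], np.zeros(dim)])
    P = np.eye(2 * dim)
    innovations = []
    for z in positions[1:]:
        # Predict the next state.
        x = F @ x
        P = F @ P @ F.T + Q
        # Innovation: how far the observation deviates from the prediction.
        y = z - H @ x
        S = H @ P @ H.T + R
        innovations.append(float(np.sqrt(y @ np.linalg.solve(S, y))))
        # Update with the observation.
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ y
        P = (np.eye(2 * dim) - K @ H) @ P
    return np.array(innovations)

# A trajectory could then be discarded when its residuals exceed a threshold:
# keep = trajectory_innovations(positions).max() < 3.0  # threshold is illustrative
```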

## Installation
Please refer to our [GitHub repo](https://github.com/hou-yz/dvgformer) for detailed installation steps.

## Usage
Assuming you have followed the installation steps in the [GitHub repo](https://github.com/hou-yz/dvgformer), you can use the following code to load the model checkpoint:
```python
import torch
from src.models import DVGFormerModel

# Download the pretrained checkpoint from the Hugging Face Hub,
# move it to the GPU, and cast it to bfloat16 for inference.
model = DVGFormerModel.from_pretrained(
    'yunzhong-hou/DVGFormer'
).to('cuda').to(torch.bfloat16)
```
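
Inference is auto-regressive: at each step, the model conditions on the camera path and images from all past frames and predicts the camera movement for the next frame. The loop below is only a conceptual sketch of that rollout; `render_frame`, `predict_next_action`, and `apply_action` are hypothetical placeholders rather than the actual DVGFormer API, so please refer to `blender_eval.py` in the GitHub repo for the real evaluation loop.

```python
import torch

# Conceptual sketch of the auto-regressive rollout; the environment/model
# methods below are hypothetical placeholders, not the DVGFormer API.
@torch.no_grad()
def rollout(model, env, initial_pose, num_frames=100):
    poses, frames = [initial_pose], []
    for _ in range(num_frames):
        # Capture the image seen from the current camera pose.
        frames.append(env.render_frame(poses[-1]))         # hypothetical
        # Condition on all past frames and poses to predict the next movement.
        action = model.predict_next_action(frames, poses)  # hypothetical
        # Apply the predicted movement to obtain the next camera pose.
        poses.append(env.apply_action(poses[-1], action))  # hypothetical
    return poses, frames
```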

For Blender-based evaluation, you can run the following script from the GitHub repo:
```sh
python blender_eval.py 
```

## Citation
If you find this project useful, please consider citing:
```
@article{hou2024dvgformer,
  author    = {Hou, Yunzhong and Zheng, Liang and Torr, Philip},
  title     = {Learning Camera Movement Control from Real-World Drone Videos},
  journal   = {arXiv preprint},
  year      = {2024},
}
```