app_file: app.py
pinned: false
---
# Video-to-Audio Generation with Hidden Alignment

Manjie Xu, Chenxing Li, Yong Ren, Rilin Chen, Yu Gu, Wei Liang, Dong Yu

Tencent AI Lab

<a href='https://arxiv.org/abs/2407.07464'>
<img src='https://img.shields.io/badge/Paper-Arxiv-green?style=plastic&logo=arXiv&logoColor=green' alt='Paper Arxiv'>
</a>
<a href='https://sites.google.com/view/vta-ldm/home'>
<img src='https://img.shields.io/badge/Project-Page-blue?style=plastic&logo=Google%20chrome&logoColor=blue' alt='Project Page'>
</a>

Generating semantically and temporally aligned audio content in accordance with video input has become a focal point for researchers, particularly following the remarkable breakthrough in text-to-video generation. We aim to offer insights into the video-to-audio generation paradigm.

## Install
First, install the Python requirements. We recommend using conda:

```
conda create -n vta-ldm python=3.10
conda activate vta-ldm
pip install -r requirements.txt
```
Then download the checkpoints from [Hugging Face](https://huggingface.co/ariesssxu/vta-ldm-clip4clip-v-large). We recommend using Git LFS:
```
mkdir ckpt && cd ckpt
git clone https://huggingface.co/ariesssxu/vta-ldm-clip4clip-v-large
# pull if large files are skipped:
cd vta-ldm-clip4clip-v-large && git lfs pull
```

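If the checkpoint repository was cloned without Git LFS set up, the large weight files are left as small text pointer files. The following is a minimal sketch, not part of this repository, that lists such pointers so you can tell whether `git lfs pull` is still needed. The function names `is_lfs_pointer` and `find_lfs_pointers` are hypothetical; the only fact relied on is that a Git LFS pointer file starts with the line `version https://git-lfs.github.com/spec/v1`.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the Git LFS pointer format.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if `path` looks like an un-pulled Git LFS pointer file."""
    with path.open("rb") as fh:
        return fh.read(len(LFS_HEADER)) == LFS_HEADER


def find_lfs_pointers(root: Path) -> list[Path]:
    """Recursively collect pointer files under `root`, skipping the .git directory."""
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file()
        and ".git" not in p.relative_to(root).parts
        and is_lfs_pointer(p)
    )


if __name__ == "__main__":
    # Directory created by the git clone step in the Install section.
    ckpt_dir = Path("ckpt/vta-ldm-clip4clip-v-large")
    if not ckpt_dir.is_dir():
        print(f"{ckpt_dir} not found; clone the checkpoint repository first")
    else:
        missing = find_lfs_pointers(ckpt_dir)
        for p in missing:
            print(f"not downloaded yet: {p}")
        if not missing:
            print("no LFS pointer files found")
```

Run it from the directory that contains `ckpt`. Any file it reports still needs `git lfs pull`.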
## Model List
- ✅ VTA_LDM (the base model)
- 🕳️ VTA_LDM+IB/LB/CAVP/VIVIT
- 🕳️ VTA_LDM+text
- 🕳️ VTA_LDM+PE
- 🕳️ VTA_LDM+text+concat
- 🕳️ VTA_LDM+pretrain+text+concat

## Inference
Put the video clips into the `data` directory. Run the provided inference script to generate audio from the input videos:
```
bash inference_from_video.sh
```
You can customize the hyperparameters to fit your requirements. We also provide an ffmpeg-based script that merges the generated audio with the original video:

```
bash tools/merge_video_audio
```

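The contents of `tools/merge_video_audio` are not shown in this README, so the following is only a sketch of how such a merge could be done with ffmpeg, not the script itself. It copies the video stream unchanged, takes the audio from the generated file, and stops at the shorter of the two inputs. The function names, the choice of AAC for the audio codec, and the file paths in the example are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def build_merge_command(video: Path, audio: Path, output: Path) -> list[str]:
    """Build an ffmpeg command that replaces the audio track of `video` with `audio`."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it exists
        "-i", str(video),     # input 0: original video
        "-i", str(audio),     # input 1: generated audio
        "-map", "0:v:0",      # take the first video stream from input 0
        "-map", "1:a:0",      # take the first audio stream from input 1
        "-c:v", "copy",       # do not re-encode the video
        "-c:a", "aac",        # encode the audio as AAC (assumed MP4 output)
        "-shortest",          # stop when the shorter stream ends
        str(output),
    ]


def merge(video: Path, audio: Path, output: Path) -> None:
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_merge_command(video, audio, output), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_merge_command(
        Path("data/clip.mp4"), Path("outputs/clip.wav"), Path("merged/clip.mp4")
    )
    print(" ".join(cmd))
```

Building the argument list separately from running it makes the command easy to inspect before ffmpeg is invoked.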
## Training
TBD. Code coming soon.

## Acknowledgements
This work builds on several great repositories:
- [diffusers](https://github.com/huggingface/diffusers)
- [Tango](https://github.com/declare-lab/tango)
- [AudioLDM](https://github.com/haoheliu/AudioLDM)

## Cite us
```
@misc{xu2024vta-ldm,
      title={Video-to-Audio Generation with Hidden Alignment},
      author={Manjie Xu and Chenxing Li and Yong Ren and Rilin Chen and Yu Gu and Wei Liang and Dong Yu},
      year={2024},
      eprint={2407.07464},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2407.07464},
}
```

## Disclaimer

This is not an official product by Tencent Ltd.