fffiloni committed "Update README.md" (commit ea31508, verified; parent: c673f60)
Files changed (1): README.md (+69 -1)

README.md CHANGED (updated content, from line 8 of the file onward):
sdk_version: 4.39.0
app_file: app.py
pinned: false
---
 
 
 
# Video-to-Audio Generation with Hidden Alignment
Manjie Xu, Chenxing Li, Yong Ren, Rilin Chen, Yu Gu, Wei Liang, Dong Yu

Tencent AI Lab

<a href='https://arxiv.org/abs/2407.07464'>
<img src='https://img.shields.io/badge/Paper-Arxiv-green?style=plastic&logo=arXiv&logoColor=green' alt='Paper Arxiv'>
</a>
<a href='https://sites.google.com/view/vta-ldm/home'>
<img src='https://img.shields.io/badge/Project-Page-blue?style=plastic&logo=Google%20chrome&logoColor=blue' alt='Project Page'>
</a>

Generating audio that is semantically and temporally aligned with a given video has become a focal point for researchers, particularly following the remarkable breakthroughs in text-to-video generation. We aim to offer insights into the video-to-audio generation paradigm.

## Install
First install the Python requirements. We recommend using conda:

```
conda create -n vta-ldm python=3.10
conda activate vta-ldm
pip install -r requirements.txt
```
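As a quick sanity check that the environment is ready for GPU inference (this assumes PyTorch is installed via `requirements.txt`, which is typical for latent-diffusion models but not spelled out here), you can run:

```
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```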
Then download the checkpoints from [Hugging Face](https://huggingface.co/ariesssxu/vta-ldm-clip4clip-v-large); we recommend using git lfs:
```
mkdir ckpt && cd ckpt
git clone https://huggingface.co/ariesssxu/vta-ldm-clip4clip-v-large
# pull manually if the large files were skipped during cloning:
cd vta-ldm-clip4clip-v-large && git lfs pull
```
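To confirm that the checkpoint weights were actually downloaded rather than left as LFS pointer stubs, one option is to run the following from inside the cloned checkpoint directory:

```
git lfs ls-files   # downloaded files are marked with "*", pointer-only files with "-"
du -sh .           # a pointer-only checkout is only a few hundred kilobytes
```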

## Model List
- ✅ VTA_LDM (the base model)
- 🕳️ VTA_LDM+IB/LB/CAVP/VIVIT
- 🕳️ VTA_LDM+text
- 🕳️ VTA_LDM+PE
- 🕳️ VTA_LDM+text+concat
- 🕳️ VTA_LDM+pretrain+text+concat

## Inference
Put your video clips into the `data` directory, then run the provided inference script to generate audio for the input videos:
```
bash inference_from_video.sh
```
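For example, a minimal end-to-end run might look like this (mp4 clips are assumed here; the README does not pin down the expected container format):

```
mkdir -p data
cp /path/to/your/clips/*.mp4 data/
bash inference_from_video.sh
```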
You can customize the hyperparameters to fit your requirements. We also provide an ffmpeg-based script that merges the generated audio with the original video:

```
bash tools/merge_video_audio
```
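If you prefer to mux the streams by hand, a plain ffmpeg call along these lines achieves the same effect (the file names are placeholders, and the exact options used by the bundled script may differ):

```
ffmpeg -i input_video.mp4 -i generated_audio.wav \
  -c:v copy -c:a aac -map 0:v:0 -map 1:a:0 -shortest output_with_audio.mp4
```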
## Training
TBD. Code coming soon.

## Ack
This work builds on several great repositories:
- [diffusers](https://github.com/huggingface/diffusers)
- [Tango](https://github.com/declare-lab/tango)
- [AudioLDM](https://github.com/haoheliu/AudioLDM)

## Cite us
```
@misc{xu2024vta-ldm,
  title={Video-to-Audio Generation with Hidden Alignment},
  author={Manjie Xu and Chenxing Li and Yong Ren and Rilin Chen and Yu Gu and Wei Liang and Dong Yu},
  year={2024},
  eprint={2407.07464},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2407.07464},
}
```
## Disclaimer

This is not an official product of Tencent Ltd.