shisheng7 committed on
Commit
6f0f897
•
1 Parent(s): bd6ff99

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -19,4 +19,4 @@ pipeline_tag: image-to-video
 
  ## 📖 Introduction
 
- In the field of speech-driven video generation, creating Mandarin videos presents significant challenges. Collecting comprehensive Mandarin datasets is difficult, and Mandarin's complex lip shapes further complicate model training compared to English. Our research involved collecting 29 hours of Mandarin speech video from employees at JD Health International Inc., resulting in the jdh-Hallo dataset. This dataset features a wide range of ages and speaking styles, including both conversational and specialized medical topics. To adapt the JoyHallo model for Mandarin, we utilized the Chinese-wav2vec 2.0 model for audio feature embedding. Additionally, we enhanced the Hierarchical Audio-Driven Visual Synthesis module by integrating a Cross Attention mechanism, which aggregates information from lip, expression, and pose features. This integration not only improves information utilization efficiency but also accelerates inference speed by 14.3%. The moderate coupling of information enables the model to learn relationships between facial features, addressing issues of unnatural appearance. These advancements lead to more precise alignment between audio inputs and visual outputs, enhancing the quality and realism of synthesized videos. It is noteworthy that JoyHallo maintains its strong ability to generate English videos, demonstrating excellent cross-language generation capabilities.
 
+ In audio-driven video generation, creating Mandarin videos presents significant challenges. Collecting comprehensive Mandarin datasets is difficult, and the complex lip movements in Mandarin further complicate model training compared to English. In this study, we collected 29 hours of Mandarin speech video from JD Health International Inc. employees, resulting in the jdh-Hallo dataset. This dataset includes a diverse range of ages and speaking styles, encompassing both conversational and specialized medical topics. To adapt the JoyHallo model for Mandarin, we employed the Chinese wav2vec2 model for audio feature embedding. A semi-decoupled structure is proposed to capture inter-feature relationships among lip, expression, and pose features. This integration not only improves information utilization efficiency but also accelerates inference speed by 14.3%. Notably, JoyHallo maintains its strong ability to generate English videos, demonstrating excellent cross-language generation capabilities.