Upload 855 files
This view is limited to 50 files because it contains too many changes; see the raw diff for the complete change set.
- .gitattributes +6 -0
- Amphion/.github/CODE_OF_CONDUCT.md +132 -0
- Amphion/.github/CONTRIBUTING.md +77 -0
- Amphion/.github/ISSUE_TEMPLATE/bug_report.md +32 -0
- Amphion/.github/ISSUE_TEMPLATE/docs_feedback.md +17 -0
- Amphion/.github/ISSUE_TEMPLATE/feature_request.md +20 -0
- Amphion/.github/ISSUE_TEMPLATE/help_wanted.md +32 -0
- Amphion/.github/pull_request_template.md +32 -0
- Amphion/.github/workflows/check_format.yml +12 -0
- Amphion/.gitignore +64 -0
- Amphion/Dockerfile +64 -0
- Amphion/LICENSE +21 -0
- Amphion/README.md +192 -0
- Amphion/bins/calc_metrics.py +268 -0
- Amphion/bins/codec/inference.py +99 -0
- Amphion/bins/codec/train.py +79 -0
- Amphion/bins/svc/inference.py +265 -0
- Amphion/bins/svc/preprocess.py +183 -0
- Amphion/bins/svc/train.py +111 -0
- Amphion/bins/tta/inference.py +94 -0
- Amphion/bins/tta/preprocess.py +195 -0
- Amphion/bins/tta/train_tta.py +77 -0
- Amphion/bins/tts/inference.py +169 -0
- Amphion/bins/tts/preprocess.py +244 -0
- Amphion/bins/tts/train.py +152 -0
- Amphion/bins/vc/Noro/train.py +82 -0
- Amphion/bins/vocoder/inference.py +115 -0
- Amphion/bins/vocoder/preprocess.py +151 -0
- Amphion/bins/vocoder/train.py +93 -0
- Amphion/config/audioldm.json +92 -0
- Amphion/config/autoencoderkl.json +69 -0
- Amphion/config/base.json +185 -0
- Amphion/config/comosvc.json +215 -0
- Amphion/config/facodec.json +67 -0
- Amphion/config/fs2.json +120 -0
- Amphion/config/jets.json +120 -0
- Amphion/config/noro.json +76 -0
- Amphion/config/ns2.json +88 -0
- Amphion/config/svc/base.json +119 -0
- Amphion/config/svc/diffusion.json +142 -0
- Amphion/config/transformer.json +179 -0
- Amphion/config/tts.json +25 -0
- Amphion/config/valle.json +55 -0
- Amphion/config/vits.json +101 -0
- Amphion/config/vitssvc.json +306 -0
- Amphion/config/vocoder.json +84 -0
- Amphion/egs/codec/FAcodec/README.md +51 -0
- Amphion/egs/codec/FAcodec/exp_custom_data.json +80 -0
- Amphion/egs/codec/FAcodec/train.sh +27 -0
- Amphion/egs/datasets/README.md +458 -0
.gitattributes
CHANGED
@@ -33,3 +33,9 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+Amphion/imgs/vc/NoroVC.png filter=lfs diff=lfs merge=lfs -text
+Amphion/imgs/vocoder/gan/MSSBCQTD.png filter=lfs diff=lfs merge=lfs -text
+Amphion/models/codec/facodec/modules/JDC/bst.t7 filter=lfs diff=lfs merge=lfs -text
+Amphion/models/tts/maskgct/g2p/sources/chinese_lexicon.txt filter=lfs diff=lfs merge=lfs -text
+Amphion/models/tts/maskgct/wav/prompt.wav filter=lfs diff=lfs merge=lfs -text
+Amphion/visualization/SingVisio/System_Introduction_of_SingVisio_V2.pdf filter=lfs diff=lfs merge=lfs -text
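Entries like the six added above are typically generated by `git lfs track` rather than edited by hand. A minimal sketch of that workflow, using one of the paths from this hunk:

```bash
# Register a large binary with Git LFS; this appends the matching
# filter line to .gitattributes.
git lfs track "Amphion/models/tts/maskgct/wav/prompt.wav"

# List the patterns LFS currently tracks, then commit the rule
# together with the file itself.
git lfs track
git add .gitattributes Amphion/models/tts/maskgct/wav/prompt.wav
git commit -m "Track MaskGCT prompt audio with Git LFS"
```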
Amphion/.github/CODE_OF_CONDUCT.md
ADDED
@@ -0,0 +1,132 @@
+
+# Contributor Covenant Code of Conduct
+
+## Our Pledge
+
+We as members, contributors, and leaders pledge to make participation in our
+community a harassment-free experience for everyone, regardless of age, body
+size, visible or invisible disability, ethnicity, sex characteristics, gender
+identity and expression, level of experience, education, socio-economic status,
+nationality, personal appearance, race, caste, color, religion, or sexual
+identity and orientation.
+
+We pledge to act and interact in ways that contribute to an open, welcoming,
+diverse, inclusive, and healthy community.
+
+## Our Standards
+
+Examples of behavior that contributes to a positive environment for our
+community include:
+
+* Demonstrating empathy and kindness toward other people
+* Being respectful of differing opinions, viewpoints, and experiences
+* Giving and gracefully accepting constructive feedback
+* Accepting responsibility and apologizing to those affected by our mistakes,
+  and learning from the experience
+* Focusing on what is best not just for us as individuals, but for the overall
+  community
+
+Examples of unacceptable behavior include:
+
+* The use of sexualized language or imagery, and sexual attention or advances of
+  any kind
+* Trolling, insulting or derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or email address,
+  without their explicit permission
+* Other conduct which could reasonably be considered inappropriate in a
+  professional setting
+
+## Enforcement Responsibilities
+
+Community leaders are responsible for clarifying and enforcing our standards of
+acceptable behavior and will take appropriate and fair corrective action in
+response to any behavior that they deem inappropriate, threatening, offensive,
+or harmful.
+
+Community leaders have the right and responsibility to remove, edit, or reject
+comments, commits, code, wiki edits, issues, and other contributions that are
+not aligned to this Code of Conduct, and will communicate reasons for moderation
+decisions when appropriate.
+
+## Scope
+
+This Code of Conduct applies within all community spaces, and also applies when
+an individual is officially representing the community in public spaces.
+Examples of representing our community include using an official email address,
+posting via an official social media account, or acting as an appointed
+representative at an online or offline event.
+
+## Enforcement
+
+Instances of abusive, harassing, or otherwise unacceptable behavior may be
+reported to the community leaders responsible for enforcement.
+All complaints will be reviewed and investigated promptly and fairly.
+
+All community leaders are obligated to respect the privacy and security of the
+reporter of any incident.
+
+## Enforcement Guidelines
+
+Community leaders will follow these Community Impact Guidelines in determining
+the consequences for any action they deem in violation of this Code of Conduct:
+
+### 1. Correction
+
+**Community Impact**: Use of inappropriate language or other behavior deemed
+unprofessional or unwelcome in the community.
+
+**Consequence**: A private, written warning from community leaders, providing
+clarity around the nature of the violation and an explanation of why the
+behavior was inappropriate. A public apology may be requested.
+
+### 2. Warning
+
+**Community Impact**: A violation through a single incident or series of
+actions.
+
+**Consequence**: A warning with consequences for continued behavior. No
+interaction with the people involved, including unsolicited interaction with
+those enforcing the Code of Conduct, for a specified period of time. This
+includes avoiding interactions in community spaces as well as external channels
+like social media. Violating these terms may lead to a temporary or permanent
+ban.
+
+### 3. Temporary Ban
+
+**Community Impact**: A serious violation of community standards, including
+sustained inappropriate behavior.
+
+**Consequence**: A temporary ban from any sort of interaction or public
+communication with the community for a specified period of time. No public or
+private interaction with the people involved, including unsolicited interaction
+with those enforcing the Code of Conduct, is allowed during this period.
+Violating these terms may lead to a permanent ban.
+
+### 4. Permanent Ban
+
+**Community Impact**: Demonstrating a pattern of violation of community
+standards, including sustained inappropriate behavior, harassment of an
+individual, or aggression toward or disparagement of classes of individuals.
+
+**Consequence**: A permanent ban from any sort of public interaction within the
+community.
+
+## Attribution
+
+This Code of Conduct is adapted from the [Contributor Covenant][homepage],
+version 2.1, available at
+[https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].
+
+Community Impact Guidelines were inspired by
+[Mozilla's code of conduct enforcement ladder][Mozilla CoC].
+
+For answers to common questions about this code of conduct, see the FAQ at
+[https://www.contributor-covenant.org/faq][FAQ]. Translations are available at
+[https://www.contributor-covenant.org/translations][translations].
+
+[homepage]: https://www.contributor-covenant.org
+[v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
+[Mozilla CoC]: https://github.com/mozilla/diversity
+[FAQ]: https://www.contributor-covenant.org/faq
+[translations]: https://www.contributor-covenant.org/translations
Amphion/.github/CONTRIBUTING.md
ADDED
@@ -0,0 +1,77 @@
+# Welcome to the Amphion Community!
+
+We greatly appreciate your interest in contributing to Amphion. Your involvement plays a pivotal role in our collective growth, and we are dedicated to nurturing a cooperative and inclusive space for all contributors. To ensure a respectful and productive atmosphere, all contributors must adhere to the Amphion [Code of Conduct](CODE_OF_CONDUCT.md).
+
+## Contributions
+
+All kinds of contributions are welcome, including but not limited to:
+- **Issue Reporting**: Report bugs or suggest features through GitHub Issues.
+- **Bug Fixes**: Identify and fix software issues to improve functionality.
+- **Developing New Features**: Bring innovative and impactful enhancements to Amphion.
+- **Implementing New Checkpoints**: Introduce checkpoints to optimize workflows.
+- **Recipe Contributions**: Share your unique and practical coding solutions.
+- **Diverse Contributions**: Your participation isn't limited! Contribute to documentation, community support, and more.
+
+## How to Contribute
+1. **Fork the Repository**: Start by forking the Amphion repository on GitHub.
+2. **Clone Your Fork**: Clone your fork to your development machine.
+3. **Create a Branch**: Create a new branch for your changes.
+4. **Test Your Changes**: Make sure your updates are compatible and do not break existing functionality.
+5. **Commit Your Changes**: Make small, focused commits with clear descriptions.
+6. **Update Your Fork**: Push your modifications to your GitHub fork.
+7. **Open a Pull Request**: Open a pull request from your fork to the main Amphion repository using our [Pull Request Template](pull_request_template.md).
+8. **Participate in Code Reviews**: Collaborate with reviewers and address their feedback.
+
+## Coding Standards
+- **License Headers**: Each new code file should include a license header.
+- **Style Consistency**: Align with the project's existing coding style.
+- **Code Quality**: Aim for clarity, maintainability, and efficiency.
+- **Clear Commenting**: Describe the purpose and usage of each function and other crucial code segments.
+- **Code Formatting**:
+  - Install the 'black' formatter: `pip install black`.
+  - Format files: `black file.py`.
+  - Format directories: `black directory/`.
+
+## Contributor Agreement
+By contributing to Amphion, you agree to abide by our Code of Conduct and the Developer Certificate of Origin, Version 1.1:
+
+```
+Developer Certificate of Origin
+Version 1.1
+
+Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
+
+Everyone is permitted to copy and distribute verbatim copies of this
+license document, but changing it is not allowed.
+
+
+Developer's Certificate of Origin 1.1
+
+By making a contribution to this project, I certify that:
+
+(a) The contribution was created in whole or in part by me and I
+    have the right to submit it under the open source license
+    indicated in the file; or
+
+(b) The contribution is based upon previous work that, to the best
+    of my knowledge, is covered under an appropriate open source
+    license and I have the right under that license to submit that
+    work with modifications, whether created in whole or in part
+    by me, under the same open source license (unless I am
+    permitted to submit under a different license), as indicated
+    in the file; or
+
+(c) The contribution was provided directly to me by some other
+    person who certified (a), (b) or (c) and I have not modified
+    it.
+
+(d) I understand and agree that this project and the contribution
+    are public and that a record of the contribution (including all
+    personal information I submit with it, including my sign-off) is
+    maintained indefinitely and may be redistributed consistent with
+    this project or the open source license(s) involved.
+```
+
+## Need Help?
+For any queries or support, feel free to open an issue for community discussion and help.
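The "How to Contribute" steps in the file above map onto a short command sequence. A minimal sketch, where the username, branch name, and file path are placeholders:

```bash
# Steps 1-2: fork on GitHub, then clone your fork locally
# (YOUR_USERNAME is a placeholder).
git clone https://github.com/YOUR_USERNAME/Amphion.git
cd Amphion

# Step 3: create a branch for your change (branch name is hypothetical).
git checkout -b my-feature-branch

# Steps 4-5: test locally, then make small, focused commits.
git add path/to/changed_file.py
git commit -m "Describe the change concisely"

# Step 6: push the branch to your fork. Step 7 (opening the pull
# request against open-mmlab/Amphion) then happens on GitHub.
git push origin my-feature-branch
```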
Amphion/.github/ISSUE_TEMPLATE/bug_report.md
ADDED
@@ -0,0 +1,32 @@
+---
+name: Bug report
+about: Create a report to help us improve Amphion.
+title: "[BUG]: "
+labels: 'bug'
+assignees: ''
+
+---
+
+## Describe the bug
+(A clear and concise description of what the bug is.)
+
+## How To Reproduce
+Steps to reproduce the behavior:
+1. Config/File changes: ...
+2. Run command: ...
+3. See error: ...
+
+## Expected behavior
+(A clear and concise description of what you expected to happen.)
+
+## Screenshots
+(If applicable, add screenshots to help explain your problem.)
+
+## Environment Information
+- Operating System: [e.g. Ubuntu 20.04.5 LTS]
+- Python Version: [e.g. Python 3.9.15]
+- Driver & CUDA Version: [e.g. Driver 470.103.01 & CUDA 11.4]
+- Error Messages and Logs: [If applicable, provide any error messages or relevant log outputs]
+
+## Additional context
+(Add any other context about the problem here.)
Amphion/.github/ISSUE_TEMPLATE/docs_feedback.md
ADDED
@@ -0,0 +1,17 @@
+---
+name: Docs feedback
+about: Improve documentation about Amphion.
+title: "[Docs]: "
+labels: 'documentation'
+assignees: ''
+
+---
+
+## Documentation Reference
+(Path/link to the documentation file)
+
+## Feedback on documentation
+(Your suggestions for the documentation, e.g., accuracy, overly complex explanations, structural organization, practical examples, technical reliability, and consistency)
+
+## Additional context
+(Add any other context or screenshots about the documentation here.)
Amphion/.github/ISSUE_TEMPLATE/feature_request.md
ADDED
@@ -0,0 +1,20 @@
+---
+name: Feature request
+about: Suggest an idea for Amphion.
+title: "[Feature]: "
+labels: 'enhancement'
+assignees: ''
+
+---
+
+## Is your feature request related to a problem? Please describe.
+(A clear and concise description of what the problem is.)
+
+## Describe the solution you'd like
+(A clear and concise description of what you want to happen.)
+
+## Describe alternatives you've considered
+(A clear and concise description of any alternative solutions or features you've considered.)
+
+## Additional context
+(Add any other context or screenshots about the feature request here.)
Amphion/.github/ISSUE_TEMPLATE/help_wanted.md
ADDED
@@ -0,0 +1,32 @@
+---
+name: Help wanted
+about: Request help from the Amphion team.
+title: "[Help]: "
+labels: 'help wanted'
+assignees: ''
+
+---
+
+## Problem Overview
+(Briefly and clearly describe the issue you're facing and seeking help with.)
+
+## Steps Taken
+(Detail your attempts to resolve the issue, including any relevant steps or processes.)
+1. Config/File changes: ...
+2. Run command: ...
+3. See errors: ...
+
+## Expected Outcome
+(A clear and concise description of what you expected to happen.)
+
+## Screenshots
+(If applicable, add screenshots to help explain your problem.)
+
+## Environment Information
+- Operating System: [e.g. Ubuntu 20.04.5 LTS]
+- Python Version: [e.g. Python 3.9.15]
+- Driver & CUDA Version: [e.g. Driver 470.103.01 & CUDA 11.4]
+- Error Messages and Logs: [If applicable, provide any error messages or relevant log outputs]
+
+## Additional context
+(Add any other context about the problem here.)
Amphion/.github/pull_request_template.md
ADDED
@@ -0,0 +1,32 @@
+
+## ✨ Description
+
+[Please describe the background, purpose, changes made, and how to test this PR]
+
+## 🚧 Related Issues
+
+[List the issue numbers related to this PR]
+
+## 👨‍💻 Changes Proposed
+
+- [ ] change1
+- [ ] ...
+
+## 🧑‍🤝‍🧑 Who Can Review?
+
+[Please use the '@' symbol to mention any community member who is free to review the PR once the tests have passed. Feel free to tag members or contributors who might be interested in your PR.]
+
+## 📝 TODO
+
+- [ ] task1
+- [ ] ...
+
+## ✅ Checklist
+
+- [ ] Code has been reviewed
+- [ ] Code complies with the project's code standards and best practices
+- [ ] Code has passed all tests
+- [ ] Code does not affect the normal use of existing features
+- [ ] Code has been commented properly
+- [ ] Documentation has been updated (if applicable)
+- [ ] Demo/checkpoint has been attached (if applicable)
Amphion/.github/workflows/check_format.yml
ADDED
@@ -0,0 +1,12 @@
+name: Check Format
+
+on: [push, pull_request]
+
+jobs:
+  CheckCodeFormat:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - uses: psf/black@stable
+        with:
+          options: "--check --diff --color"
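This workflow runs `black` in check mode on every push and pull request. The same check can be reproduced locally before pushing; a minimal sketch:

```bash
# Install the formatter used by the CI job.
pip install black

# Run exactly the check the workflow performs: report formatting
# violations (with a colored diff) without modifying any files.
black --check --diff --color .

# If the check fails, apply the formatting in place and re-run it.
black .
```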
Amphion/.gitignore
ADDED
@@ -0,0 +1,64 @@
+# Mac OS files
+.DS_Store
+
+# IDEs
+.idea
+.vs
+.vscode
+.cache
+pyrightconfig.json
+
+# GitHub files
+.github
+
+# Byte-compiled / optimized / DLL / cached files
+__pycache__/
+*.py[cod]
+*$py.class
+*.pyc
+.temp
+*.c
+*.so
+*.o
+
+# Developing mode
+_*.sh
+_*.json
+*.lst
+yard*
+*.out
+evaluation/evalset_selection
+mfa
+egs/svc/*wavmark
+egs/svc/custom
+egs/svc/*/dev*
+egs/svc/dev_exp_config.json
+egs/svc/dev
+bins/svc/demo*
+bins/svc/preprocess_custom.py
+data
+ckpts
+
+# Data and ckpt
+*.pkl
+*.pt
+*.npy
+*.npz
+*.tar.gz
+*.ckpt
+*.wav
+*.flac
+pretrained/wenet/*conformer_exp
+pretrained/bigvgan/args.json
+!egs/tts/VALLE/prompt_examples/*.wav
+
+# Runtime data dirs
+processed_data
+data
+model_ckpt
+logs
+*.lst
+source_audio
+result
+conversion_results
+get_available_gpu.py
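One detail worth noting above: the negation rule `!egs/tts/VALLE/prompt_examples/*.wav` re-includes the VALL-E prompt examples even though `*.wav` is ignored globally. When it is unclear which rule applies to a path, `git check-ignore -v` reports the file, line, and pattern that matched; a minimal sketch (the path and line number are illustrative):

```bash
# -v prints the source file, line number, and pattern that matched.
git check-ignore -v samples/output.wav
# Example output:
#   .gitignore:49:*.wav	samples/output.wav
```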
Amphion/Dockerfile
ADDED
@@ -0,0 +1,64 @@
+# Copyright (c) 2023 Amphion.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+# Other version: https://hub.docker.com/r/nvidia/cuda/tags
+FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu18.04
+
+ARG DEBIAN_FRONTEND=noninteractive
+ARG PYTORCH='2.0.0'
+ARG CUDA='cu118'
+ARG SHELL='/bin/bash'
+ARG MINICONDA='Miniconda3-py39_23.3.1-0-Linux-x86_64.sh'
+
+ENV LANG=en_US.UTF-8 PYTHONIOENCODING=utf-8 PYTHONDONTWRITEBYTECODE=1 CUDA_HOME=/usr/local/cuda CONDA_HOME=/opt/conda SHELL=${SHELL}
+ENV PATH=$CONDA_HOME/bin:$CUDA_HOME/bin:$PATH \
+    LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH \
+    LIBRARY_PATH=$CUDA_HOME/lib64:$LIBRARY_PATH \
+    CONDA_PREFIX=$CONDA_HOME \
+    NCCL_HOME=$CUDA_HOME
+
+# Install ubuntu packages
+RUN sed -i 's/archive.ubuntu.com/mirrors.cloud.tencent.com/g' /etc/apt/sources.list \
+    && sed -i 's/security.ubuntu.com/mirrors.cloud.tencent.com/g' /etc/apt/sources.list \
+    && rm /etc/apt/sources.list.d/cuda.list \
+    && apt-get update \
+    && apt-get -y install \
+    python3-pip ffmpeg git less wget libsm6 libxext6 libxrender-dev \
+    build-essential cmake pkg-config libx11-dev libatlas-base-dev \
+    libgtk-3-dev libboost-python-dev vim libgl1-mesa-glx \
+    libaio-dev software-properties-common tmux \
+    espeak-ng
+
+# Install miniconda with python 3.9
+USER root
+# COPY Miniconda3-py39_23.3.1-0-Linux-x86_64.sh /root/anaconda.sh
+RUN wget -t 0 -c -O /tmp/anaconda.sh https://repo.anaconda.com/miniconda/${MINICONDA} \
+    && mv /tmp/anaconda.sh /root/anaconda.sh \
+    && ${SHELL} /root/anaconda.sh -b -p $CONDA_HOME \
+    && rm /root/anaconda.sh
+
+RUN conda create -y --name amphion python=3.9.15
+
+WORKDIR /app
+COPY env.sh env.sh
+RUN chmod +x ./env.sh
+
+RUN ["conda", "run", "-n", "amphion", "-vvv", "--no-capture-output", "./env.sh"]
+
+RUN conda init \
+    && echo "\nconda activate amphion\n" >> ~/.bashrc
+
+CMD ["/bin/bash"]
+
+# *** Build ***
+# docker build -t realamphion/amphion .
+
+# *** Run ***
+# cd Amphion
+# docker run --runtime=nvidia --gpus all -it -v .:/app -v /mnt:/mnt_host realamphion/amphion
+
+# *** Push and release ***
+# docker login
+# docker push realamphion/amphion
Amphion/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2023 Amphion
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
Amphion/README.md
ADDED
@@ -0,0 +1,192 @@
+# Amphion: An Open-Source Audio, Music, and Speech Generation Toolkit
+
+<div>
+    <a href="https://arxiv.org/abs/2312.09911"><img src="https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg"></a>
+    <a href="https://huggingface.co/amphion"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Amphion-pink"></a>
+    <a href="https://modelscope.cn/organization/amphion"><img src="https://img.shields.io/badge/ModelScope-Amphion-cyan"></a>
+    <a href="https://openxlab.org.cn/usercenter/Amphion"><img src="https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg"></a>
+    <a href="https://discord.com/invite/drhW7ajqAG"><img src="https://img.shields.io/badge/Discord-Join%20chat-blue.svg"></a>
+    <a href="egs/tts/README.md"><img src="https://img.shields.io/badge/README-TTS-blue"></a>
+    <a href="models/vc/vevo/README.md"><img src="https://img.shields.io/badge/README-VC-blue"></a>
+    <a href="models/vc/vevo/README.md"><img src="https://img.shields.io/badge/README-AC-blue"></a>
+    <a href="egs/svc/README.md"><img src="https://img.shields.io/badge/README-SVC-blue"></a>
+    <a href="egs/tta/README.md"><img src="https://img.shields.io/badge/README-TTA-blue"></a>
+    <a href="egs/vocoder/README.md"><img src="https://img.shields.io/badge/README-Vocoder-purple"></a>
+    <a href="egs/metrics/README.md"><img src="https://img.shields.io/badge/README-Evaluation-yellow"></a>
+    <a href="LICENSE"><img src="https://img.shields.io/badge/LICENSE-MIT-red"></a>
+    <a href="https://trendshift.io/repositories/5469" target="_blank"><img src="https://trendshift.io/api/badge/repositories/5469" alt="open-mmlab%2FAmphion | Trendshift" style="width: 150px; height: 33px;" width="150" height="33"/></a>
+</div>
+<br>
+
+**Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation.** Its purpose is to support reproducible research and to help junior researchers and engineers get started in audio, music, and speech generation research and development. Amphion offers a unique feature: **visualizations** of classic models and architectures. We believe these visualizations are beneficial for junior researchers and engineers who wish to gain a better understanding of the models.
+
+**The North-Star objective of Amphion is to offer a platform for studying the conversion of any inputs into audio.** Amphion is designed to support individual generation tasks, including but not limited to:
+
+- **TTS**: Text to Speech (⏳ supported)
+- **SVS**: Singing Voice Synthesis (👨‍💻 developing)
+- **VC**: Voice Conversion (⏳ supported)
+- **AC**: Accent Conversion (⏳ supported)
+- **SVC**: Singing Voice Conversion (⏳ supported)
+- **TTA**: Text to Audio (⏳ supported)
+- **TTM**: Text to Music (👨‍💻 developing)
+- more…
+
+In addition to the specific generation tasks, Amphion includes several **vocoders** and **evaluation metrics**. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for ensuring consistent measurements across generation tasks. Moreover, Amphion is dedicated to advancing audio generation in real-world applications, such as building **large-scale datasets** for speech synthesis.
+
+## 🚀 News
+- **2024/12/22**: We release the reproduction of **Vevo**, a zero-shot voice imitation framework with controllable timbre and style. Vevo can be applied to a series of speech generation tasks, including VC, TTS, AC, and more. The released pretrained models are trained on the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset and achieve SOTA zero-shot VC performance. [![arXiv](https://img.shields.io/badge/OpenReview-Paper-COLOR.svg)](https://openreview.net/pdf?id=anQDiQZhDP) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/Vevo) [![WebPage](https://img.shields.io/badge/WebPage-Demo-red)](https://versavoice.github.io/) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/vc/vevo/README.md)
+- **2024/10/19**: We release **MaskGCT**, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision. MaskGCT is trained on the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset and achieves SOTA zero-shot TTS performance. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2409.00750) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/maskgct) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/maskgct) [![ModelScope](https://img.shields.io/badge/ModelScope-space-purple)](https://modelscope.cn/studios/amphion/maskgct) [![ModelScope](https://img.shields.io/badge/ModelScope-model-cyan)](https://modelscope.cn/models/amphion/MaskGCT) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/tts/maskgct/README.md)
+- **2024/09/01**: [Amphion](https://arxiv.org/abs/2312.09911), [Emilia](https://arxiv.org/abs/2407.05361), and [DSFF-SVC](https://arxiv.org/abs/2310.11160) were accepted by IEEE SLT 2024! 🤗
+- **2024/08/28**: Welcome to join Amphion's [Discord channel](https://discord.com/invite/drhW7ajqAG) to stay connected and engage with our community!
+- **2024/08/27**: *The Emilia dataset is now publicly available!* Discover the most extensive and diverse speech generation dataset, with 101k hours of in-the-wild speech data, at [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/amphion/Emilia-Dataset) or [![OpenDataLab](https://img.shields.io/badge/OpenDataLab-Dataset-blue)](https://opendatalab.com/Amphion/Emilia)! 🎉🎉🎉
+- **2024/08/20**: [SingVisio](https://arxiv.org/abs/2402.12660) was accepted by Computers & Graphics, [available here](https://www.sciencedirect.com/science/article/pii/S0097849324001936)! 🎉
+- **2024/07/01**: Amphion now releases **Emilia**, the first open-source multilingual in-the-wild dataset for speech generation with over 101k hours of speech data, and **Emilia-Pipe**, the first open-source preprocessing pipeline designed to transform in-the-wild speech data into high-quality training data with annotations for speech generation! [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2407.05361) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/amphion/Emilia) [![demo](https://img.shields.io/badge/WebPage-Demo-red)](https://emilia-dataset.github.io/Emilia-Demo-Page/) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](preprocessors/Emilia/README.md)
+- **2024/06/17**: Amphion has a new release of its **VALL-E** model! It uses Llama as its underlying architecture and offers better model performance, faster training speed, and more readable code than our first version. [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/tts/VALLE_V2/README.md)
+- **2024/03/12**: Amphion now supports **NaturalSpeech3 FACodec** and releases pretrained checkpoints. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2403.03100) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/naturalspeech3_facodec) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/naturalspeech3_facodec) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/codec/ns3_codec/README.md)
+- **2024/02/22**: The first Amphion visualization tool, **SingVisio**, is released. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2402.12660) [![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/SingVisio) [![Video](https://img.shields.io/badge/Video-Demo-orange)](https://drive.google.com/file/d/15097SGhQh-SwUNbdWDYNyWEP--YGLba5/view) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/visualization/SingVisio/README.md)
+- **2023/12/18**: Amphion v0.1 release. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2312.09911) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Amphion-pink)](https://huggingface.co/amphion) [![youtube](https://img.shields.io/badge/YouTube-Demo-red)](https://www.youtube.com/watch?v=1aw0HhcggvQ) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](https://github.com/open-mmlab/Amphion/pull/39)
+- **2023/11/28**: Amphion alpha release. [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](https://github.com/open-mmlab/Amphion/pull/2)
+
+## ⭐ Key Features
+
+### TTS: Text to Speech
+
+- Amphion achieves state-of-the-art performance compared to existing open-source repositories on text-to-speech (TTS) systems. It supports the following models or architectures:
+  - [FastSpeech2](https://arxiv.org/abs/2006.04558): A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks. [![code](https://img.shields.io/badge/README-Code-blue)](egs/tts/FastSpeech2/README.md)
+  - [VITS](https://arxiv.org/abs/2106.06103): An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning. [![code](https://img.shields.io/badge/README-Code-blue)](egs/tts/VITS/README.md)
+  - [VALL-E](https://arxiv.org/abs/2301.02111): A zero-shot TTS architecture that uses a neural codec language model with discrete codes. [![code](https://img.shields.io/badge/README-Code-blue)](egs/tts/VALLE_V2/README.md)
+  - [NaturalSpeech2](https://arxiv.org/abs/2304.09116): An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices. [![code](https://img.shields.io/badge/README-Code-blue)](egs/tts/NaturalSpeech2/README.md)
+  - [Jets](egs/tts/Jets/README.md): An end-to-end TTS model that jointly trains FastSpeech2 and HiFi-GAN with an alignment module. [![code](https://img.shields.io/badge/README-Code-blue)](egs/tts/Jets/README.md)
+  - [MaskGCT](https://arxiv.org/abs/2409.00750): A fully non-autoregressive TTS architecture that eliminates the need for explicit alignment information between text and speech supervision. [![code](https://img.shields.io/badge/README-Code-blue)](models/tts/maskgct/README.md)
+  - [Vevo-TTS](https://openreview.net/pdf?id=anQDiQZhDP): A zero-shot TTS architecture with controllable timbre and style. It consists of an autoregressive transformer and a flow-matching transformer. [![code](https://img.shields.io/badge/README-Code-blue)](models/vc/vevo/README.md)
+
+### VC: Voice Conversion
+
+Amphion supports the following voice conversion models:
+
+- [Vevo](https://openreview.net/pdf?id=anQDiQZhDP): A zero-shot voice imitation framework with controllable timbre and style. **Vevo-Timbre** conducts style-preserved voice conversion, and **Vevo-Voice** conducts style-converted voice conversion. [![code](https://img.shields.io/badge/README-Code-blue)](models/vc/vevo/README.md)
+- [FACodec](https://arxiv.org/abs/2403.03100): FACodec decomposes speech into subspaces representing different attributes such as content, prosody, and timbre. It can achieve zero-shot voice conversion. [![code](https://img.shields.io/badge/README-Code-blue)](https://huggingface.co/amphion/naturalspeech3_facodec)
+- [Noro](https://arxiv.org/abs/2411.19770): A **noise-robust** zero-shot voice conversion system. Noro introduces innovative components tailored for VC using noisy reference speech, including a dual-branch reference encoding module and a noise-agnostic contrastive speaker loss. [![code](https://img.shields.io/badge/README-Code-blue)](egs/vc/Noro/README.md)
+
+### AC: Accent Conversion
+
+- Amphion supports AC with [Vevo-Style](models/vc/vevo/README.md). In particular, it can conduct accent conversion in a zero-shot manner. [![code](https://img.shields.io/badge/README-Code-blue)](models/vc/vevo/README.md)
+
+### SVC: Singing Voice Conversion
+
+- Amphion supports multiple content-based features from various pretrained models, including [WeNet](https://github.com/wenet-e2e/wenet), [Whisper](https://github.com/openai/whisper), and [ContentVec](https://github.com/auspicious3000/contentvec). Their specific roles in SVC have been investigated in our SLT 2024 paper. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2310.11160) [![code](https://img.shields.io/badge/README-Code-blue)](egs/svc/MultipleContentsSVC)
+- Amphion implements several state-of-the-art model architectures, including diffusion-, transformer-, VAE-, and flow-based models. The diffusion-based architecture uses a [Bidirectional dilated CNN](https://openreview.net/pdf?id=a-xFK8Ymz5J) as a backend and supports several sampling algorithms such as [DDPM](https://arxiv.org/pdf/2006.11239.pdf), [DDIM](https://arxiv.org/pdf/2010.02502.pdf), and [PNDM](https://arxiv.org/pdf/2202.09778.pdf). Additionally, it supports single-step inference based on the [Consistency Model](https://openreview.net/pdf?id=FmqFfMTNnv). [![code](https://img.shields.io/badge/README-Code-blue)](egs/svc/DiffComoSVC/README.md)
+
+### TTA: Text to Audio
+
+- Amphion supports TTA with a latent diffusion model, designed like [AudioLDM](https://arxiv.org/abs/2301.12503), [Make-an-Audio](https://arxiv.org/abs/2301.12661), and [AUDIT](https://arxiv.org/abs/2304.00830). It is also the official implementation of the text-to-audio generation part of our NeurIPS 2023 paper. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2304.00830) [![code](https://img.shields.io/badge/README-Code-blue)](egs/tta/RECIPE.md)
+
+### Vocoder
+
+- Amphion supports various widely used neural vocoders, including:
+  - GAN-based vocoders: [MelGAN](https://arxiv.org/abs/1910.06711), [HiFi-GAN](https://arxiv.org/abs/2010.05646), [NSF-HiFiGAN](https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts), [BigVGAN](https://arxiv.org/abs/2206.04658), [APNet](https://arxiv.org/abs/2305.07952).
+  - Flow-based vocoders: [WaveGlow](https://arxiv.org/abs/1811.00002).
+  - Diffusion-based vocoders: [DiffWave](https://arxiv.org/abs/2009.09761).
+  - Autoregressive vocoders: [WaveNet](https://arxiv.org/abs/1609.03499), [WaveRNN](https://arxiv.org/abs/1802.08435v1).
+- Amphion provides the official implementation of the [Multi-Scale Constant-Q Transform Discriminator](https://arxiv.org/abs/2311.14957) (our ICASSP 2024 paper). It can be used to enhance any GAN-based vocoder architecture during training while keeping the inference stage (e.g., memory usage and speed) unchanged. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2311.14957) [![code](https://img.shields.io/badge/README-Code-blue)](egs/vocoder/gan/tfr_enhanced_hifigan)
+
+### Evaluation
+
+Amphion provides a comprehensive objective evaluation of the generated audio. [![code](https://img.shields.io/badge/README-Code-blue)](egs/metrics/README.md)
+
+The supported evaluation metrics include:
+
+- **F0 Modeling**: F0 Pearson Coefficients, F0 Periodicity Root Mean Square Error, F0 Root Mean Square Error, Voiced/Unvoiced F1 Score, etc.
+- **Energy Modeling**: Energy Root Mean Square Error, Energy Pearson Coefficients, etc.
+- **Intelligibility**: Character/Word Error Rate, which can be calculated based on [Whisper](https://github.com/openai/whisper) and more.
+- **Spectrogram Distortion**: Frechet Audio Distance (FAD), Mel Cepstral Distortion (MCD), Multi-Resolution STFT Distance (MSTFT), Perceptual Evaluation of Speech Quality (PESQ), Short Time Objective Intelligibility (STOI), etc.
+- **Speaker Similarity**: Cosine similarity, which can be calculated based on [RawNet3](https://github.com/Jungjee/RawNet), [Resemblyzer](https://github.com/resemble-ai/Resemblyzer), [WeSpeaker](https://github.com/wenet-e2e/wespeaker), [WavLM](https://github.com/microsoft/unilm/tree/master/wavlm), and more.
+
+### Datasets
+
+- Amphion unifies the data preprocessing of open-source datasets including [AudioCaps](https://audiocaps.github.io/), [LibriTTS](https://www.openslr.org/60/), [LJSpeech](https://keithito.com/LJ-Speech-Dataset/), [M4Singer](https://github.com/M4Singer/M4Singer), [Opencpop](https://wenet.org.cn/opencpop/), [OpenSinger](https://github.com/Multi-Singer/Multi-Singer.github.io), [SVCC](http://vc-challenge.org/), [VCTK](https://datashare.ed.ac.uk/handle/10283/3443), and more. The supported dataset list can be seen [here](egs/datasets/README.md) (updating).
+- Amphion (exclusively) supports the [**Emilia**](preprocessors/Emilia/README.md) dataset and its preprocessing pipeline **Emilia-Pipe** for in-the-wild speech data!
+
+### Visualization
+
+Amphion provides visualization tools that interactively illustrate the internal processing mechanisms of classic models. This provides an invaluable resource for educational purposes and for making research more accessible.
+
+Currently, Amphion supports [SingVisio](egs/visualization/SingVisio/README.md), a visualization tool for the diffusion model in singing voice conversion. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2402.12660) [![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/SingVisio) [![Video](https://img.shields.io/badge/Video-Demo-orange)](https://drive.google.com/file/d/15097SGhQh-SwUNbdWDYNyWEP--YGLba5/view)
+
+
+## 📀 Installation
+
+Amphion can be installed through either the Setup Installer or the Docker Image.
+
+### Setup Installer
+
+```bash
+git clone https://github.com/open-mmlab/Amphion.git
+cd Amphion
+
+# Install Python Environment
+conda create --name amphion python=3.9.15
+conda activate amphion
+
+# Install Python Packages Dependencies
+sh env.sh
+```
+
+### Docker Image
+
+1. Install [Docker](https://docs.docker.com/get-docker/), [NVIDIA Driver](https://www.nvidia.com/download/index.aspx), [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html), and [CUDA](https://developer.nvidia.com/cuda-downloads).
+
+2. Run the following commands:
+```bash
+git clone https://github.com/open-mmlab/Amphion.git
+cd Amphion

+docker pull realamphion/amphion
+docker run --runtime=nvidia --gpus all -it -v .:/app realamphion/amphion
+```
+Mounting the dataset with the `-v` argument is necessary when using Docker. Please refer to [Mount dataset in Docker container](egs/datasets/docker.md) and the [Docker Docs](https://docs.docker.com/engine/reference/commandline/container_run/#volume) for more details.
+
+
+## 🐍 Usage in Python
+
+We detail the instructions for the different tasks in the following recipes:
+
+- [Text to Speech (TTS)](egs/tts/README.md)
+- [Voice Conversion (VC)](models/vc/vevo/README.md)
+- [Accent Conversion (AC)](models/vc/vevo/README.md)
+- [Singing Voice Conversion (SVC)](egs/svc/README.md)
+- [Text to Audio (TTA)](egs/tta/README.md)
+- [Vocoder](egs/vocoder/README.md)
+- [Evaluation](egs/metrics/README.md)
+- [Visualization](egs/visualization/README.md)
+
+## 👨‍💻 Contributing
+We appreciate all contributions that improve Amphion. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guidelines.
+
+## 🙏 Acknowledgement
+
+
+- [ming024's FastSpeech2](https://github.com/ming024/FastSpeech2) and [jaywalnut310's VITS](https://github.com/jaywalnut310/vits) for model architecture code.
+- [lifeiteng's VALL-E](https://github.com/lifeiteng/vall-e) for training pipeline and model architecture design.
+- [SpeechTokenizer](https://github.com/ZhangXInFD/SpeechTokenizer) for semantic-distilled tokenizer design.
+- [WeNet](https://github.com/wenet-e2e/wenet), [Whisper](https://github.com/openai/whisper), [ContentVec](https://github.com/auspicious3000/contentvec), and [RawNet3](https://github.com/Jungjee/RawNet) for pretrained models and inference code.
+- [HiFi-GAN](https://github.com/jik876/hifi-gan) for the GAN-based vocoder's architecture design and training strategy.
+- [Encodec](https://github.com/facebookresearch/encodec) for the well-organized GAN discriminator architecture and basic blocks.
+- [Latent Diffusion](https://github.com/CompVis/latent-diffusion) for model architecture design.
+- [TensorFlowTTS](https://github.com/TensorSpeech/TensorFlowTTS) for preparing the MFA tools.
+
+
+## ©️ License
+
+Amphion is under the [MIT License](LICENSE). It is free for both research and commercial use cases.
+
+## 📚 Citations
+
+```bibtex
+@inproceedings{amphion,
+    author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
+    title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
+    booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
+    year={2024}
+}
+```
Amphion/bins/calc_metrics.py
ADDED
@@ -0,0 +1,268 @@
+# Copyright (c) 2023 Amphion.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+import os
+import sys
+import numpy as np
+import json
+import argparse
+import whisper
+import torch
+
+from glob import glob
+from tqdm import tqdm
+from collections import defaultdict
+
+
+from evaluation.metrics.energy.energy_rmse import extract_energy_rmse
+from evaluation.metrics.energy.energy_pearson_coefficients import (
+    extract_energy_pearson_coeffcients,
+)
+from evaluation.metrics.f0.f0_pearson_coefficients import extract_fpc
+from evaluation.metrics.f0.f0_periodicity_rmse import extract_f0_periodicity_rmse
+from evaluation.metrics.f0.f0_rmse import extract_f0rmse
+from evaluation.metrics.f0.v_uv_f1 import extract_f1_v_uv
+from evaluation.metrics.intelligibility.character_error_rate import extract_cer
+from evaluation.metrics.intelligibility.word_error_rate import extract_wer
+from evaluation.metrics.similarity.speaker_similarity import extract_similarity
+from evaluation.metrics.spectrogram.frechet_distance import extract_fad
+from evaluation.metrics.spectrogram.mel_cepstral_distortion import extract_mcd
+from evaluation.metrics.spectrogram.multi_resolution_stft_distance import extract_mstft
+from evaluation.metrics.spectrogram.pesq import extract_pesq
+from evaluation.metrics.spectrogram.scale_invariant_signal_to_distortion_ratio import (
+    extract_si_sdr,
+)
+from evaluation.metrics.spectrogram.scale_invariant_signal_to_noise_ratio import (
+    extract_si_snr,
+)
+from evaluation.metrics.spectrogram.short_time_objective_intelligibility import (
+    extract_stoi,
+)
+
+METRIC_FUNC = {
+    "energy_rmse": extract_energy_rmse,
+    "energy_pc": extract_energy_pearson_coeffcients,
+    "fpc": extract_fpc,
+    "f0_periodicity_rmse": extract_f0_periodicity_rmse,
+    "f0rmse": extract_f0rmse,
+    "v_uv_f1": extract_f1_v_uv,
+    "cer": extract_cer,
+    "wer": extract_wer,
+    "similarity": extract_similarity,
+    "fad": extract_fad,
+    "mcd": extract_mcd,
+    "mstft": extract_mstft,
+    "pesq": extract_pesq,
+    "si_sdr": extract_si_sdr,
+    "si_snr": extract_si_snr,
+    "stoi": extract_stoi,
+}
+
+
+def calc_metric(
+    ref_dir,
+    deg_dir,
+    dump_dir,
+    metrics,
+    **kwargs,
+):
+    result = defaultdict()
+
+    for metric in tqdm(metrics):
+        # Folder-level metrics operate on the two directories directly.
+        if metric in ["fad", "similarity"]:
+            result[metric] = str(METRIC_FUNC[metric](ref_dir, deg_dir, kwargs=kwargs))
+            continue
+
+        audios_ref = []
+        audios_deg = []
+
+        files = glob(deg_dir + "/*.wav")
+
+        # Pair each degraded/generated file with its reference by UID.
+        for file in files:
+            audios_deg.append(file)
+            uid = file.split("/")[-1].split(".wav")[0]
+            file_gt = ref_dir + "/{}.wav".format(uid)
+            audios_ref.append(file_gt)
+
+        # For WER/CER against ground-truth text, load and normalize transcriptions.
+        if metric in ["wer", "cer"] and kwargs["intelligibility_mode"] == "gt_content":
+            ltr_path = kwargs["ltr_path"]
+            tmpltrs = {}
+            with open(ltr_path, "r") as f:
+                for line in f:
+                    paras = line.replace("\n", "").split("|")
+                    paras[1] = paras[1].replace(" ", "")
+                    paras[1] = paras[1].replace(".", "")
+                    paras[1] = paras[1].replace("'", "")
+                    paras[1] = paras[1].replace("-", "")
+                    paras[1] = paras[1].replace(",", "")
+                    paras[1] = paras[1].replace("!", "")
+                    paras[1] = paras[1].lower()
+                    tmpltrs[paras[0]] = paras[1]
+            ltrs = []
+            files = glob(ref_dir + "/*.wav")
+            for file in files:
+                ltrs.append(tmpltrs[os.path.basename(file)])
+
+        if metric in ["v_uv_f1"]:
+            # Voiced/unvoiced F1 is accumulated over all pairs, then combined.
+            tp_total = 0
+            fp_total = 0
+            fn_total = 0
+
+            for i in tqdm(range(len(audios_ref))):
+                audio_ref = audios_ref[i]
+                audio_deg = audios_deg[i]
+                tp, fp, fn = METRIC_FUNC[metric](audio_ref, audio_deg, kwargs=kwargs)
+                tp_total += tp
+                fp_total += fp
+                fn_total += fn
+
+            result[metric] = str(tp_total / (tp_total + (fp_total + fn_total) / 2))
+        else:
+            scores = []
+            for i in tqdm(range(len(audios_ref))):
+                audio_ref = audios_ref[i]
+                audio_deg = audios_deg[i]
+
+                if metric in ["wer", "cer"]:
+                    model = whisper.load_model("large")
+                    mode = kwargs["intelligibility_mode"]
+                    if torch.cuda.is_available():
+                        device = torch.device("cuda")
+                        model = model.to(device)
+
+                    if mode == "gt_audio":
+                        kwargs["audio_ref"] = audio_ref
+                        kwargs["audio_deg"] = audio_deg
+                        score = METRIC_FUNC[metric](
+                            model,
+                            kwargs=kwargs,
+                        )
+                    elif mode == "gt_content":
+                        kwargs["content_gt"] = ltrs[i]
+                        kwargs["audio_deg"] = audio_deg
+                        score = METRIC_FUNC[metric](
+                            model,
+                            kwargs=kwargs,
+                        )
+                else:
+                    score = METRIC_FUNC[metric](
+                        audio_ref,
+                        audio_deg,
+                        kwargs=kwargs,
+                    )
+                if not np.isnan(score):
+                    scores.append(score)
+
+            scores = np.array(scores)
+            result["{}".format(metric)] = str(np.mean(scores))
+
+    data = json.dumps(result, indent=4)
+
+    with open(os.path.join(dump_dir, "result.json"), "w", newline="\n") as f:
+        f.write(data)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--ref_dir",
+        type=str,
+        help="Path to the reference audio folder.",
+    )
+    parser.add_argument(
+        "--deg_dir",
+        type=str,
+        help="Path to the test audio folder.",
+    )
+    parser.add_argument(
+        "--dump_dir",
+        type=str,
+        help="Path to dump the results.",
+    )
+    parser.add_argument(
+        "--metrics",
+        nargs="+",
+        help="Metrics used to evaluate.",
+    )
+    parser.add_argument(
+        "--fs",
+        type=str,
+        default="None",
+        help="(Optional) Sampling rate",
+    )
+    parser.add_argument(
+        "--align_method",
+        type=str,
+        default="dtw",
+        help="(Optional) Method for aligning feature length. ['cut', 'dtw']",
+    )
+
+    parser.add_argument(
+        "--db_scale",
+        type=str,
+        default="True",
+        help="(Optional) Whether or not to compute energy-related metrics in dB scale.",
+    )
+    parser.add_argument(
+        "--f0_subtract_mean",
+        type=str,
+        default="True",
+        help="(Optional) Whether or not to compute f0-related metrics with the mean value subtracted.",
+    )
+
+    parser.add_argument(
+        "--similarity_model",
+        type=str,
+        default="wavlm",
+        help="(Optional) The model for computing speaker similarity. ['rawnet', 'wavlm', 'resemblyzer']",
+    )
+    parser.add_argument(
+        "--similarity_mode",
+        type=str,
+        default="pairwith",
+        help="(Optional) The method of calculating similarity: 'overall' computes the speaker \
+            similarity between two folders of audio freely, while 'pairwith' computes the speaker \
+            similarity between a series of paired gt/pred audio files.",
+    )
+
+    parser.add_argument(
+        "--ltr_path",
+        type=str,
+        default="None",
+        help="(Optional) Path to the transcription file. Note that each line of the transcription \
+            file has the format 'file name|transcription'.",
+    )
+    parser.add_argument(
+        "--intelligibility_mode",
+        type=str,
+        default="gt_audio",
+        help="(Optional) The method of calculating WER and CER: 'gt_audio' uses the recognized \
+            content of the reference audio as the target, while 'gt_content' uses the provided \
+            transcription as the target.",
+    )
+
parser.add_argument(
|
246 |
+
"--language",
|
247 |
+
type=str,
|
248 |
+
default="english",
|
249 |
+
help="(Optional)['english','chinese']",
|
250 |
+
)
|
251 |
+
|
252 |
+
args = parser.parse_args()
|
253 |
+
|
254 |
+
calc_metric(
|
255 |
+
args.ref_dir,
|
256 |
+
args.deg_dir,
|
257 |
+
args.dump_dir,
|
258 |
+
args.metrics,
|
259 |
+
fs=int(args.fs) if args.fs != "None" else None,
|
260 |
+
method=args.align_method,
|
261 |
+
db_scale=True if args.db_scale == "True" else False,
|
262 |
+
need_mean=True if args.f0_subtract_mean == "True" else False,
|
263 |
+
model_name=args.similarity_model,
|
264 |
+
similarity_mode=args.similarity_mode,
|
265 |
+
ltr_path=args.ltr_path,
|
266 |
+
intelligibility_mode=args.intelligibility_mode,
|
267 |
+
language=args.language,
|
268 |
+
)
|
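For reference, the evaluation entry point above can also be called programmatically. A minimal usage sketch, assuming hypothetical folder paths and a small metric list; the keyword arguments mirror the CLI wiring at the bottom of the file:

# Minimal sketch (hypothetical paths): evaluate a few spectrogram metrics
# and dump results/result.json.
calc_metric(
    "data/ref_wavs",      # ref_dir: ground-truth wavs named <uid>.wav
    "data/pred_wavs",     # deg_dir: generated wavs with matching uids
    "results",            # dump_dir
    ["pesq", "stoi", "f0rmse"],
    fs=16000,
    method="dtw",
    db_scale=True,
    need_mean=True,
    model_name="wavlm",
    similarity_mode="pairwith",
    ltr_path="None",
    intelligibility_mode="gt_audio",
    language="english",
)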
Amphion/bins/codec/inference.py
ADDED
@@ -0,0 +1,99 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import argparse
from argparse import ArgumentParser
import os

from models.codec.facodec.facodec_inference import FAcodecInference
from utils.util import load_config
import torch


def build_inference(args, cfg):
    supported_inference = {
        "FAcodec": FAcodecInference,
    }

    inference_class = supported_inference[cfg.model_type]
    inference = inference_class(args, cfg)
    return inference


def cuda_relevant(deterministic=False):
    torch.cuda.empty_cache()
    # TF32 on Ampere and above
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.enabled = True
    torch.backends.cudnn.allow_tf32 = True
    # Deterministic
    torch.backends.cudnn.deterministic = deterministic
    torch.backends.cudnn.benchmark = not deterministic
    torch.use_deterministic_algorithms(deterministic)


def build_parser():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--config",
        type=str,
        required=True,
        help="JSON/YAML file for configurations.",
    )
    parser.add_argument(
        "--checkpoint_path",
        type=str,
        default=None,
        help="Acoustic model checkpoint directory. If a directory is given, "
        "search for the latest checkpoint dir in the directory. If a specific "
        "checkpoint dir is given, directly load the checkpoint.",
    )
    parser.add_argument(
        "--source",
        type=str,
        required=True,
        help="Path to the source audio file",
    )
    parser.add_argument(
        "--reference",
        type=str,
        default=None,
        help="Path to the reference audio file; if provided, voice conversion "
        "to the reference speaker is also run.",
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        default=None,
        help="Output dir for saving generated results",
    )
    return parser


def main():
    # Parse arguments
    parser = build_parser()
    args = parser.parse_args()
    print(args)

    # Parse config
    cfg = load_config(args.config)

    # CUDA settings
    cuda_relevant()

    # Build inference
    inferencer = build_inference(args, cfg)

    # Run inference
    _ = inferencer.inference(args.source, args.output_dir)

    # Run voice conversion
    if args.reference is not None:
        _ = inferencer.voice_conversion(args.source, args.reference, args.output_dir)


if __name__ == "__main__":
    main()
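The same flow can be driven without the CLI. A minimal sketch using an argparse.Namespace in place of parsed arguments; all paths here are hypothetical:

from argparse import Namespace

# Hypothetical paths; build_inference dispatches on cfg.model_type ("FAcodec").
args = Namespace(
    config="exp_config.json",
    checkpoint_path=None,
    source="source.wav",
    reference="reference.wav",
    output_dir="outputs",
)
cfg = load_config(args.config)
inferencer = build_inference(args, cfg)
inferencer.inference(args.source, args.output_dir)                      # reconstruction
inferencer.voice_conversion(args.source, args.reference, args.output_dir)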
Amphion/bins/codec/train.py
ADDED
@@ -0,0 +1,79 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import argparse

import torch

from models.codec.facodec.facodec_trainer import FAcodecTrainer

from utils.util import load_config


def build_trainer(args, cfg):
    supported_trainer = {
        "FAcodec": FAcodecTrainer,
    }

    trainer_class = supported_trainer[cfg.model_type]
    trainer = trainer_class(args, cfg)
    return trainer


def cuda_relevant(deterministic=False):
    torch.cuda.empty_cache()
    # TF32 on Ampere and above
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.enabled = True
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.allow_tf32 = True
    # Deterministic
    torch.backends.cudnn.deterministic = deterministic
    torch.backends.cudnn.benchmark = not deterministic
    torch.use_deterministic_algorithms(deterministic)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--config",
        default="config.json",
        help="json files for configurations.",
        required=True,
    )
    parser.add_argument(
        "--exp_name",
        type=str,
        default="exp_name",
        help="A specific name to note the experiment",
        required=True,
    )
    parser.add_argument(
        "--resume_type",
        type=str,
        help="'resume' to continue training, 'finetune' for finetuning",
    )
    parser.add_argument(
        "--checkpoint",
        type=str,
        help="checkpoint to resume",
    )
    parser.add_argument(
        "--log_level", default="warning", help="logging level (debug, info, warning)"
    )
    args = parser.parse_args()
    cfg = load_config(args.config)

    # CUDA settings
    cuda_relevant()

    # Build trainer
    trainer = build_trainer(args, cfg)

    trainer.train_loop()


if __name__ == "__main__":
    main()
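The cuda_relevant helper centralizes the speed/reproducibility trade-off. A short sketch of the two modes, summarizing only the flags the function itself sets:

cuda_relevant(deterministic=True)
# -> cudnn.deterministic=True, cudnn.benchmark=False,
#    torch.use_deterministic_algorithms(True): reproducible, usually slower

cuda_relevant()  # deterministic=False, the default used in main()
# -> cudnn.benchmark=True: cuDNN autotunes kernels, faster but non-deterministic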
Amphion/bins/svc/inference.py
ADDED
@@ -0,0 +1,265 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import argparse
import os
import glob
from tqdm import tqdm
import json
import torch
import time

from models.svc.diffusion.diffusion_inference import DiffusionInference
from models.svc.comosvc.comosvc_inference import ComoSVCInference
from models.svc.transformer.transformer_inference import TransformerInference
from models.svc.vits.vits_inference import VitsInference
from utils.util import load_config
from utils.audio_slicer import split_audio, merge_segments_encodec
from processors import acoustic_extractor, content_extractor


def build_inference(args, cfg, infer_type="from_dataset"):
    supported_inference = {
        "DiffWaveNetSVC": DiffusionInference,
        "DiffComoSVC": ComoSVCInference,
        "TransformerSVC": TransformerInference,
        "VitsSVC": VitsInference,
    }

    inference_class = supported_inference[cfg.model_type]
    return inference_class(args, cfg, infer_type)


def prepare_for_audio_file(args, cfg, num_workers=1):
    preprocess_path = cfg.preprocess.processed_dir
    audio_name = cfg.inference.source_audio_name
    temp_audio_dir = os.path.join(preprocess_path, audio_name)

    ### eval file
    t = time.time()
    eval_file = prepare_source_eval_file(cfg, temp_audio_dir, audio_name)
    args.source = eval_file
    with open(eval_file, "r") as f:
        metadata = json.load(f)
    print("Prepare for meta eval data: {:.1f}s".format(time.time() - t))

    ### acoustic features
    t = time.time()
    acoustic_extractor.extract_utt_acoustic_features_serial(
        metadata, temp_audio_dir, cfg
    )
    if cfg.preprocess.use_min_max_norm_mel == True:
        acoustic_extractor.cal_mel_min_max(
            dataset=audio_name, output_path=preprocess_path, cfg=cfg, metadata=metadata
        )
    acoustic_extractor.cal_pitch_statistics_svc(
        dataset=audio_name, output_path=preprocess_path, cfg=cfg, metadata=metadata
    )
    print("Prepare for acoustic features: {:.1f}s".format(time.time() - t))

    ### content features
    t = time.time()
    content_extractor.extract_utt_content_features_dataloader(
        cfg, metadata, num_workers
    )
    print("Prepare for content features: {:.1f}s".format(time.time() - t))
    return args, cfg, temp_audio_dir


def merge_for_audio_segments(audio_files, args, cfg):
    audio_name = cfg.inference.source_audio_name
    target_singer_name = args.target_singer

    merge_segments_encodec(
        wav_files=audio_files,
        fs=cfg.preprocess.sample_rate,
        output_path=os.path.join(
            args.output_dir, "{}_{}.wav".format(audio_name, target_singer_name)
        ),
        overlap_duration=cfg.inference.segments_overlap_duration,
    )

    for tmp_file in audio_files:
        os.remove(tmp_file)


def prepare_source_eval_file(cfg, temp_audio_dir, audio_name):
    """
    Prepare the eval file (json) for an audio
    """

    audio_chunks_results = split_audio(
        wav_file=cfg.inference.source_audio_path,
        target_sr=cfg.preprocess.sample_rate,
        output_dir=os.path.join(temp_audio_dir, "wavs"),
        max_duration_of_segment=cfg.inference.segments_max_duration,
        overlap_duration=cfg.inference.segments_overlap_duration,
    )

    metadata = []
    for i, res in enumerate(audio_chunks_results):
        res["index"] = i
        res["Dataset"] = audio_name
        res["Singer"] = audio_name
        res["Uid"] = "{}_{}".format(audio_name, res["Uid"])
        metadata.append(res)

    eval_file = os.path.join(temp_audio_dir, "eval.json")
    with open(eval_file, "w") as f:
        json.dump(metadata, f, indent=4, ensure_ascii=False, sort_keys=True)

    return eval_file


def cuda_relevant(deterministic=False):
    torch.cuda.empty_cache()
    # TF32 on Ampere and above
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.enabled = True
    torch.backends.cudnn.allow_tf32 = True
    # Deterministic
    torch.backends.cudnn.deterministic = deterministic
    torch.backends.cudnn.benchmark = not deterministic
    torch.use_deterministic_algorithms(deterministic)


def infer(args, cfg, infer_type):
    # Build inference
    t = time.time()
    trainer = build_inference(args, cfg, infer_type)
    print("Model Init: {:.1f}s".format(time.time() - t))

    # Run inference
    t = time.time()
    output_audio_files = trainer.inference()
    print("Model inference: {:.1f}s".format(time.time() - t))
    return output_audio_files


def build_parser():
    r"""Build argument parser for inference.py.
    Anything else should be put in an extra config YAML file.
    """

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--config",
        type=str,
        required=True,
        help="JSON/YAML file for configurations.",
    )
    parser.add_argument(
        "--acoustics_dir",
        type=str,
        help="Acoustics model checkpoint directory. If a directory is given, "
        "search for the latest checkpoint dir in the directory. If a specific "
        "checkpoint dir is given, directly load the checkpoint.",
    )
    parser.add_argument(
        "--vocoder_dir",
        type=str,
        required=True,
        help="Vocoder checkpoint directory. Searching behavior is the same as "
        "the acoustics one.",
    )
    parser.add_argument(
        "--target_singer",
        type=str,
        required=True,
        help="convert to a specific singer (e.g. --target_singers singer_id).",
    )
    parser.add_argument(
        "--trans_key",
        default=0,
        help="0: no pitch shift; autoshift: pitch shift; int: key shift.",
    )
    parser.add_argument(
        "--source",
        type=str,
        default="source_audio",
        help="Source audio file or directory. If a JSON file is given, "
        "inference from dataset is applied. If a directory is given, "
        "inference from all wav/flac/mp3 audio files in the directory is applied. "
        "Default: inference from all wav/flac/mp3 audio files in ./source_audio",
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        default="conversion_results",
        help="Output directory. Default: ./conversion_results",
    )
    parser.add_argument(
        "--log_level",
        type=str,
        default="warning",
        help="Logging level. Default: warning",
    )
    parser.add_argument(
        "--keep_cache",
        action="store_true",
        default=True,
        help="Keep cache files. Only applicable to inference from files.",
    )
    parser.add_argument(
        "--diffusion_inference_steps",
        type=int,
        default=1000,
        help="Number of inference steps. Only applicable to diffusion inference.",
    )
    return parser


def main():
    ### Parse arguments and config
    args = build_parser().parse_args()
    cfg = load_config(args.config)

    # CUDA settings
    cuda_relevant()

    if os.path.isdir(args.source):
        ### Infer from file

        # Get all the source audio files (.wav, .flac, .mp3)
        source_audio_dir = args.source
        audio_list = []
        for suffix in ["wav", "flac", "mp3"]:
            audio_list += glob.glob(
                os.path.join(source_audio_dir, "**/*.{}".format(suffix)), recursive=True
            )
        print("There are {} source audios: ".format(len(audio_list)))

        # Infer for every file as dataset
        output_root_path = args.output_dir
        for audio_path in tqdm(audio_list):
            audio_name = audio_path.split("/")[-1].split(".")[0]
            args.output_dir = os.path.join(output_root_path, audio_name)
            print("\n{}\nConversion for {}...\n".format("*" * 10, audio_name))

            cfg.inference.source_audio_path = audio_path
            cfg.inference.source_audio_name = audio_name
            cfg.inference.segments_max_duration = 10.0
            cfg.inference.segments_overlap_duration = 1.0

            # Prepare metadata and features
            args, cfg, cache_dir = prepare_for_audio_file(args, cfg)

            # Infer from file
            output_audio_files = infer(args, cfg, infer_type="from_file")

            # Merge the split segments
            merge_for_audio_segments(output_audio_files, args, cfg)

            # Keep or remove caches
            if not args.keep_cache:
                os.removedirs(cache_dir)

    else:
        ### Infer from dataset
        infer(args, cfg, infer_type="from_dataset")


if __name__ == "__main__":
    main()
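Putting the helpers above together, the from-file branch of main() runs one pipeline per source clip. A condensed sketch, assuming args and cfg have already been parsed as in main() and using a hypothetical clip name:

# Condensed per-clip flow (hypothetical clip "demo.wav"):
cfg.inference.source_audio_path = "source_audio/demo.wav"
cfg.inference.source_audio_name = "demo"
cfg.inference.segments_max_duration = 10.0   # split into <= 10 s chunks
cfg.inference.segments_overlap_duration = 1.0

args, cfg, cache_dir = prepare_for_audio_file(args, cfg)   # features + eval.json
segment_files = infer(args, cfg, infer_type="from_file")   # per-segment wavs
merge_for_audio_segments(segment_files, args, cfg)         # writes demo_<target_singer>.wav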
Amphion/bins/svc/preprocess.py
ADDED
@@ -0,0 +1,183 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import faulthandler

faulthandler.enable()

import os
import argparse
import json
from multiprocessing import cpu_count


from utils.util import load_config
from preprocessors.processor import preprocess_dataset
from preprocessors.metadata import cal_metadata
from processors import acoustic_extractor, content_extractor, data_augment


def extract_acoustic_features(dataset, output_path, cfg, n_workers=1):
    """Extract acoustic features of utterances in the dataset

    Args:
        dataset (str): name of dataset, e.g. opencpop
        output_path (str): directory that stores train, test and feature files of datasets
        cfg (dict): dictionary that stores configurations
        n_workers (int, optional): num of processes to extract features in parallel. Defaults to 1.
    """
    types = ["train", "test"] if "eval" not in dataset else ["test"]
    metadata = []
    dataset_output = os.path.join(output_path, dataset)

    for dataset_type in types:
        dataset_file = os.path.join(dataset_output, "{}.json".format(dataset_type))
        with open(dataset_file, "r") as f:
            metadata.extend(json.load(f))

    # acoustic_extractor.extract_utt_acoustic_features_parallel(
    #     metadata, dataset_output, cfg, n_workers=n_workers
    # )
    acoustic_extractor.extract_utt_acoustic_features_serial(
        metadata, dataset_output, cfg
    )


def extract_content_features(dataset, output_path, cfg, num_workers=1):
    """Extract content features of utterances in the dataset

    Args:
        dataset (str): name of dataset, e.g. opencpop
        output_path (str): directory that stores train, test and feature files of datasets
        cfg (dict): dictionary that stores configurations
    """
    types = ["train", "test"] if "eval" not in dataset else ["test"]
    metadata = []
    for dataset_type in types:
        dataset_output = os.path.join(output_path, dataset)
        dataset_file = os.path.join(dataset_output, "{}.json".format(dataset_type))
        with open(dataset_file, "r") as f:
            metadata.extend(json.load(f))

    content_extractor.extract_utt_content_features_dataloader(
        cfg, metadata, num_workers
    )


def preprocess(cfg, args):
    """Preprocess raw data of single or multiple datasets (in cfg.dataset)

    Args:
        cfg (dict): dictionary that stores configurations
        args (ArgumentParser): specify the configuration file and num_workers
    """
    # Specify the output root path to save the processed data
    output_path = cfg.preprocess.processed_dir
    os.makedirs(output_path, exist_ok=True)

    ## Split train and test sets
    for dataset in cfg.dataset:
        print("Preprocess {}...".format(dataset))
        preprocess_dataset(
            dataset,
            cfg.dataset_path[dataset],
            output_path,
            cfg.preprocess,
            cfg.task_type,
            is_custom_dataset=dataset in cfg.use_custom_dataset,
        )

    # Data augmentation: create new wav files with pitch shift, formant shift, equalizer, time stretch
    try:
        assert isinstance(
            cfg.preprocess.data_augment, list
        ), "Please provide a list of datasets that need to be augmented."
        if len(cfg.preprocess.data_augment) > 0:
            new_datasets_list = []
            for dataset in cfg.preprocess.data_augment:
                new_datasets = data_augment.augment_dataset(cfg, dataset)
                new_datasets_list.extend(new_datasets)
            cfg.dataset.extend(new_datasets_list)
            print("Augmentation datasets: ", cfg.dataset)
    except:
        print("No Data Augmentation.")

    # Dump metadata of datasets (singers, train/test durations, etc.)
    cal_metadata(cfg)

    ## Prepare the acoustic features
    for dataset in cfg.dataset:
        # Skip augmented datasets which do not need to extract acoustic features
        # We will copy acoustic features from the original dataset later
        if (
            "pitch_shift" in dataset
            or "formant_shift" in dataset
            or "equalizer" in dataset
        ):
            continue
        print(
            "Extracting acoustic features for {} using {} workers ...".format(
                dataset, args.num_workers
            )
        )
        extract_acoustic_features(dataset, output_path, cfg, args.num_workers)
        # Calculate the statistics of acoustic features
        if cfg.preprocess.mel_min_max_norm:
            acoustic_extractor.cal_mel_min_max(dataset, output_path, cfg)

        if cfg.preprocess.extract_pitch:
            acoustic_extractor.cal_pitch_statistics_svc(dataset, output_path, cfg)

    # Copy acoustic features for augmented datasets by creating soft-links
    for dataset in cfg.dataset:
        if "pitch_shift" in dataset:
            src_dataset = dataset.replace("_pitch_shift", "")
            src_dataset_dir = os.path.join(output_path, src_dataset)
        elif "formant_shift" in dataset:
            src_dataset = dataset.replace("_formant_shift", "")
            src_dataset_dir = os.path.join(output_path, src_dataset)
        elif "equalizer" in dataset:
            src_dataset = dataset.replace("_equalizer", "")
            src_dataset_dir = os.path.join(output_path, src_dataset)
        else:
            continue
        dataset_dir = os.path.join(output_path, dataset)
        metadata = []
        for split in ["train", "test"] if not "eval" in dataset else ["test"]:
            metadata_file_path = os.path.join(src_dataset_dir, "{}.json".format(split))
            with open(metadata_file_path, "r") as f:
                metadata.extend(json.load(f))
        print("Copying acoustic features for {}...".format(dataset))
        acoustic_extractor.copy_acoustic_features(
            metadata, dataset_dir, src_dataset_dir, cfg
        )
        if cfg.preprocess.mel_min_max_norm:
            acoustic_extractor.cal_mel_min_max(dataset, output_path, cfg)

        if cfg.preprocess.extract_pitch:
            acoustic_extractor.cal_pitch_statistics(dataset, output_path, cfg)

    # Prepare the content features
    for dataset in cfg.dataset:
        print("Extracting content features for {}...".format(dataset))
        extract_content_features(dataset, output_path, cfg, args.num_workers)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--config", default="config.json", help="json files for configurations."
    )
    parser.add_argument("--num_workers", type=int, default=int(cpu_count()))
    parser.add_argument("--prepare_alignment", type=bool, default=False)

    args = parser.parse_args()
    cfg = load_config(args.config)

    preprocess(cfg, args)


if __name__ == "__main__":
    main()
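The feature-copying loop above recovers the source dataset name by stripping the augmentation suffix. A self-contained sketch of that rule (dataset names are hypothetical):

def source_dataset_of(dataset: str) -> str:
    # Mirror of the suffix-stripping branches in preprocess()
    for suffix in ("_pitch_shift", "_formant_shift", "_equalizer"):
        if dataset.endswith(suffix):
            return dataset[: -len(suffix)]
    return dataset

assert source_dataset_of("opencpop_pitch_shift") == "opencpop"
assert source_dataset_of("opencpop") == "opencpop"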
Amphion/bins/svc/train.py
ADDED
@@ -0,0 +1,111 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import argparse

import torch

from models.svc.diffusion.diffusion_trainer import DiffusionTrainer
from models.svc.comosvc.comosvc_trainer import ComoSVCTrainer
from models.svc.transformer.transformer_trainer import TransformerTrainer
from models.svc.vits.vits_trainer import VitsSVCTrainer
from utils.util import load_config


def build_trainer(args, cfg):
    supported_trainer = {
        "DiffWaveNetSVC": DiffusionTrainer,
        "DiffComoSVC": ComoSVCTrainer,
        "TransformerSVC": TransformerTrainer,
        "VitsSVC": VitsSVCTrainer,
    }

    trainer_class = supported_trainer[cfg.model_type]
    trainer = trainer_class(args, cfg)
    return trainer


def cuda_relevant(deterministic=False):
    torch.cuda.empty_cache()
    # TF32 on Ampere and above
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.enabled = True
    torch.backends.cudnn.allow_tf32 = True
    # Deterministic
    torch.backends.cudnn.deterministic = deterministic
    torch.backends.cudnn.benchmark = not deterministic
    torch.use_deterministic_algorithms(deterministic)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--config",
        default="config.json",
        help="json files for configurations.",
        required=True,
    )
    parser.add_argument(
        "--exp_name",
        type=str,
        default="exp_name",
        help="A specific name to note the experiment",
        required=True,
    )
    parser.add_argument(
        "--resume",
        action="store_true",
        help="If specified, to resume from the existing checkpoint.",
    )
    parser.add_argument(
        "--resume_from_ckpt_path",
        type=str,
        default="",
        help="The specific checkpoint path that you want to resume from.",
    )
    parser.add_argument(
        "--resume_type",
        type=str,
        default="",
        help="`resume` for loading all the things (including model weights, optimizer, scheduler, and random states). `finetune` for loading only the model weights",
    )

    parser.add_argument(
        "--log_level", default="warning", help="logging level (debug, info, warning)"
    )
    args = parser.parse_args()
    cfg = load_config(args.config)

    # Data Augmentation
    if (
        type(cfg.preprocess.data_augment) == list
        and len(cfg.preprocess.data_augment) > 0
    ):
        new_datasets_list = []
        for dataset in cfg.preprocess.data_augment:
            new_datasets = [
                f"{dataset}_pitch_shift" if cfg.preprocess.use_pitch_shift else None,
                (
                    f"{dataset}_formant_shift"
                    if cfg.preprocess.use_formant_shift
                    else None
                ),
                f"{dataset}_equalizer" if cfg.preprocess.use_equalizer else None,
                f"{dataset}_time_stretch" if cfg.preprocess.use_time_stretch else None,
            ]
            new_datasets_list.extend(filter(None, new_datasets))
        cfg.dataset.extend(new_datasets_list)

    # CUDA settings
    cuda_relevant()

    # Build trainer
    trainer = build_trainer(args, cfg)

    trainer.train_loop()


if __name__ == "__main__":
    main()
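A worked example of the augmentation expansion in main(): with data_augment listing one dataset and only pitch shift and equalizer enabled (flag values here are hypothetical), cfg.dataset grows by two entries:

# Hypothetical flag settings mirroring cfg.preprocess.use_*:
flags = {"pitch_shift": True, "formant_shift": False,
         "equalizer": True, "time_stretch": False}
dataset = "opencpop"
expanded = [f"{dataset}_{name}" for name, on in flags.items() if on]
print(expanded)  # ['opencpop_pitch_shift', 'opencpop_equalizer']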
Amphion/bins/tta/inference.py
ADDED
@@ -0,0 +1,94 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import argparse
from argparse import ArgumentParser
import os

from models.tta.ldm.audioldm_inference import AudioLDMInference
from utils.util import save_config, load_model_config, load_config
import numpy as np
import torch


def build_inference(args, cfg):
    supported_inference = {
        "AudioLDM": AudioLDMInference,
    }

    inference_class = supported_inference[cfg.model_type]
    inference = inference_class(args, cfg)
    return inference


def build_parser():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--config",
        type=str,
        required=True,
        help="JSON/YAML file for configurations.",
    )
    parser.add_argument(
        "--text",
        help="Text to be synthesized",
        type=str,
        default="Text to be synthesized.",
    )
    parser.add_argument(
        "--checkpoint_path",
        type=str,
    )
    parser.add_argument(
        "--vocoder_path", type=str, help="Checkpoint path of the vocoder"
    )
    parser.add_argument(
        "--vocoder_config_path", type=str, help="Config path of the vocoder"
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        default=None,
        help="Output dir for saving generated results",
    )
    parser.add_argument(
        "--num_steps",
        type=int,
        default=200,
        help="The total number of denoising steps",
    )
    parser.add_argument(
        "--guidance_scale",
        type=float,
        default=4.0,
        help="The scale of classifier-free guidance",
    )
    parser.add_argument("--local_rank", default=-1, type=int)
    return parser


def main():
    # Parse arguments
    args = build_parser().parse_args()
    # args, infer_type = formulate_parser(args)

    # Parse config
    cfg = load_config(args.config)
    if torch.cuda.is_available():
        args.local_rank = torch.device("cuda")
    else:
        args.local_rank = torch.device("cpu")
    print("args: ", args)

    # Build inference
    inferencer = build_inference(args, cfg)

    # Run inference
    inferencer.inference()


if __name__ == "__main__":
    main()
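A programmatic sketch of the same text-to-audio call, with hypothetical checkpoint and config paths; num_steps and guidance_scale set the denoising schedule length and classifier-free guidance strength:

from argparse import Namespace
import torch

# All paths below are hypothetical placeholders.
args = Namespace(
    config="audioldm_config.json",
    text="A dog barking in the distance.",
    checkpoint_path="ckpts/tta/audioldm",
    vocoder_path="ckpts/tta/vocoder",
    vocoder_config_path="ckpts/tta/vocoder/config.json",
    output_dir="outputs",
    num_steps=200,
    guidance_scale=4.0,
    local_rank=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)
inferencer = build_inference(args, load_config(args.config))
inferencer.inference()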
Amphion/bins/tta/preprocess.py
ADDED
@@ -0,0 +1,195 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import faulthandler

faulthandler.enable()

import os
import argparse
import json
import pyworld as pw
from multiprocessing import cpu_count


from utils.util import load_config
from preprocessors.processor import preprocess_dataset, prepare_align
from preprocessors.metadata import cal_metadata
from processors import acoustic_extractor, content_extractor, data_augment


def extract_acoustic_features(dataset, output_path, cfg, n_workers=1):
    """Extract acoustic features of utterances in the dataset

    Args:
        dataset (str): name of dataset, e.g. opencpop
        output_path (str): directory that stores train, test and feature files of datasets
        cfg (dict): dictionary that stores configurations
        n_workers (int, optional): num of processes to extract features in parallel. Defaults to 1.
    """
    types = ["train", "test"] if "eval" not in dataset else ["test"]
    metadata = []
    for dataset_type in types:
        dataset_output = os.path.join(output_path, dataset)
        dataset_file = os.path.join(dataset_output, "{}.json".format(dataset_type))
        with open(dataset_file, "r") as f:
            metadata.extend(json.load(f))

    # acoustic_extractor.extract_utt_acoustic_features_parallel(
    #     metadata, dataset_output, cfg, n_workers=n_workers
    # )
    acoustic_extractor.extract_utt_acoustic_features_serial(
        metadata, dataset_output, cfg
    )


def extract_content_features(dataset, output_path, cfg, num_workers=1):
    """Extract content features of utterances in the dataset

    Args:
        dataset (str): name of dataset, e.g. opencpop
        output_path (str): directory that stores train, test and feature files of datasets
        cfg (dict): dictionary that stores configurations
    """
    types = ["train", "test"] if "eval" not in dataset else ["test"]
    metadata = []
    for dataset_type in types:
        dataset_output = os.path.join(output_path, dataset)
        dataset_file = os.path.join(dataset_output, "{}.json".format(dataset_type))
        with open(dataset_file, "r") as f:
            metadata.extend(json.load(f))

    content_extractor.extract_utt_content_features_dataloader(
        cfg, metadata, num_workers
    )


def preprocess(cfg, args):
    """Preprocess raw data of single or multiple datasets (in cfg.dataset)

    Args:
        cfg (dict): dictionary that stores configurations
        args (ArgumentParser): specify the configuration file and num_workers
    """
    # Specify the output root path to save the processed data
    output_path = cfg.preprocess.processed_dir
    os.makedirs(output_path, exist_ok=True)

    ## Split train and test sets
    for dataset in cfg.dataset:
        print("Preprocess {}...".format(dataset))

        if args.prepare_alignment:
            ## Prepare alignment with MFA
            print("Prepare alignment {}...".format(dataset))
            prepare_align(
                dataset, cfg.dataset_path[dataset], cfg.preprocess, output_path
            )
        preprocess_dataset(
            dataset,
            cfg.dataset_path[dataset],
            output_path,
            cfg.preprocess,
            cfg.task_type,
            is_custom_dataset=dataset in cfg.use_custom_dataset,
        )

    # Data augmentation: create new wav files with pitch shift, formant shift, equalizer, time stretch
    try:
        assert isinstance(
            cfg.preprocess.data_augment, list
        ), "Please provide a list of datasets that need to be augmented."
        if len(cfg.preprocess.data_augment) > 0:
            new_datasets_list = []
            for dataset in cfg.preprocess.data_augment:
                new_datasets = data_augment.augment_dataset(cfg, dataset)
                new_datasets_list.extend(new_datasets)
            cfg.dataset.extend(new_datasets_list)
            print("Augmentation datasets: ", cfg.dataset)
    except:
        print("No Data Augmentation.")

    # Dump metadata of datasets (singers, train/test durations, etc.)
    cal_metadata(cfg)

    ## Prepare the acoustic features
    for dataset in cfg.dataset:
        # Skip augmented datasets which do not need to extract acoustic features
        # We will copy acoustic features from the original dataset later
        if (
            "pitch_shift" in dataset
            or "formant_shift" in dataset
            or "equalizer" in dataset
        ):
            continue
        print(
            "Extracting acoustic features for {} using {} workers ...".format(
                dataset, args.num_workers
            )
        )
        extract_acoustic_features(dataset, output_path, cfg, args.num_workers)
        # Calculate the statistics of acoustic features
        if cfg.preprocess.mel_min_max_norm:
            acoustic_extractor.cal_mel_min_max(dataset, output_path, cfg)

        if cfg.preprocess.extract_pitch:
            acoustic_extractor.cal_pitch_statistics(dataset, output_path, cfg)
        if cfg.preprocess.extract_energy:
            acoustic_extractor.cal_energy_statistics(dataset, output_path, cfg)

        if cfg.preprocess.align_mel_duration:
            acoustic_extractor.align_duration_mel(dataset, output_path, cfg)

    # Copy acoustic features for augmented datasets by creating soft-links
    for dataset in cfg.dataset:
        if "pitch_shift" in dataset:
            src_dataset = dataset.replace("_pitch_shift", "")
            src_dataset_dir = os.path.join(output_path, src_dataset)
        elif "formant_shift" in dataset:
            src_dataset = dataset.replace("_formant_shift", "")
            src_dataset_dir = os.path.join(output_path, src_dataset)
        elif "equalizer" in dataset:
            src_dataset = dataset.replace("_equalizer", "")
            src_dataset_dir = os.path.join(output_path, src_dataset)
        else:
            continue
        dataset_dir = os.path.join(output_path, dataset)
        metadata = []
        for split in ["train", "test"] if not "eval" in dataset else ["test"]:
            metadata_file_path = os.path.join(src_dataset_dir, "{}.json".format(split))
            with open(metadata_file_path, "r") as f:
                metadata.extend(json.load(f))
        print("Copying acoustic features for {}...".format(dataset))
        acoustic_extractor.copy_acoustic_features(
            metadata, dataset_dir, src_dataset_dir, cfg
        )
        if cfg.preprocess.mel_min_max_norm:
            acoustic_extractor.cal_mel_min_max(dataset, output_path, cfg)

        if cfg.preprocess.extract_pitch:
            acoustic_extractor.cal_pitch_statistics(dataset, output_path, cfg)

    # Prepare the content features
    for dataset in cfg.dataset:
        print("Extracting content features for {}...".format(dataset))
        extract_content_features(dataset, output_path, cfg, args.num_workers)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--config", default="config.json", help="json files for configurations."
    )
    parser.add_argument("--num_workers", type=int, default=int(cpu_count()))
    parser.add_argument("--prepare_alignment", type=bool, default=False)

    args = parser.parse_args()
    cfg = load_config(args.config)

    preprocess(cfg, args)


if __name__ == "__main__":
    main()
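Both extractor helpers above use the same split-selection rule. A minimal sketch of that rule, with hypothetical dataset names:

def splits_for(dataset: str):
    # Evaluation-only datasets carry "eval" in their name and have no train split.
    return ["train", "test"] if "eval" not in dataset else ["test"]

assert splits_for("audiocaps") == ["train", "test"]
assert splits_for("audiocaps_eval") == ["test"]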
Amphion/bins/tta/train_tta.py
ADDED
@@ -0,0 +1,77 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import argparse
import os
import torch

from models.tta.autoencoder.autoencoder_trainer import AutoencoderKLTrainer
from models.tta.ldm.audioldm_trainer import AudioLDMTrainer
from utils.util import load_config


def build_trainer(args, cfg):
    supported_trainer = {
        "AutoencoderKL": AutoencoderKLTrainer,
        "AudioLDM": AudioLDMTrainer,
    }

    trainer_class = supported_trainer[cfg.model_type]
    trainer = trainer_class(args, cfg)
    return trainer


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--config",
        default="config.json",
        help="json files for configurations.",
        required=True,
    )
    parser.add_argument(
        "--num_workers", type=int, default=6, help="Number of dataloader workers."
    )
    parser.add_argument(
        "--exp_name",
        type=str,
        default="exp_name",
        help="A specific name to note the experiment",
        required=True,
    )
    parser.add_argument(
        "--resume",
        type=str,
        default=None,
        # action="store_true",
        help="The model name to restore",
    )
    parser.add_argument(
        "--log_level", default="info", help="logging level (info, debug, warning)"
    )
    parser.add_argument("--stdout_interval", default=5, type=int)
    parser.add_argument("--local_rank", default=-1, type=int)
    args = parser.parse_args()
    cfg = load_config(args.config)
    cfg.exp_name = args.exp_name

    # Model saving dir
    args.log_dir = os.path.join(cfg.log_dir, args.exp_name)
    os.makedirs(args.log_dir, exist_ok=True)

    if not cfg.train.ddp:
        args.local_rank = torch.device("cuda")

    # Build trainer
    trainer = build_trainer(args, cfg)

    # Restore models
    if args.resume:
        trainer.restore()
    trainer.train()


if __name__ == "__main__":
    main()
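A sketch of a programmatic launch mirroring main(), with hypothetical config and experiment names; the config's model_type selects between the two trainers:

import os
from argparse import Namespace

# Hypothetical values throughout.
args = Namespace(
    config="autoencoderkl_config.json",
    num_workers=6,
    exp_name="autoencoderkl_run1",
    resume=None,
    log_level="info",
    stdout_interval=5,
    local_rank=-1,
)
cfg = load_config(args.config)
cfg.exp_name = args.exp_name
args.log_dir = os.path.join(cfg.log_dir, args.exp_name)
os.makedirs(args.log_dir, exist_ok=True)
trainer = build_trainer(args, cfg)
trainer.train()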
Amphion/bins/tts/inference.py
ADDED
@@ -0,0 +1,169 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import argparse
from argparse import ArgumentParser
import os

from models.tts.fastspeech2.fs2_inference import FastSpeech2Inference
from models.tts.vits.vits_inference import VitsInference
from models.tts.valle.valle_inference import VALLEInference
from models.tts.naturalspeech2.ns2_inference import NS2Inference
from models.tts.jets.jets_inference import JetsInference
from utils.util import load_config
import torch


def build_inference(args, cfg):
    supported_inference = {
        "FastSpeech2": FastSpeech2Inference,
        "VITS": VitsInference,
        "VALLE": VALLEInference,
        "NaturalSpeech2": NS2Inference,
        "Jets": JetsInference,
    }

    inference_class = supported_inference[cfg.model_type]
    inference = inference_class(args, cfg)
    return inference


def cuda_relevant(deterministic=False):
    torch.cuda.empty_cache()
    # TF32 on Ampere and above
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.enabled = True
    torch.backends.cudnn.allow_tf32 = True
    # Deterministic
    torch.backends.cudnn.deterministic = deterministic
    torch.backends.cudnn.benchmark = not deterministic
    torch.use_deterministic_algorithms(deterministic)


def build_parser():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--config",
        type=str,
        required=True,
        help="JSON/YAML file for configurations.",
    )
    parser.add_argument(
        "--dataset",
        type=str,
        help="convert from the source data",
        default=None,
    )
    parser.add_argument(
        "--testing_set",
        type=str,
        help="train, test, golden_test",
        default="test",
    )
    parser.add_argument(
        "--test_list_file",
        type=str,
        help="convert from the test list file",
        default=None,
    )
    parser.add_argument(
        "--speaker_name",
        type=str,
        default=None,
        help="speaker name for multi-speaker synthesis, for single-sentence mode only",
    )
    parser.add_argument(
        "--text",
        help="Text to be synthesized.",
        type=str,
        default="",
    )
    parser.add_argument(
        "--vocoder_dir",
        type=str,
        default=None,
        help="Vocoder checkpoint directory. Searching behavior is the same as "
        "the acoustics one.",
    )
    parser.add_argument(
        "--acoustics_dir",
        type=str,
        default=None,
        help="Acoustic model checkpoint directory. If a directory is given, "
        "search for the latest checkpoint dir in the directory. If a specific "
        "checkpoint dir is given, directly load the checkpoint.",
    )
    parser.add_argument(
        "--checkpoint_path",
        type=str,
        default=None,
        help="Acoustic model checkpoint directory. If a directory is given, "
        "search for the latest checkpoint dir in the directory. If a specific "
        "checkpoint dir is given, directly load the checkpoint.",
    )
    parser.add_argument(
        "--mode",
        type=str,
        choices=["batch", "single"],
        required=True,
        help="Synthesize a whole dataset or a single sentence",
    )
    parser.add_argument(
        "--log_level",
        type=str,
        default="warning",
        help="Logging level. Default: warning",
    )
    parser.add_argument(
        "--pitch_control",
        type=float,
        default=1.0,
        help="control the pitch of the whole utterance, larger value for higher pitch",
    )
    parser.add_argument(
        "--energy_control",
        type=float,
        default=1.0,
        help="control the energy of the whole utterance, larger value for larger volume",
    )
    parser.add_argument(
        "--duration_control",
        type=float,
        default=1.0,
        help="control the speed of the whole utterance, larger value for slower speaking rate",
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        default=None,
        help="Output dir for saving generated results",
    )
    return parser


def main():
    # Parse arguments
    parser = build_parser()
    VALLEInference.add_arguments(parser)
    NS2Inference.add_arguments(parser)
    args = parser.parse_args()
    print(args)

    # Parse config
    cfg = load_config(args.config)

    # CUDA settings
    cuda_relevant()

    # Build inference
    inferencer = build_inference(args, cfg)

    # Run inference
    inferencer.inference()


if __name__ == "__main__":
    main()
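A sketch of single-sentence synthesis through the parser above, assuming the add_arguments hooks register only optional flags; config and checkpoint paths are hypothetical:

parser = build_parser()
VALLEInference.add_arguments(parser)
NS2Inference.add_arguments(parser)
args = parser.parse_args([
    "--config", "vits_config.json",       # hypothetical
    "--mode", "single",
    "--text", "Hello from Amphion.",
    "--acoustics_dir", "ckpts/tts/vits",  # hypothetical
    "--output_dir", "outputs",
])
inferencer = build_inference(args, load_config(args.config))
inferencer.inference()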
Amphion/bins/tts/preprocess.py
ADDED
@@ -0,0 +1,244 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import faulthandler

faulthandler.enable()

import os
import argparse
import json
import pyworld as pw
from multiprocessing import cpu_count


from utils.util import load_config
from preprocessors.processor import preprocess_dataset, prepare_align
from preprocessors.metadata import cal_metadata
from processors import (
    acoustic_extractor,
    content_extractor,
    data_augment,
    phone_extractor,
)


def extract_acoustic_features(dataset, output_path, cfg, dataset_types, n_workers=1):
    """Extract acoustic features of utterances in the dataset

    Args:
        dataset (str): name of dataset, e.g. opencpop
        output_path (str): directory that stores train, test and feature files of datasets
        cfg (dict): dictionary that stores configurations
        n_workers (int, optional): num of processes to extract features in parallel. Defaults to 1.
    """

    metadata = []
    for dataset_type in dataset_types:
        dataset_output = os.path.join(output_path, dataset)
        dataset_file = os.path.join(dataset_output, "{}.json".format(dataset_type))
        with open(dataset_file, "r") as f:
            metadata.extend(json.load(f))

    # acoustic_extractor.extract_utt_acoustic_features_parallel(
    #     metadata, dataset_output, cfg, n_workers=n_workers
    # )
    acoustic_extractor.extract_utt_acoustic_features_serial(
        metadata, dataset_output, cfg
    )


def extract_content_features(dataset, output_path, cfg, dataset_types, num_workers=1):
    """Extract content features of utterances in the dataset

    Args:
        dataset (str): name of dataset, e.g. opencpop
        output_path (str): directory that stores train, test and feature files of datasets
        cfg (dict): dictionary that stores configurations
    """

    metadata = []
    for dataset_type in dataset_types:
        dataset_output = os.path.join(output_path, dataset)
        dataset_file = os.path.join(dataset_output, "{}.json".format(dataset_type))
        with open(dataset_file, "r") as f:
            metadata.extend(json.load(f))

    content_extractor.extract_utt_content_features_dataloader(
        cfg, metadata, num_workers
    )


def extract_phonme_sequences(dataset, output_path, cfg, dataset_types):
    """Extract phoneme features of utterances in the dataset

    Args:
        dataset (str): name of dataset, e.g. opencpop
        output_path (str): directory that stores train, test and feature files of datasets
        cfg (dict): dictionary that stores configurations

    """

    metadata = []
    for dataset_type in dataset_types:
        dataset_output = os.path.join(output_path, dataset)
        dataset_file = os.path.join(dataset_output, "{}.json".format(dataset_type))
        with open(dataset_file, "r") as f:
            metadata.extend(json.load(f))
    phone_extractor.extract_utt_phone_sequence(dataset, cfg, metadata)


def preprocess(cfg, args):
    """Preprocess raw data of single or multiple datasets (in cfg.dataset)

    Args:
        cfg (dict): dictionary that stores configurations
        args (ArgumentParser): specify the configuration file and num_workers
    """
    # Specify the output root path to save the processed data
    output_path = cfg.preprocess.processed_dir
    os.makedirs(output_path, exist_ok=True)

    # Split train and test sets
    for dataset in cfg.dataset:
+
print("Preprocess {}...".format(dataset))
|
108 |
+
|
109 |
+
if args.prepare_alignment:
|
110 |
+
# Prepare alignment with MFA
|
111 |
+
print("Prepare alignment {}...".format(dataset))
|
112 |
+
prepare_align(
|
113 |
+
dataset, cfg.dataset_path[dataset], cfg.preprocess, output_path
|
114 |
+
)
|
115 |
+
|
116 |
+
preprocess_dataset(
|
117 |
+
dataset,
|
118 |
+
cfg.dataset_path[dataset],
|
119 |
+
output_path,
|
120 |
+
cfg.preprocess,
|
121 |
+
cfg.task_type,
|
122 |
+
is_custom_dataset=dataset in cfg.use_custom_dataset,
|
123 |
+
)
|
124 |
+
|
125 |
+
# Data augmentation: create new wav files with pitch shift, formant shift, equalizer, time stretch
|
126 |
+
try:
|
127 |
+
assert isinstance(
|
128 |
+
cfg.preprocess.data_augment, list
|
129 |
+
), "Please provide a list of datasets need to be augmented."
|
130 |
+
if len(cfg.preprocess.data_augment) > 0:
|
131 |
+
new_datasets_list = []
|
132 |
+
for dataset in cfg.preprocess.data_augment:
|
133 |
+
new_datasets = data_augment.augment_dataset(cfg, dataset)
|
134 |
+
new_datasets_list.extend(new_datasets)
|
135 |
+
cfg.dataset.extend(new_datasets_list)
|
136 |
+
print("Augmentation datasets: ", cfg.dataset)
|
137 |
+
except:
|
138 |
+
print("No Data Augmentation.")
|
139 |
+
|
140 |
+
# json files
|
141 |
+
dataset_types = list()
|
142 |
+
dataset_types.append((cfg.preprocess.train_file).split(".")[0])
|
143 |
+
dataset_types.append((cfg.preprocess.valid_file).split(".")[0])
|
144 |
+
if "test" not in dataset_types:
|
145 |
+
dataset_types.append("test")
|
146 |
+
if "eval" in dataset:
|
147 |
+
dataset_types = ["test"]
|
148 |
+
|
149 |
+
# Dump metadata of datasets (singers, train/test durations, etc.)
|
150 |
+
cal_metadata(cfg, dataset_types)
|
151 |
+
|
152 |
+
# Prepare the acoustic features
|
153 |
+
for dataset in cfg.dataset:
|
154 |
+
# Skip augmented datasets which do not need to extract acoustic features
|
155 |
+
# We will copy acoustic features from the original dataset later
|
156 |
+
if (
|
157 |
+
"pitch_shift" in dataset
|
158 |
+
or "formant_shift" in dataset
|
159 |
+
or "equalizer" in dataset in dataset
|
160 |
+
):
|
161 |
+
continue
|
162 |
+
print(
|
163 |
+
"Extracting acoustic features for {} using {} workers ...".format(
|
164 |
+
dataset, args.num_workers
|
165 |
+
)
|
166 |
+
)
|
167 |
+
extract_acoustic_features(
|
168 |
+
dataset, output_path, cfg, dataset_types, args.num_workers
|
169 |
+
)
|
170 |
+
# Calculate the statistics of acoustic features
|
171 |
+
if cfg.preprocess.mel_min_max_norm:
|
172 |
+
acoustic_extractor.cal_mel_min_max(dataset, output_path, cfg)
|
173 |
+
|
174 |
+
if cfg.preprocess.extract_pitch:
|
175 |
+
acoustic_extractor.cal_pitch_statistics(dataset, output_path, cfg)
|
176 |
+
|
177 |
+
if cfg.preprocess.extract_energy:
|
178 |
+
acoustic_extractor.cal_energy_statistics(dataset, output_path, cfg)
|
179 |
+
|
180 |
+
if cfg.preprocess.pitch_norm:
|
181 |
+
acoustic_extractor.normalize(dataset, cfg.preprocess.pitch_dir, cfg)
|
182 |
+
|
183 |
+
if cfg.preprocess.energy_norm:
|
184 |
+
acoustic_extractor.normalize(dataset, cfg.preprocess.energy_dir, cfg)
|
185 |
+
|
186 |
+
# Copy acoustic features for augmented datasets by creating soft-links
|
187 |
+
for dataset in cfg.dataset:
|
188 |
+
if "pitch_shift" in dataset:
|
189 |
+
src_dataset = dataset.replace("_pitch_shift", "")
|
190 |
+
src_dataset_dir = os.path.join(output_path, src_dataset)
|
191 |
+
elif "formant_shift" in dataset:
|
192 |
+
src_dataset = dataset.replace("_formant_shift", "")
|
193 |
+
src_dataset_dir = os.path.join(output_path, src_dataset)
|
194 |
+
elif "equalizer" in dataset:
|
195 |
+
src_dataset = dataset.replace("_equalizer", "")
|
196 |
+
src_dataset_dir = os.path.join(output_path, src_dataset)
|
197 |
+
else:
|
198 |
+
continue
|
199 |
+
dataset_dir = os.path.join(output_path, dataset)
|
200 |
+
metadata = []
|
201 |
+
for split in ["train", "test"] if not "eval" in dataset else ["test"]:
|
202 |
+
metadata_file_path = os.path.join(src_dataset_dir, "{}.json".format(split))
|
203 |
+
with open(metadata_file_path, "r") as f:
|
204 |
+
metadata.extend(json.load(f))
|
205 |
+
print("Copying acoustic features for {}...".format(dataset))
|
206 |
+
acoustic_extractor.copy_acoustic_features(
|
207 |
+
metadata, dataset_dir, src_dataset_dir, cfg
|
208 |
+
)
|
209 |
+
if cfg.preprocess.mel_min_max_norm:
|
210 |
+
acoustic_extractor.cal_mel_min_max(dataset, output_path, cfg)
|
211 |
+
|
212 |
+
if cfg.preprocess.extract_pitch:
|
213 |
+
acoustic_extractor.cal_pitch_statistics(dataset, output_path, cfg)
|
214 |
+
|
215 |
+
# Prepare the content features
|
216 |
+
for dataset in cfg.dataset:
|
217 |
+
print("Extracting content features for {}...".format(dataset))
|
218 |
+
extract_content_features(
|
219 |
+
dataset, output_path, cfg, dataset_types, args.num_workers
|
220 |
+
)
|
221 |
+
|
222 |
+
# Prepare the phenome squences
|
223 |
+
if cfg.preprocess.extract_phone:
|
224 |
+
for dataset in cfg.dataset:
|
225 |
+
print("Extracting phoneme sequence for {}...".format(dataset))
|
226 |
+
extract_phonme_sequences(dataset, output_path, cfg, dataset_types)
|
227 |
+
|
228 |
+
|
229 |
+
def main():
|
230 |
+
parser = argparse.ArgumentParser()
|
231 |
+
parser.add_argument(
|
232 |
+
"--config", default="config.json", help="json files for configurations."
|
233 |
+
)
|
234 |
+
parser.add_argument("--num_workers", type=int, default=int(cpu_count()))
|
235 |
+
parser.add_argument("--prepare_alignment", type=bool, default=False)
|
236 |
+
|
237 |
+
args = parser.parse_args()
|
238 |
+
cfg = load_config(args.config)
|
239 |
+
|
240 |
+
preprocess(cfg, args)
|
241 |
+
|
242 |
+
|
243 |
+
if __name__ == "__main__":
|
244 |
+
main()
|
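The soft-link copy step above relies on a naming convention: an augmented dataset is its source dataset's name plus an augmentation suffix. A minimal sketch of that mapping (the dataset names here are only examples):

# Demo only: strip the augmentation suffix to find the source feature dir.
for name in ["opencpop_pitch_shift", "opencpop_equalizer", "opencpop"]:
    for suffix in ("_pitch_shift", "_formant_shift", "_equalizer"):
        if name.endswith(suffix):
            print(name, "->", name[: -len(suffix)])
            break
    else:
        print(name, "-> original dataset; features are extracted directly")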
Amphion/bins/tts/train.py
ADDED
@@ -0,0 +1,152 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import argparse

import torch

from models.tts.fastspeech2.fs2_trainer import FastSpeech2Trainer
from models.tts.vits.vits_trainer import VITSTrainer
from models.tts.valle.valle_trainer import VALLETrainer
from models.tts.naturalspeech2.ns2_trainer import NS2Trainer
from models.tts.valle_v2.valle_ar_trainer import ValleARTrainer as VALLE_V2_AR
from models.tts.valle_v2.valle_nar_trainer import ValleNARTrainer as VALLE_V2_NAR
from models.tts.jets.jets_trainer import JetsTrainer

from utils.util import load_config


def build_trainer(args, cfg):
    supported_trainer = {
        "FastSpeech2": FastSpeech2Trainer,
        "VITS": VITSTrainer,
        "VALLE": VALLETrainer,
        "NaturalSpeech2": NS2Trainer,
        "VALLE_V2_AR": VALLE_V2_AR,
        "VALLE_V2_NAR": VALLE_V2_NAR,
        "Jets": JetsTrainer,
    }

    trainer_class = supported_trainer[cfg.model_type]
    trainer = trainer_class(args, cfg)
    return trainer


def cuda_relevant(deterministic=False):
    torch.cuda.empty_cache()
    # TF32 on Ampere and above
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.enabled = True
    torch.backends.cudnn.allow_tf32 = True
    # Deterministic
    torch.backends.cudnn.deterministic = deterministic
    torch.backends.cudnn.benchmark = not deterministic
    torch.use_deterministic_algorithms(deterministic)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--config",
        default="config.json",
        help="json files for configurations.",
        required=True,
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=1234,
        help="random seed",
        required=False,
    )
    parser.add_argument(
        "--exp_name",
        type=str,
        default="exp_name",
        help="A specific name to note the experiment",
        required=True,
    )
    parser.add_argument(
        "--resume", action="store_true", help="Resume training from a checkpoint"
    )
    parser.add_argument(
        "--test", action="store_true", default=False, help="Test the model"
    )
    parser.add_argument(
        "--log_level", default="warning", help="logging level (debug, info, warning)"
    )
    parser.add_argument(
        "--resume_type",
        type=str,
        default="resume",
        help="Resume training or finetuning.",
    )
    parser.add_argument(
        "--checkpoint_path",
        type=str,
        default=None,
        help="Checkpoint for resume training or finetuning.",
    )
    parser.add_argument(
        "--resume_from_ckpt_path",
        type=str,
        default="",
        help="Checkpoint for resume training or finetuning.",
    )
    # VALLETrainer.add_arguments(parser)
    args = parser.parse_args()
    cfg = load_config(args.config)

    # Data Augmentation
    if hasattr(cfg, "preprocess"):
        if hasattr(cfg.preprocess, "data_augment"):
            if (
                isinstance(cfg.preprocess.data_augment, list)
                and len(cfg.preprocess.data_augment) > 0
            ):
                new_datasets_list = []
                for dataset in cfg.preprocess.data_augment:
                    new_datasets = [
                        (
                            f"{dataset}_pitch_shift"
                            if cfg.preprocess.use_pitch_shift
                            else None
                        ),
                        (
                            f"{dataset}_formant_shift"
                            if cfg.preprocess.use_formant_shift
                            else None
                        ),
                        (
                            f"{dataset}_equalizer"
                            if cfg.preprocess.use_equalizer
                            else None
                        ),
                        (
                            f"{dataset}_time_stretch"
                            if cfg.preprocess.use_time_stretch
                            else None
                        ),
                    ]
                    new_datasets_list.extend(filter(None, new_datasets))
                cfg.dataset.extend(new_datasets_list)

    print("experiment name: ", args.exp_name)
    # CUDA settings
    cuda_relevant()

    # Build trainer
    print(f"Building {cfg.model_type} trainer")
    trainer = build_trainer(args, cfg)
    print(f"Start training {cfg.model_type} model")
    if args.test:
        trainer.test_loop()
    else:
        trainer.train_loop()


if __name__ == "__main__":
    main()
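Note the invariant encoded by cuda_relevant(): cuDNN autotuning (benchmark) and determinism are treated as mutually exclusive. A small sketch of that rule in isolation (illustrative only, not part of the upload):

def resolve_cudnn_flags(deterministic: bool) -> dict:
    # Autotune kernel selection only when reproducibility is not required.
    return {"cudnn.deterministic": deterministic, "cudnn.benchmark": not deterministic}

print(resolve_cudnn_flags(False))  # default training: benchmark on
print(resolve_cudnn_flags(True))   # reproducible runs: benchmark off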
Amphion/bins/vc/Noro/train.py
ADDED
@@ -0,0 +1,82 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import argparse

import torch
from models.vc.Noro.noro_trainer import NoroTrainer
from utils.util import load_config


def build_trainer(args, cfg):
    supported_trainer = {
        "VC": NoroTrainer,
    }
    trainer_class = supported_trainer[cfg.model_type]
    trainer = trainer_class(args, cfg)
    return trainer


def cuda_relevant(deterministic=False):
    torch.cuda.empty_cache()
    # TF32 on Ampere and above
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.enabled = True
    torch.backends.cudnn.allow_tf32 = True
    # Deterministic
    torch.backends.cudnn.deterministic = deterministic
    torch.backends.cudnn.benchmark = not deterministic
    torch.use_deterministic_algorithms(deterministic)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--config",
        default="config.json",
        help="json files for configurations.",
        required=True,
    )
    parser.add_argument(
        "--exp_name",
        type=str,
        default="exp_name",
        help="A specific name to note the experiment",
        required=True,
    )
    parser.add_argument(
        "--resume", action="store_true", help="Resume training from a checkpoint"
    )
    parser.add_argument(
        "--log_level", default="warning", help="logging level (debug, info, warning)"
    )
    parser.add_argument(
        "--resume_type",
        type=str,
        default="resume",
        help="Resume training or finetuning.",
    )
    parser.add_argument(
        "--checkpoint_path",
        type=str,
        default=None,
        help="Checkpoint for resume training or finetuning.",
    )
    args = parser.parse_args()
    cfg = load_config(args.config)
    print("experiment name: ", args.exp_name)
    # CUDA settings
    cuda_relevant()
    # Build trainer
    print(f"Building {cfg.model_type} trainer")
    trainer = build_trainer(args, cfg)
    # Limit intra-/inter-op CPU threads for this process
    torch.set_num_threads(1)
    torch.set_num_interop_threads(1)
    print(f"Start training {cfg.model_type} model")
    trainer.train_loop()


if __name__ == "__main__":
    main()
Amphion/bins/vocoder/inference.py
ADDED
@@ -0,0 +1,115 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import argparse
import os

import torch

from models.vocoders.vocoder_inference import VocoderInference
from utils.util import load_config


def build_inference(args, cfg, infer_type="infer_from_dataset"):
    supported_inference = {
        "GANVocoder": VocoderInference,
        "DiffusionVocoder": VocoderInference,
    }

    inference_class = supported_inference[cfg.model_type]
    return inference_class(args, cfg, infer_type)


def cuda_relevant(deterministic=False):
    torch.cuda.empty_cache()
    # TF32 on Ampere and above
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.enabled = True
    torch.backends.cudnn.allow_tf32 = True
    # Deterministic
    torch.backends.cudnn.deterministic = deterministic
    torch.backends.cudnn.benchmark = not deterministic
    torch.use_deterministic_algorithms(deterministic)


def build_parser():
    r"""Build argument parser for inference.py.
    Anything else should be put in an extra config YAML file.
    """

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--config",
        type=str,
        required=True,
        help="JSON/YAML file for configurations.",
    )
    parser.add_argument(
        "--infer_mode",
        type=str,
        required=False,
    )
    parser.add_argument(
        "--infer_datasets",
        nargs="+",
        default=None,
    )
    parser.add_argument(
        "--feature_folder",
        type=str,
        default=None,
    )
    parser.add_argument(
        "--audio_folder",
        type=str,
        default=None,
    )
    parser.add_argument(
        "--vocoder_dir",
        type=str,
        required=True,
        help="Vocoder checkpoint directory. Searching behavior is the same as "
        "the acoustics one.",
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        default="result",
        help="Output directory. Default: ./result",
    )
    parser.add_argument(
        "--log_level",
        type=str,
        default="warning",
        help="Logging level. Default: warning",
    )
    parser.add_argument(
        "--keep_cache",
        action="store_true",
        default=False,
        help="Keep cache files. Only applicable to inference from files.",
    )
    return parser


def main():
    # Parse arguments
    args = build_parser().parse_args()

    # Parse config
    cfg = load_config(args.config)

    # CUDA settings
    cuda_relevant()

    # Build inference
    inferencer = build_inference(args, cfg, args.infer_mode)

    # Run inference
    inferencer.inference()


if __name__ == "__main__":
    main()
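build_inference() defaults to "infer_from_dataset", and --infer_mode, --infer_datasets, and --feature_folder together select the input source. A sketch of the implied dispatch (the "infer_from_feature" mode name is a placeholder assumption, not confirmed by this file):

def pick_infer_mode(infer_datasets=None, feature_folder=None):
    if infer_datasets:
        return "infer_from_dataset"
    if feature_folder:
        return "infer_from_feature"  # hypothetical mode name
    raise ValueError("provide --infer_datasets or --feature_folder")

print(pick_infer_mode(infer_datasets=["LJSpeech"]))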
Amphion/bins/vocoder/preprocess.py
ADDED
@@ -0,0 +1,151 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import faulthandler

faulthandler.enable()

import os
import argparse
import json
import pyworld as pw
from multiprocessing import cpu_count


from utils.util import load_config
from preprocessors.processor import preprocess_dataset, prepare_align
from preprocessors.metadata import cal_metadata
from processors import acoustic_extractor, content_extractor, data_augment


def extract_acoustic_features(dataset, output_path, cfg, n_workers=1):
    """Extract acoustic features of utterances in the dataset

    Args:
        dataset (str): name of dataset, e.g. opencpop
        output_path (str): directory that stores train, test and feature files of datasets
        cfg (dict): dictionary that stores configurations
        n_workers (int, optional): num of processes to extract features in parallel. Defaults to 1.
    """
    types = ["train", "test"] if "eval" not in dataset else ["test"]
    metadata = []
    for dataset_type in types:
        dataset_output = os.path.join(output_path, dataset)
        dataset_file = os.path.join(dataset_output, "{}.json".format(dataset_type))
        with open(dataset_file, "r") as f:
            metadata.extend(json.load(f))

    acoustic_extractor.extract_utt_acoustic_features_serial(
        metadata, dataset_output, cfg
    )


def preprocess(cfg, args):
    """Preprocess raw data of single or multiple datasets (in cfg.dataset)

    Args:
        cfg (dict): dictionary that stores configurations
        args (ArgumentParser): specify the configuration file and num_workers
    """
    # Specify the output root path to save the processed data
    output_path = cfg.preprocess.processed_dir
    os.makedirs(output_path, exist_ok=True)

    # Split train and test sets
    for dataset in cfg.dataset:
        print("Preprocess {}...".format(dataset))

        preprocess_dataset(
            dataset,
            cfg.dataset_path[dataset],
            output_path,
            cfg.preprocess,
            cfg.task_type,
            is_custom_dataset=dataset in cfg.use_custom_dataset,
        )

    # Data augmentation: create new wav files with pitch shift, formant shift, equalizer, time stretch
    try:
        assert isinstance(
            cfg.preprocess.data_augment, list
        ), "Please provide a list of datasets that need to be augmented."
        if len(cfg.preprocess.data_augment) > 0:
            new_datasets_list = []
            for dataset in cfg.preprocess.data_augment:
                new_datasets = data_augment.augment_dataset(cfg, dataset)
                new_datasets_list.extend(new_datasets)
            cfg.dataset.extend(new_datasets_list)
            print("Augmentation datasets: ", cfg.dataset)
    except:
        print("No Data Augmentation.")

    # Dump metadata of datasets (singers, train/test durations, etc.)
    cal_metadata(cfg)

    # Prepare the acoustic features
    for dataset in cfg.dataset:
        # Skip augmented datasets which do not need to extract acoustic features
        # We will copy acoustic features from the original dataset later
        if (
            "pitch_shift" in dataset
            or "formant_shift" in dataset
            or "equalizer" in dataset
        ):
            continue
        print(
            "Extracting acoustic features for {} using {} workers ...".format(
                dataset, args.num_workers
            )
        )
        extract_acoustic_features(dataset, output_path, cfg, args.num_workers)
        # Calculate the statistics of acoustic features
        if cfg.preprocess.mel_min_max_norm:
            acoustic_extractor.cal_mel_min_max(dataset, output_path, cfg)

    # Copy acoustic features for augmented datasets by creating soft-links
    for dataset in cfg.dataset:
        if "pitch_shift" in dataset:
            src_dataset = dataset.replace("_pitch_shift", "")
            src_dataset_dir = os.path.join(output_path, src_dataset)
        elif "formant_shift" in dataset:
            src_dataset = dataset.replace("_formant_shift", "")
            src_dataset_dir = os.path.join(output_path, src_dataset)
        elif "equalizer" in dataset:
            src_dataset = dataset.replace("_equalizer", "")
            src_dataset_dir = os.path.join(output_path, src_dataset)
        else:
            continue
        dataset_dir = os.path.join(output_path, dataset)
        metadata = []
        for split in ["train", "test"] if not "eval" in dataset else ["test"]:
            metadata_file_path = os.path.join(src_dataset_dir, "{}.json".format(split))
            with open(metadata_file_path, "r") as f:
                metadata.extend(json.load(f))
        print("Copying acoustic features for {}...".format(dataset))
        acoustic_extractor.copy_acoustic_features(
            metadata, dataset_dir, src_dataset_dir, cfg
        )
        if cfg.preprocess.mel_min_max_norm:
            acoustic_extractor.cal_mel_min_max(dataset, output_path, cfg)

        if cfg.preprocess.extract_pitch:
            acoustic_extractor.cal_pitch_statistics(dataset, output_path, cfg)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--config", default="config.json", help="json files for configurations."
    )
    parser.add_argument("--num_workers", type=int, default=int(cpu_count()))

    args = parser.parse_args()
    cfg = load_config(args.config)

    preprocess(cfg, args)


if __name__ == "__main__":
    main()
Amphion/bins/vocoder/train.py
ADDED
@@ -0,0 +1,93 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import argparse

import torch

from models.vocoders.gan.gan_vocoder_trainer import GANVocoderTrainer
from models.vocoders.diffusion.diffusion_vocoder_trainer import DiffusionVocoderTrainer

from utils.util import load_config


def build_trainer(args, cfg):
    supported_trainer = {
        "GANVocoder": GANVocoderTrainer,
        "DiffusionVocoder": DiffusionVocoderTrainer,
    }

    trainer_class = supported_trainer[cfg.model_type]
    trainer = trainer_class(args, cfg)
    return trainer


def cuda_relevant(deterministic=False):
    torch.cuda.empty_cache()
    # TF32 on Ampere and above
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.enabled = True
    torch.backends.cudnn.allow_tf32 = True
    # Deterministic
    torch.backends.cudnn.deterministic = deterministic
    torch.backends.cudnn.benchmark = not deterministic
    torch.use_deterministic_algorithms(deterministic)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--config",
        default="config.json",
        help="json files for configurations.",
        required=True,
    )
    parser.add_argument(
        "--exp_name",
        type=str,
        default="exp_name",
        help="A specific name to note the experiment",
        required=True,
    )
    parser.add_argument(
        "--resume_type",
        type=str,
        help="resume to continue training, finetune for finetuning",
    )
    parser.add_argument(
        "--checkpoint",
        type=str,
        help="checkpoint to resume",
    )
    parser.add_argument(
        "--log_level", default="warning", help="logging level (debug, info, warning)"
    )
    args = parser.parse_args()
    cfg = load_config(args.config)

    # Data Augmentation
    if cfg.preprocess.data_augment:
        new_datasets_list = []
        for dataset in cfg.preprocess.data_augment:
            new_datasets = [
                # f"{dataset}_pitch_shift",
                # f"{dataset}_formant_shift",
                f"{dataset}_equalizer",
                f"{dataset}_time_stretch",
            ]
            new_datasets_list.extend(new_datasets)
        cfg.dataset.extend(new_datasets_list)

    # CUDA settings
    cuda_relevant()

    # Build trainer
    trainer = build_trainer(args, cfg)

    trainer.train_loop()


if __name__ == "__main__":
    main()
Amphion/config/audioldm.json
ADDED
@@ -0,0 +1,92 @@
{
    "base_config": "config/base.json",
    "model_type": "AudioLDM",
    "task_type": "tta",
    "dataset": [
        "AudioCaps"
    ],
    "preprocess": {
        // feature used for model training
        "use_spkid": false,
        "use_uv": false,
        "use_frame_pitch": false,
        "use_phone_pitch": false,
        "use_frame_energy": false,
        "use_phone_energy": false,
        "use_mel": false,
        "use_audio": false,
        "use_label": false,
        "use_one_hot": false,
        "cond_mask_prob": 0.1
    },
    // model
    "model": {
        "audioldm": {
            "image_size": 32,
            "in_channels": 4,
            "out_channels": 4,
            "model_channels": 256,
            "attention_resolutions": [
                4,
                2,
                1
            ],
            "num_res_blocks": 2,
            "channel_mult": [
                1,
                2,
                4
            ],
            "num_heads": 8,
            "use_spatial_transformer": true,
            "transformer_depth": 1,
            "context_dim": 768,
            "use_checkpoint": true,
            "legacy": false
        },
        "autoencoderkl": {
            "ch": 128,
            "ch_mult": [
                1,
                1,
                2,
                2,
                4
            ],
            "num_res_blocks": 2,
            "in_channels": 1,
            "z_channels": 4,
            "out_ch": 1,
            "double_z": true
        },
        "noise_scheduler": {
            "num_train_timesteps": 1000,
            "beta_start": 0.00085,
            "beta_end": 0.012,
            "beta_schedule": "scaled_linear",
            "clip_sample": false,
            "steps_offset": 1,
            "set_alpha_to_one": false,
            "skip_prk_steps": true,
            "prediction_type": "epsilon"
        }
    },
    // train
    "train": {
        "lronPlateau": {
            "factor": 0.9,
            "patience": 100,
            "min_lr": 4.0e-5,
            "verbose": true
        },
        "adam": {
            "lr": 5.0e-5,
            "betas": [
                0.9,
                0.999
            ],
            "weight_decay": 1.0e-2,
            "eps": 1.0e-8
        }
    }
}
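The cond_mask_prob of 0.1 suggests classifier-free-guidance-style training: the text condition is dropped for roughly 10% of examples so the diffusion model also learns an unconditional branch. A sketch of the usual mechanism (names are illustrative, not this repo's API):

import random

def maybe_mask_condition(text_embedding, cond_mask_prob=0.1):
    # With probability cond_mask_prob, replace the condition with a null token.
    if random.random() < cond_mask_prob:
        return None  # stand-in for the learned "unconditional" embedding
    return text_embedding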
Amphion/config/autoencoderkl.json
ADDED
@@ -0,0 +1,69 @@
{
    "base_config": "config/base.json",
    "model_type": "AutoencoderKL",
    "task_type": "tta",
    "dataset": [
        "AudioCaps"
    ],
    "preprocess": {
        // feature used for model training
        "use_spkid": false,
        "use_uv": false,
        "use_frame_pitch": false,
        "use_phone_pitch": false,
        "use_frame_energy": false,
        "use_phone_energy": false,
        "use_mel": false,
        "use_audio": false,
        "use_label": false,
        "use_one_hot": false
    },
    // model
    "model": {
        "autoencoderkl": {
            "ch": 128,
            "ch_mult": [
                1,
                1,
                2,
                2,
                4
            ],
            "num_res_blocks": 2,
            "in_channels": 1,
            "z_channels": 4,
            "out_ch": 1,
            "double_z": true
        },
        "loss": {
            "kl_weight": 1e-8,
            "disc_weight": 0.5,
            "disc_factor": 1.0,
            "logvar_init": 0.0,
            "min_adapt_d_weight": 0.0,
            "max_adapt_d_weight": 10.0,
            "disc_start": 50001,
            "disc_in_channels": 1,
            "disc_num_layers": 3,
            "use_actnorm": false
        }
    },
    // train
    "train": {
        "lronPlateau": {
            "factor": 0.9,
            "patience": 100,
            "min_lr": 4.0e-5,
            "verbose": true
        },
        "adam": {
            "lr": 4.0e-4,
            "betas": [
                0.9,
                0.999
            ],
            "weight_decay": 1.0e-2,
            "eps": 1.0e-8
        }
    }
}
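The loss block pairs a nearly-disabled KL term (kl_weight 1e-8) with an adversarial term that only activates after disc_start steps. A sketch of how such weights typically combine in a VAE-GAN objective (function and argument names are illustrative, not the repo's code):

def generator_loss(rec, kl, g_adv, step,
                   kl_weight=1e-8, disc_weight=0.5, disc_factor=1.0, disc_start=50001):
    # Adversarial pressure stays off until the discriminator has warmed up.
    adv = disc_factor * disc_weight * g_adv if step >= disc_start else 0.0
    return rec + kl_weight * kl + adv

print(generator_loss(rec=1.0, kl=2.0e4, g_adv=0.3, step=60000))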
Amphion/config/base.json
ADDED
@@ -0,0 +1,185 @@
{
    "supported_model_type": [
        "GANVocoder",
        "Fastspeech2",
        "DiffSVC",
        "Transformer",
        "EDM",
        "CD"
    ],
    "task_type": "",
    "dataset": [],
    "use_custom_dataset": [],
    "preprocess": {
        "phone_extractor": "espeak", // "espeak, pypinyin, pypinyin_initials_finals, lexicon"
        // trim audio silence
        "data_augment": false,
        "trim_silence": false,
        "num_silent_frames": 8,
        "trim_fft_size": 512, // fft size used in trimming
        "trim_hop_size": 128, // hop size used in trimming
        "trim_top_db": 30, // top db used in trimming, sensitive to each dataset
        // acoustic features
        "extract_mel": false,
        "mel_extract_mode": "",
        "extract_linear_spec": false,
        "extract_mcep": false,
        "extract_pitch": false,
        "extract_acoustic_token": false,
        "pitch_remove_outlier": false,
        "extract_uv": false,
        "pitch_norm": false,
        "extract_audio": false,
        "extract_label": false,
        "pitch_extractor": "parselmouth", // pyin, dio, pyworld, pyreaper, parselmouth, CWT (Continuous Wavelet Transform)
        "extract_energy": false,
        "energy_remove_outlier": false,
        "energy_norm": false,
        "energy_extract_mode": "from_mel",
        "extract_duration": false,
        "extract_amplitude_phase": false,
        "mel_min_max_norm": false,
        // linguistic features
        "extract_phone": false,
        "lexicon_path": "./text/lexicon/librispeech-lexicon.txt",
        // content features
        "extract_whisper_feature": false,
        "extract_contentvec_feature": false,
        "extract_mert_feature": false,
        "extract_wenet_feature": false,
        // Settings for data preprocessing
        "n_mel": 80,
        "win_size": 480,
        "hop_size": 120,
        "sample_rate": 24000,
        "n_fft": 1024,
        "fmin": 0,
        "fmax": 12000,
        "min_level_db": -115,
        "ref_level_db": 20,
        "bits": 8,
        // Directory names of processed data or extracted features
        "processed_dir": "processed_data",
        "trimmed_wav_dir": "trimmed_wavs", // directory name of silence-trimmed wav
        "raw_data": "raw_data",
        "phone_dir": "phones",
        "wav_dir": "wavs", // directory name of processed wav (such as downsampled waveform)
        "audio_dir": "audios",
        "log_amplitude_dir": "log_amplitudes",
        "phase_dir": "phases",
        "real_dir": "reals",
        "imaginary_dir": "imaginarys",
        "label_dir": "labels",
        "linear_dir": "linears",
        "mel_dir": "mels", // directory name of extracted mel features
        "mcep_dir": "mcep", // directory name of extracted mcep features
        "dur_dir": "durs",
        "symbols_dict": "symbols.dict",
        "lab_dir": "labs", // directory name of extracted label features
        "wenet_dir": "wenet", // directory name of extracted wenet features
        "contentvec_dir": "contentvec", // directory name of extracted contentvec features
        "pitch_dir": "pitches", // directory name of extracted pitch features
        "energy_dir": "energys", // directory name of extracted energy features
        "phone_pitch_dir": "phone_pitches", // directory name of extracted phone-level pitch features
        "phone_energy_dir": "phone_energys", // directory name of extracted phone-level energy features
        "uv_dir": "uvs", // directory name of extracted unvoiced features
        "duration_dir": "duration", // ground-truth duration file
        "phone_seq_file": "phone_seq_file", // phoneme sequence file
        "file_lst": "file.lst",
        "train_file": "train.json", // training set; the json file contains detailed information about the dataset, including dataset name, utterance id, duration of the utterance
        "valid_file": "valid.json", // validation set
        "spk2id": "spk2id.json", // used for multi-speaker dataset
        "utt2spk": "utt2spk", // used for multi-speaker dataset
        "emo2id": "emo2id.json", // used for multi-emotion dataset
        "utt2emo": "utt2emo", // used for multi-emotion dataset
        // Features used for model training
        "use_text": false,
        "use_phone": false,
        "use_phn_seq": false,
        "use_lab": false,
        "use_linear": false,
        "use_mel": false,
        "use_min_max_norm_mel": false,
        "use_wav": false,
        "use_phone_pitch": false,
        "use_log_scale_pitch": false,
        "use_phone_energy": false,
        "use_phone_duration": false,
        "use_log_scale_energy": false,
        "use_wenet": false,
        "use_dur": false,
        "use_spkid": false, // True: use speaker id for multi-speaker dataset
        "use_emoid": false, // True: use emotion id for multi-emotion dataset
        "use_frame_pitch": false,
        "use_uv": false,
        "use_frame_energy": false,
        "use_frame_duration": false,
        "use_audio": false,
        "use_label": false,
        "use_one_hot": false,
        "use_amplitude_phase": false,
        "align_mel_duration": false
    },
    "train": {
        "ddp": true,
        "batch_size": 16,
        "max_steps": 1000000,
        // Trackers
        "tracker": [
            "tensorboard"
            // "wandb",
            // "cometml",
            // "mlflow",
        ],
        "max_epoch": -1, // -1 means no limit
        "save_checkpoint_stride": [
            5,
            20
        ], // unit is epoch
        "keep_last": [
            3,
            -1
        ], // -1 means infinite; a single number will broadcast
        "run_eval": [
            false,
            true
        ], // a single number will broadcast
        // Fix the random seed
        "random_seed": 10086,
        // Optimizer
        "optimizer": "AdamW",
        "adamw": {
            "lr": 4.0e-4 // nn model lr
        },
        // LR Scheduler
        "scheduler": "ReduceLROnPlateau",
        "reducelronplateau": {
            "factor": 0.8,
            "patience": 10, // unit is epoch
            "min_lr": 1.0e-4
        },
        // Batchsampler
        "sampler": {
            "holistic_shuffle": true,
            "drop_last": true
        },
        // Dataloader
        "dataloader": {
            "num_worker": 32,
            "pin_memory": true
        },
        "gradient_accumulation_step": 1,
        "total_training_steps": 50000,
        "save_summary_steps": 500,
        "save_checkpoints_steps": 10000,
        "valid_interval": 10000,
        "keep_checkpoint_max": 5,
        "multi_speaker_training": false // True: train multi-speaker model; False: train single-speaker model
    }
}
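save_checkpoint_stride, keep_last, and run_eval are parallel lists, and the comments state a single value broadcasts across all strides. A sketch of that pairing rule (illustrative, not the repo's code):

def broadcast(values, n):
    return values * n if len(values) == 1 else values

strides = [5, 20]
keep = broadcast([3, -1], len(strides))     # paired: keep 3 of the 5-epoch ckpts, all 20-epoch ckpts
run_eval = broadcast([True], len(strides))  # one flag -> applies to both strides
print(list(zip(strides, keep, run_eval)))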
Amphion/config/comosvc.json
ADDED
@@ -0,0 +1,215 @@
{
    "base_config": "config/svc/base.json",
    "model_type": "DiffComoSVC",
    "task_type": "svc",
    "preprocess": {
        // data augmentations
        "use_pitch_shift": false,
        "use_formant_shift": false,
        "use_time_stretch": false,
        "use_equalizer": false,
        // acoustic features
        "extract_mel": true,
        "mel_min_max_norm": true,
        "extract_pitch": true,
        "pitch_extractor": "parselmouth",
        "extract_uv": true,
        "extract_energy": true,
        // content features
        "extract_whisper_feature": false,
        "whisper_sample_rate": 16000,
        "extract_contentvec_feature": false,
        "contentvec_sample_rate": 16000,
        "extract_wenet_feature": false,
        "wenet_sample_rate": 16000,
        "extract_mert_feature": false,
        "mert_sample_rate": 16000,
        // Default config for whisper
        "whisper_frameshift": 0.01,
        "whisper_downsample_rate": 2,
        // Default config for content vector
        "contentvec_frameshift": 0.02,
        // Default config for mert
        "mert_model": "m-a-p/MERT-v1-330M",
        "mert_feature_layer": -1,
        "mert_hop_size": 320, // 24k
        "mert_frameshit": 0.01333, // 10ms
        "wenet_frameshift": 0.01,
        "wenet_downsample_rate": 4, // wenetspeech is 4, gigaspeech is 6
        // Default config
        "n_mel": 100,
        "win_size": 1024, // todo
        "hop_size": 256,
        "sample_rate": 24000,
        "n_fft": 1024, // todo
        "fmin": 0,
        "fmax": 12000, // todo
        "f0_min": 50, // ~C2
        "f0_max": 1100, // ~C6(1100), ~G5(800)
        "pitch_bin": 256,
        "pitch_max": 1100.0,
        "pitch_min": 50.0,
        "is_label": true,
        "is_mu_law": true,
        "bits": 8,
        "mel_min_max_stats_dir": "mel_min_max_stats",
        "whisper_dir": "whisper",
        "contentvec_dir": "contentvec",
        "wenet_dir": "wenet",
        "mert_dir": "mert",
        // Extract content features using dataloader
        "pin_memory": true,
        "num_workers": 8,
        "content_feature_batch_size": 16,
        // Features used for model training
        "use_mel": true,
        "use_min_max_norm_mel": true,
        "use_frame_pitch": true,
        "use_uv": true,
        "use_frame_energy": true,
        "use_log_scale_pitch": false,
        "use_log_scale_energy": false,
        "use_spkid": true,
        // Meta files
        "train_file": "train.json",
        "valid_file": "test.json",
        "spk2id": "singers.json",
        "utt2spk": "utt2singer"
    },
    "model": {
        "teacher_model_path": "[Your Teacher Model Path].bin",
        "condition_encoder": {
            "merge_mode": "add",
            "input_melody_dim": 1,
            "use_log_f0": true,
            "n_bins_melody": 256, // quantization (0 for no quantization)
            "output_melody_dim": 384,
            "input_loudness_dim": 1,
            "use_log_loudness": true,
            "n_bins_loudness": 256,
            "output_loudness_dim": 384,
            "use_whisper": false,
            "use_contentvec": false,
            "use_wenet": false,
            "use_mert": false,
            "whisper_dim": 1024,
            "contentvec_dim": 256,
            "mert_dim": 256,
            "wenet_dim": 512,
            "content_encoder_dim": 384,
            "output_singer_dim": 384,
            "singer_table_size": 512,
            "output_content_dim": 384,
            "use_spkid": true
        },
        "comosvc": {
            "distill": false,
            // conformer encoder
            "input_dim": 384,
            "output_dim": 100,
            "n_heads": 2,
            "n_layers": 6,
            "filter_channels": 512,
            "dropout": 0.1,
            // karras diffusion
            "P_mean": -1.2,
            "P_std": 1.2,
            "sigma_data": 0.5,
            "sigma_min": 0.002,
            "sigma_max": 80,
            "rho": 7,
            "n_timesteps": 18
        },
        "diffusion": {
            // Diffusion step encoder
            "step_encoder": {
                "dim_raw_embedding": 128,
                "dim_hidden_layer": 512,
                "activation": "SiLU",
                "num_layer": 2,
                "max_period": 10000
            },
            // Diffusion decoder
            "model_type": "bidilconv", // bidilconv, unet2d, TODO: unet1d
            "bidilconv": {
                "base_channel": 384,
                "n_res_block": 20,
                "conv_kernel_size": 3,
                "dilation_cycle_length": 4, // specially, 1 means no dilation
                "conditioner_size": 100
            }
        }
    },
    "train": {
        // Basic settings
        "fast_steps": 0,
        "batch_size": 64,
        "gradient_accumulation_step": 1,
        "max_epoch": -1, // -1 means no limit
        "save_checkpoint_stride": [
            10,
            100
        ], // unit is epoch
        "keep_last": [
            3,
            -1
        ], // -1 means infinite; a single number will broadcast
        "run_eval": [
            false,
            true
        ], // a single number will broadcast
        // Fix the random seed
        "random_seed": 10086,
        // Batchsampler
        "sampler": {
            "holistic_shuffle": true,
            "drop_last": true
        },
        // Dataloader
        "dataloader": {
            "num_worker": 32,
            "pin_memory": true
        },
        // Trackers
        "tracker": [
            "tensorboard"
            // "wandb",
            // "cometml",
            // "mlflow",
        ],
        // Optimizer
        "optimizer": "AdamW",
        "adamw": {
            "lr": 5.0e-5 // nn model lr
        },
        // LR Scheduler
        "scheduler": "ReduceLROnPlateau",
        "reducelronplateau": {
            "factor": 0.8,
            "patience": 10, // unit is epoch
            "min_lr": 5.0e-6
        }
    },
    "inference": {
        "comosvc": {
            "inference_steps": 40
        }
    }
}
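The sigma_min / sigma_max / rho fields parameterize the Karras et al. noise schedule used in consistency-model-style diffusion such as CoMoSVC. A sketch of the schedule those values imply (the standard formula, not necessarily this repo's exact code):

def karras_sigmas(n=18, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    # sigma_i = (max^(1/rho) + i/(n-1) * (min^(1/rho) - max^(1/rho)))^rho
    hi, lo = sigma_max ** (1 / rho), sigma_min ** (1 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)]

sigmas = karras_sigmas()
print(round(sigmas[0], 1), round(sigmas[-1], 4))  # 80.0 0.002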
Amphion/config/facodec.json
ADDED
@@ -0,0 +1,67 @@
{
    "exp_name": "facodec",
    "model_type": "FAcodec",
    "log_dir": "./runs/",
    "log_interval": 10,
    "save_interval": 1000,
    "device": "cuda",
    "epochs": 1000,
    "batch_size": 4,
    "batch_length": 100,
    "max_len": 80,
    "pretrained_model": "",
    "load_only_params": false,
    "F0_path": "modules/JDC/bst.t7",
    "dataset": "dummy",
    "preprocess_params": {
        "sr": 24000,
        "frame_rate": 80,
        "duration_range": [1.0, 25.0],
        "spect_params": {
            "n_fft": 2048,
            "win_length": 1200,
            "hop_length": 300,
            "n_mels": 80,
        },
    },
    "train": {
        "gradient_accumulation_step": 1,
        "batch_size": 1,
        "save_checkpoint_stride": [20],
        "random_seed": 1234,
        "max_epoch": -1,
        "max_frame_len": 80,
        "tracker": ["tensorboard"],
        "run_eval": [false],
        "sampler": {"holistic_shuffle": true, "drop_last": true},
        "dataloader": {"num_worker": 0, "pin_memory": true},
    },
    "model_params": {
        "causal": true,
        "lstm": 2,
        "norm_f0": true,
        "use_gr_content_f0": false,
        "use_gr_prosody_phone": false,
        "use_gr_timbre_prosody": false,
        "separate_prosody_encoder": true,
        "n_c_codebooks": 2,
        "timbre_norm": true,
        "use_gr_content_global_f0": true,
        "DAC": {
            "encoder_dim": 64,
            "encoder_rates": [2, 5, 5, 6],
            "decoder_dim": 1536,
            "decoder_rates": [6, 5, 5, 2],
            "sr": 24000,
        },
    },
    "loss_params": {
        "base_lr": 0.0001,
        "warmup_steps": 200,
        "discriminator_iter_start": 2000,
        "lambda_spk": 1.0,
        "lambda_mel": 45,
        "lambda_f0": 1.0,
        "lambda_uv": 1.0,
    },
}
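The 80 Hz frame_rate in preprocess_params is consistent with the DAC strides: total encoder downsampling is the product of encoder_rates. A quick check (illustrative only):

from math import prod

sr = 24000
encoder_rates = [2, 5, 5, 6]
hop = prod(encoder_rates)  # 300 samples per codec frame
print(sr / hop)            # 80.0 frames per second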
Amphion/config/fs2.json
ADDED
@@ -0,0 +1,120 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
{
    "base_config": "config/tts.json",
    "model_type": "FastSpeech2",
    "task_type": "tts",
    "dataset": ["LJSpeech"],
    "preprocess": {
        // acoustic features
        "extract_audio": true,
        "extract_mel": true,
        "mel_extract_mode": "taco",
        "mel_min_max_norm": false,
        "extract_pitch": true,
        "extract_uv": false,
        "pitch_extractor": "dio",
        "extract_energy": true,
        "energy_extract_mode": "from_tacotron_stft",
        "extract_duration": true,
        "use_phone": false,
        "pitch_norm": true,
        "energy_norm": true,
        "pitch_remove_outlier": true,
        "energy_remove_outlier": true,

        // Default config
        "n_mel": 80,
        "win_size": 1024, // todo
        "hop_size": 256,
        "sample_rate": 22050,
        "n_fft": 1024, // todo
        "fmin": 0,
        "fmax": 8000, // todo
        "raw_data": "raw_data",
        "text_cleaners": ["english_cleaners"],
        "f0_min": 71, // ~C2
        "f0_max": 800, //1100, // ~C6(1100), ~G5(800)
        "pitch_bin": 256,
        "pitch_max": 1100.0,
        "pitch_min": 50.0,
        "is_label": true,
        "is_mu_law": true,
        "bits": 8,

        "mel_min_max_stats_dir": "mel_min_max_stats",
        "whisper_dir": "whisper",
        "content_vector_dir": "content_vector",
        "wenet_dir": "wenet",
        "mert_dir": "mert",
        "spk2id": "spk2id.json",
        "utt2spk": "utt2spk",
        "valid_file": "test.json",

        // Features used for model training
        "use_mel": true,
        "use_min_max_norm_mel": false,
        "use_frame_pitch": false,
        "use_frame_energy": false,
        "use_phone_pitch": true,
        "use_phone_energy": true,
        "use_log_scale_pitch": false,
        "use_log_scale_energy": false,
        "use_spkid": false,
        "align_mel_duration": true,
        "text_cleaners": ["english_cleaners"],
        "phone_extractor": "lexicon", // "espeak, pypinyin, pypinyin_initials_finals, lexicon (only for language=en-us right now)"
    },
    "model": {
        // Settings for transformer
        "transformer": {
            "encoder_layer": 4,
            "encoder_head": 2,
            "encoder_hidden": 256,
            "decoder_layer": 6,
            "decoder_head": 2,
            "decoder_hidden": 256,
            "conv_filter_size": 1024,
            "conv_kernel_size": [9, 1],
            "encoder_dropout": 0.2,
            "decoder_dropout": 0.2
        },

        // Settings for variance_predictor
        "variance_predictor": {
            "filter_size": 256,
            "kernel_size": 3,
            "dropout": 0.5
        },
        "variance_embedding": {
            "pitch_quantization": "linear", // support 'linear' or 'log', 'log' is allowed only if the pitch values are not normalized during preprocessing
            "energy_quantization": "linear", // support 'linear' or 'log', 'log' is allowed only if the energy values are not normalized during preprocessing
            "n_bins": 256
        },
        "max_seq_len": 1000
    },
    "train": {
        "batch_size": 16,
        "max_epoch": 100,
        "sort_sample": true,
        "drop_last": true,
        "group_size": 4,
        "grad_clip_thresh": 1.0,
        "dataloader": {
            "num_worker": 8,
            "pin_memory": true
        },
        "lr_scheduler": {
            "num_warmup": 4000
        },
        // LR Scheduler
        "scheduler": "NoamLR",
        // Optimizer
        "optimizer": "Adam",
        "adam": {
            "lr": 0.0625,
            "betas": [0.9, 0.98],
            "eps": 0.000000001,
            "weight_decay": 0.0
        },
    }

}
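These configs are JSONC-style (with `//` comments) and chain through `base_config` (`config/fs2.json` → `config/tts.json` → `config/base.json`), each file only overriding fields of its base. A minimal sketch of how such a chain can be resolved is shown below; the `load_config` helper, its comment-stripping regexes, and the merge rule are illustrative assumptions, not Amphion's actual loader.

```python
import json
import re


def _strip_jsonc(text: str) -> str:
    # Naive JSONC cleanup (assumes no "//" inside string values):
    # drop // comments, then trailing commas before } or ].
    text = re.sub(r"//[^\n]*", "", text)
    return re.sub(r",(\s*[}\]])", r"\1", text)


def deep_merge(base: dict, override: dict) -> dict:
    """Return base updated by override, merging nested dicts key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(path: str) -> dict:
    """Recursively resolve a config along its base_config chain."""
    with open(path, encoding="utf-8") as f:
        cfg = json.loads(_strip_jsonc(f.read()))
    base_path = cfg.pop("base_config", None)
    return cfg if base_path is None else deep_merge(load_config(base_path), cfg)
```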
Amphion/config/jets.json
ADDED
@@ -0,0 +1,120 @@
{
    "base_config": "config/tts.json",
    "model_type": "Jets",
    "task_type": "tts",
    "dataset": ["LJSpeech"],
    "preprocess": {
        // acoustic features
        "extract_audio": true,
        "extract_mel": true,
        "mel_extract_mode": "taco",
        "mel_min_max_norm": false,
        "extract_pitch": true,
        "extract_uv": false,
        "pitch_extractor": "dio",
        "extract_energy": true,
        "energy_extract_mode": "from_tacotron_stft",
        "extract_duration": true,
        "use_phone": false,
        "pitch_norm": true,
        "energy_norm": true,
        "pitch_remove_outlier": true,
        "energy_remove_outlier": true,

        // Default config
        "n_mel": 80,
        "win_size": 1024, // todo
        "hop_size": 256,
        "sample_rate": 22050,
        "n_fft": 1024, // todo
        "fmin": 0,
        "fmax": 8000, // todo
        "raw_data": "raw_data",
        "text_cleaners": ["english_cleaners"],
        "f0_min": 71, // ~C2
        "f0_max": 800, //1100, // ~C6(1100), ~G5(800)
        "pitch_bin": 256,
        "pitch_max": 1100.0,
        "pitch_min": 50.0,
        "is_label": true,
        "is_mu_law": true,
        "bits": 8,

        "mel_min_max_stats_dir": "mel_min_max_stats",
        "whisper_dir": "whisper",
        "content_vector_dir": "content_vector",
        "wenet_dir": "wenet",
        "mert_dir": "mert",
        "spk2id": "spk2id.json",
        "utt2spk": "utt2spk",
        "valid_file": "test.json",

        // Features used for model training
        "use_mel": true,
        "use_min_max_norm_mel": false,
        "use_frame_pitch": true,
        "use_frame_energy": true,
        "use_phone_pitch": false,
        "use_phone_energy": false,
        "use_log_scale_pitch": false,
        "use_log_scale_energy": false,
        "use_spkid": false,
        "align_mel_duration": true,
        "text_cleaners": ["english_cleaners"],
        "phone_extractor": "lexicon", // "espeak, pypinyin, pypinyin_initials_finals, lexicon (only for language=en-us right now)"
    },
    "model": {
        // Settings for transformer
        "transformer": {
            "encoder_layer": 4,
            "encoder_head": 2,
            "encoder_hidden": 256,
            "decoder_layer": 6,
            "decoder_head": 2,
            "decoder_hidden": 256,
            "conv_filter_size": 1024,
            "conv_kernel_size": [9, 1],
            "encoder_dropout": 0.2,
            "decoder_dropout": 0.2
        },

        // Settings for variance_predictor
        "variance_predictor": {
            "filter_size": 256,
            "kernel_size": 3,
            "dropout": 0.5
        },
        "variance_embedding": {
            "pitch_quantization": "linear", // support 'linear' or 'log', 'log' is allowed only if the pitch values are not normalized during preprocessing
            "energy_quantization": "linear", // support 'linear' or 'log', 'log' is allowed only if the energy values are not normalized during preprocessing
            "n_bins": 256
        },
        "max_seq_len": 1000
    },
    "train": {
        "batch_size": 16,
        "max_epoch": 100,
        "sort_sample": true,
        "drop_last": true,
        "group_size": 4,
        "grad_clip_thresh": 1.0,
        "dataloader": {
            "num_worker": 8,
            "pin_memory": true
        },
        "lr_scheduler": {
            "num_warmup": 4000
        },
        // LR Scheduler
        "scheduler": "NoamLR",
        // Optimizer
        "optimizer": "Adam",
        "adam": {
            "lr": 0.0625,
            "betas": [0.9, 0.98],
            "eps": 0.000000001,
            "weight_decay": 0.0
        },
    }

}
Amphion/config/noro.json
ADDED
@@ -0,0 +1,76 @@
{
    "base_config": "config/base.json",
    "model_type": "VC",
    "dataset": ["mls"],
    "model": {
        "reference_encoder": {
            "encoder_layer": 6,
            "encoder_hidden": 512,
            "encoder_head": 8,
            "conv_filter_size": 2048,
            "conv_kernel_size": 9,
            "encoder_dropout": 0.2,
            "use_skip_connection": false,
            "use_new_ffn": true,
            "ref_in_dim": 80,
            "ref_out_dim": 512,
            "use_query_emb": true,
            "num_query_emb": 32
        },
        "diffusion": {
            "beta_min": 0.05,
            "beta_max": 20,
            "sigma": 1.0,
            "noise_factor": 1.0,
            "ode_solve_method": "euler",
            "diff_model_type": "WaveNet",
            "diff_wavenet": {
                "input_size": 80,
                "hidden_size": 512,
                "out_size": 80,
                "num_layers": 47,
                "cross_attn_per_layer": 3,
                "dilation_cycle": 2,
                "attn_head": 8,
                "drop_out": 0.2
            }
        },
        "prior_encoder": {
            "encoder_layer": 6,
            "encoder_hidden": 512,
            "encoder_head": 8,
            "conv_filter_size": 2048,
            "conv_kernel_size": 9,
            "encoder_dropout": 0.2,
            "use_skip_connection": false,
            "use_new_ffn": true,
            "vocab_size": 256,
            "cond_dim": 512,
            "duration_predictor": {
                "input_size": 512,
                "filter_size": 512,
                "kernel_size": 3,
                "conv_layers": 30,
                "cross_attn_per_layer": 3,
                "attn_head": 8,
                "drop_out": 0.2
            },
            "pitch_predictor": {
                "input_size": 512,
                "filter_size": 512,
                "kernel_size": 5,
                "conv_layers": 30,
                "cross_attn_per_layer": 3,
                "attn_head": 8,
                "drop_out": 0.5
            },
            "pitch_min": 50,
            "pitch_max": 1100,
            "pitch_bins_num": 512
        },
        "vc_feature": {
            "content_feature_dim": 768,
            "hidden_dim": 512
        }
    }
}
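The `pitch_min`/`pitch_max`/`pitch_bins_num` triple above implies an f0 quantization step somewhere in the pipeline. As a rough illustration only (Amphion's actual binning may differ, e.g. it may work on a log scale):

```python
import torch

# Illustrative f0 bucketing under the noro.json settings; linear bins
# are an assumption here, shown only to make the three fields concrete.
pitch_min, pitch_max, pitch_bins_num = 50.0, 1100.0, 512

f0 = torch.tensor([0.0, 110.0, 220.0, 440.0, 880.0])  # Hz; 0 = unvoiced
boundaries = torch.linspace(pitch_min, pitch_max, pitch_bins_num - 1)
pitch_ids = torch.bucketize(f0, boundaries)  # integers in [0, pitch_bins_num - 1]
```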
Amphion/config/ns2.json
ADDED
@@ -0,0 +1,88 @@
{
    "base_config": "config/base.json",
    "model_type": "NaturalSpeech2",
    "dataset": ["libritts"],
    "preprocess": {
        "use_mel": false,
        "use_code": true,
        "use_spkid": true,
        "use_pitch": true,
        "use_duration": true,
        "use_phone": true,
        "use_len": true,
        "use_cross_reference": true,
        "train_file": "train.json",
        "melspec_dir": "mel",
        "code_dir": "code",
        "pitch_dir": "pitch",
        "duration_dir": "duration",
        "clip_mode": "start"
    },
    "model": {
        "latent_dim": 128,
        "prior_encoder": {
            "vocab_size": 100,
            "pitch_min": 50,
            "pitch_max": 1100,
            "pitch_bins_num": 512,
            "encoder": {
                "encoder_layer": 6,
                "encoder_hidden": 512,
                "encoder_head": 8,
                "conv_filter_size": 2048,
                "conv_kernel_size": 9,
                "encoder_dropout": 0.2,
                "use_cln": true
            },
            "duration_predictor": {
                "input_size": 512,
                "filter_size": 512,
                "kernel_size": 3,
                "conv_layers": 30,
                "cross_attn_per_layer": 3,
                "attn_head": 8,
                "drop_out": 0.5
            },
            "pitch_predictor": {
                "input_size": 512,
                "filter_size": 512,
                "kernel_size": 5,
                "conv_layers": 30,
                "cross_attn_per_layer": 3,
                "attn_head": 8,
                "drop_out": 0.5
            }
        },
        "diffusion": {
            "wavenet": {
                "input_size": 128,
                "hidden_size": 512,
                "out_size": 128,
                "num_layers": 40,
                "cross_attn_per_layer": 3,
                "dilation_cycle": 2,
                "attn_head": 8,
                "drop_out": 0.2
            },
            "beta_min": 0.05,
            "beta_max": 20,
            "sigma": 1.0,
            "noise_factor": 1.0,
            "ode_solver": "euler"
        },
        "prompt_encoder": {
            "encoder_layer": 6,
            "encoder_hidden": 512,
            "encoder_head": 8,
            "conv_filter_size": 2048,
            "conv_kernel_size": 9,
            "encoder_dropout": 0.2,
            "use_cln": false
        },
        "query_emb": {
            "query_token_num": 32,
            "hidden_size": 512,
            "head_num": 8
        }
    }
}
Amphion/config/svc/base.json
ADDED
@@ -0,0 +1,119 @@
{
    "base_config": "config/base.json",
    "task_type": "svc",
    "preprocess": {
        // data augmentations
        "use_pitch_shift": false,
        "use_formant_shift": false,
        "use_time_stretch": false,
        "use_equalizer": false,
        // Online or offline features extraction ("offline" or "online")
        "features_extraction_mode": "offline",
        // acoustic features
        "extract_mel": true,
        "mel_min_max_norm": true,
        "extract_pitch": true,
        "pitch_extractor": "parselmouth",
        "extract_uv": true,
        "extract_energy": true,
        // content features
        "extract_whisper_feature": false,
        "whisper_sample_rate": 16000,
        "extract_contentvec_feature": false,
        "contentvec_sample_rate": 16000,
        "extract_wenet_feature": false,
        "wenet_sample_rate": 16000,
        "extract_mert_feature": false,
        "mert_sample_rate": 16000,
        // Default config for whisper
        "whisper_frameshift": 0.01,
        "whisper_downsample_rate": 2,
        // Default config for content vector
        "contentvec_frameshift": 0.02,
        // Default config for mert
        "mert_model": "m-a-p/MERT-v1-330M",
        "mert_feature_layer": -1,
        "mert_hop_size": 320,
        // 24k
        "mert_frameshit": 0.01333,
        // 10ms
        "wenet_frameshift": 0.01,
        // wenetspeech is 4, gigaspeech is 6
        "wenet_downsample_rate": 4,
        // Default config
        "n_mel": 100,
        "win_size": 1024,
        // todo
        "hop_size": 256,
        "sample_rate": 24000,
        "n_fft": 1024,
        // todo
        "fmin": 0,
        "fmax": 12000,
        // todo
        "f0_min": 50,
        // ~C2
        "f0_max": 1100,
        //1100, // ~C6(1100), ~G5(800)
        "pitch_bin": 256,
        "pitch_max": 1100.0,
        "pitch_min": 50.0,
        "is_label": true,
        "is_mu_law": true,
        "bits": 8,
        "mel_min_max_stats_dir": "mel_min_max_stats",
        "whisper_dir": "whisper",
        "contentvec_dir": "contentvec",
        "wenet_dir": "wenet",
        "mert_dir": "mert",
        // Extract content features using dataloader
        "pin_memory": true,
        "num_workers": 8,
        "content_feature_batch_size": 16,
        // Features used for model training
        "use_mel": true,
        "use_min_max_norm_mel": true,
        "use_frame_pitch": true,
        "use_uv": true,
        "use_interpolation_for_uv": false,
        "use_frame_energy": true,
        "use_log_scale_pitch": false,
        "use_log_scale_energy": false,
        "use_spkid": true,
        // Meta file
        "train_file": "train.json",
        "valid_file": "test.json",
        "spk2id": "singers.json",
        "utt2spk": "utt2singer"
    },
    "model": {
        "condition_encoder": {
            "merge_mode": "add",
            // Prosody Features
            "use_f0": true,
            "use_uv": true,
            "use_energy": true,
            // Quantization (0 for not quantization)
            "input_melody_dim": 1,
            "n_bins_melody": 256,
            "output_melody_dim": 384,
            "input_loudness_dim": 1,
            "n_bins_loudness": 256,
            "output_loudness_dim": 384,
            // Semantic Features
            "use_whisper": false,
            "use_contentvec": false,
            "use_wenet": false,
            "use_mert": false,
            "whisper_dim": 1024,
            "contentvec_dim": 256,
            "mert_dim": 256,
            "wenet_dim": 512,
            "content_encoder_dim": 384,
            // Speaker Features
            "output_singer_dim": 384,
            "singer_table_size": 512,
            "use_spkid": true
        }
    },
}
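The frameshift and downsample fields above pin down how many frames per second each content feature produces, which is what lets them be aligned to the mel frames later. The numbers work out as follows (plain arithmetic, no Amphion API):

```python
sample_rate, hop_size = 24000, 256

mel_fps = sample_rate / hop_size   # 93.75 mel frames per second
whisper_fps = 1 / (0.01 * 2)       # whisper_frameshift * downsample -> 50 fps
contentvec_fps = 1 / 0.02          # contentvec_frameshift -> 50 fps
wenet_fps = 1 / (0.01 * 4)         # wenet_frameshift * downsample -> 25 fps
print(mel_fps, whisper_fps, contentvec_fps, wenet_fps)
```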
Amphion/config/svc/diffusion.json
ADDED
@@ -0,0 +1,142 @@
{
    "base_config": "config/svc/base.json",
    "model": {
        "condition_encoder": {
            "merge_mode": "add",
            // Prosody Features
            "use_f0": true,
            "use_uv": true,
            "use_energy": true,
            // Quantization (0 for not quantization)
            "input_melody_dim": 1,
            "n_bins_melody": 256,
            "output_melody_dim": 384,
            "input_loudness_dim": 1,
            "n_bins_loudness": 256,
            "output_loudness_dim": 384,
            // Semantic Features
            "use_whisper": false,
            "use_contentvec": false,
            "use_wenet": false,
            "use_mert": false,
            "whisper_dim": 1024,
            "contentvec_dim": 256,
            "mert_dim": 256,
            "wenet_dim": 512,
            "content_encoder_dim": 384,
            // Speaker Features
            "output_singer_dim": 384,
            "singer_table_size": 512,
            "use_spkid": true
        },
        "diffusion": {
            "scheduler": "ddpm",
            "scheduler_settings": {
                "num_train_timesteps": 1000,
                "beta_start": 1.0e-4,
                "beta_end": 0.02,
                "beta_schedule": "linear"
            },
            // Diffusion steps encoder
            "step_encoder": {
                "dim_raw_embedding": 128,
                "dim_hidden_layer": 512,
                "activation": "SiLU",
                "num_layer": 2,
                "max_period": 10000
            },
            // Diffusion decoder
            "model_type": "bidilconv",
            // bidilconv, unet2d, TODO: unet1d
            "bidilconv": {
                "base_channel": 384,
                "n_res_block": 20,
                "conv_kernel_size": 3,
                "dilation_cycle_length": 4,
                // specially, 1 means no dilation
                "conditioner_size": 384
            },
            "unet2d": {
                "in_channels": 1,
                "out_channels": 1,
                "down_block_types": [
                    "CrossAttnDownBlock2D",
                    "CrossAttnDownBlock2D",
                    "CrossAttnDownBlock2D",
                    "DownBlock2D"
                ],
                "mid_block_type": "UNetMidBlock2DCrossAttn",
                "up_block_types": [
                    "UpBlock2D",
                    "CrossAttnUpBlock2D",
                    "CrossAttnUpBlock2D",
                    "CrossAttnUpBlock2D"
                ],
                "only_cross_attention": false
            }
        }
    },
    "train": {
        // Basic settings
        "batch_size": 64,
        "gradient_accumulation_step": 1,
        "max_epoch": -1,
        // -1 means no limit
        "save_checkpoint_stride": [5, 20],
        // unit is epoch
        "keep_last": [3, -1],
        // -1 means infinite, if one number will broadcast
        "run_eval": [false, true],
        // if one number will broadcast
        // Fix the random seed
        "random_seed": 10086,
        // Batchsampler
        "sampler": {
            "holistic_shuffle": true,
            "drop_last": true
        },
        // Dataloader
        "dataloader": {
            "num_worker": 32,
            "pin_memory": true
        },
        // Trackers
        "tracker": [
            "tensorboard"
            // "wandb",
            // "cometml",
            // "mlflow",
        ],
        // Optimizer
        "optimizer": "AdamW",
        "adamw": {
            "lr": 4.0e-4
            // nn model lr
        },
        // LR Scheduler
        "scheduler": "ReduceLROnPlateau",
        "reducelronplateau": {
            "factor": 0.8,
            "patience": 10,
            // unit is epoch
            "min_lr": 1.0e-4
        }
    },
    "inference": {
        "diffusion": {
            "scheduler": "pndm",
            "scheduler_settings": {
                "num_inference_timesteps": 1000
            }
        }
    }
}
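The scheduler names and settings above mirror the Hugging Face `diffusers` API; assuming that is indeed the backend they configure, they would map onto scheduler objects roughly like this:

```python
from diffusers import DDPMScheduler, PNDMScheduler

# Training-time scheduler, from "scheduler": "ddpm" and its settings.
train_scheduler = DDPMScheduler(
    num_train_timesteps=1000,
    beta_start=1.0e-4,
    beta_end=0.02,
    beta_schedule="linear",
)

# Inference-time scheduler, from "scheduler": "pndm".
infer_scheduler = PNDMScheduler(num_train_timesteps=1000)
infer_scheduler.set_timesteps(1000)  # "num_inference_timesteps"
```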
Amphion/config/transformer.json
ADDED
@@ -0,0 +1,179 @@
{
    "base_config": "config/svc/base.json",
    "model_type": "Transformer",
    "task_type": "svc",
    "preprocess": {
        // data augmentations
        "use_pitch_shift": false,
        "use_formant_shift": false,
        "use_time_stretch": false,
        "use_equalizer": false,
        // acoustic features
        "extract_mel": true,
        "mel_min_max_norm": true,
        "extract_pitch": true,
        "pitch_extractor": "parselmouth",
        "extract_uv": true,
        "extract_energy": true,
        // content features
        "extract_whisper_feature": false,
        "whisper_sample_rate": 16000,
        "extract_contentvec_feature": false,
        "contentvec_sample_rate": 16000,
        "extract_wenet_feature": false,
        "wenet_sample_rate": 16000,
        "extract_mert_feature": false,
        "mert_sample_rate": 16000,
        // Default config for whisper
        "whisper_frameshift": 0.01,
        "whisper_downsample_rate": 2,
        // Default config for content vector
        "contentvec_frameshift": 0.02,
        // Default config for mert
        "mert_model": "m-a-p/MERT-v1-330M",
        "mert_feature_layer": -1,
        "mert_hop_size": 320,
        // 24k
        "mert_frameshit": 0.01333,
        // 10ms
        "wenet_frameshift": 0.01,
        // wenetspeech is 4, gigaspeech is 6
        "wenet_downsample_rate": 4,
        // Default config
        "n_mel": 100,
        "win_size": 1024,
        // todo
        "hop_size": 256,
        "sample_rate": 24000,
        "n_fft": 1024,
        // todo
        "fmin": 0,
        "fmax": 12000,
        // todo
        "f0_min": 50,
        // ~C2
        "f0_max": 1100,
        //1100, // ~C6(1100), ~G5(800)
        "pitch_bin": 256,
        "pitch_max": 1100.0,
        "pitch_min": 50.0,
        "is_label": true,
        "is_mu_law": true,
        "bits": 8,
        "mel_min_max_stats_dir": "mel_min_max_stats",
        "whisper_dir": "whisper",
        "contentvec_dir": "contentvec",
        "wenet_dir": "wenet",
        "mert_dir": "mert",
        // Extract content features using dataloader
        "pin_memory": true,
        "num_workers": 8,
        "content_feature_batch_size": 16,
        // Features used for model training
        "use_mel": true,
        "use_min_max_norm_mel": true,
        "use_frame_pitch": true,
        "use_uv": true,
        "use_frame_energy": true,
        "use_log_scale_pitch": false,
        "use_log_scale_energy": false,
        "use_spkid": true,
        // Meta file
        "train_file": "train.json",
        "valid_file": "test.json",
        "spk2id": "singers.json",
        "utt2spk": "utt2singer"
    },
    "model": {
        "condition_encoder": {
            "merge_mode": "add",
            "input_melody_dim": 1,
            "use_log_f0": true,
            "n_bins_melody": 256,
            // Quantization (0 for not quantization)
            "output_melody_dim": 384,
            "input_loudness_dim": 1,
            "use_log_loudness": true,
            "n_bins_loudness": 256,
            "output_loudness_dim": 384,
            "use_whisper": false,
            "use_contentvec": true,
            "use_wenet": false,
            "use_mert": false,
            "whisper_dim": 1024,
            "contentvec_dim": 256,
            "mert_dim": 256,
            "wenet_dim": 512,
            "content_encoder_dim": 384,
            "output_singer_dim": 384,
            "singer_table_size": 512,
            "output_content_dim": 384,
            "use_spkid": true
        },
        "transformer": {
            "type": "conformer",
            // 'conformer' or 'transformer'
            "input_dim": 384,
            "output_dim": 100,
            "n_heads": 2,
            "n_layers": 6,
            "filter_channels": 512,
            "dropout": 0.1,
        }
    },
    "train": {
        // Basic settings
        "batch_size": 64,
        "gradient_accumulation_step": 1,
        "max_epoch": -1,
        // -1 means no limit
        "save_checkpoint_stride": [10, 100],
        // unit is epoch
        "keep_last": [3, -1],
        // -1 means infinite, if one number will broadcast
        "run_eval": [false, true],
        // if one number will broadcast
        // Fix the random seed
        "random_seed": 10086,
        // Batchsampler
        "sampler": {
            "holistic_shuffle": true,
            "drop_last": true
        },
        // Dataloader
        "dataloader": {
            "num_worker": 32,
            "pin_memory": true
        },
        // Trackers
        "tracker": [
            "tensorboard"
            // "wandb",
            // "cometml",
            // "mlflow",
        ],
        // Optimizer
        "optimizer": "AdamW",
        "adamw": {
            "lr": 4.0e-4
            // nn model lr
        },
        // LR Scheduler
        "scheduler": "ReduceLROnPlateau",
        "reducelronplateau": {
            "factor": 0.8,
            "patience": 10,
            // unit is epoch
            "min_lr": 1.0e-4
        }
    }
}
Amphion/config/tts.json
ADDED
@@ -0,0 +1,25 @@
{
    "base_config": "config/base.json",
    "supported_model_type": [
        "Fastspeech2",
        "VITS",
        "VALLE",
        "NaturalSpeech2"
    ],
    "task_type": "tts",
    "preprocess": {
        "language": "en-us", // espeak supports 100 languages https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md
        // linguistic features
        "extract_phone": true,
        "phone_extractor": "espeak", // "espeak, pypinyin, pypinyin_initials_finals, lexicon (only for language=en-us right now)"
        "lexicon_path": "./text/lexicon/librispeech-lexicon.txt",
        // Directory names of processed data or extracted features
        "phone_dir": "phones",
        "use_phone": true,
        "add_blank": true
    },
    "model": {
        "text_token_num": 512,
    }

}
Amphion/config/valle.json
ADDED
@@ -0,0 +1,55 @@
{
    "base_config": "config/tts.json",
    "model_type": "VALLE",
    "task_type": "tts",
    "dataset": ["libritts"],
    "preprocess": {
        "extract_phone": true,
        "phone_extractor": "espeak", // phoneme extractor: espeak, pypinyin, pypinyin_initials_finals or lexicon
        "extract_acoustic_token": true,
        "acoustic_token_extractor": "Encodec", // acoustic token extractor: encodec, dac(todo)
        "acoustic_token_dir": "acoutic_tokens",
        "use_text": false,
        "use_phone": true,
        "use_acoustic_token": true,
        "symbols_dict": "symbols.dict",
        "min_duration": 0.5, // the duration lower bound to filter audio with duration < min_duration
        "max_duration": 14, // the duration upper bound to filter audio with duration > max_duration
        "sample_rate": 24000,
        "codec_hop_size": 320
    },
    "model": {
        "text_token_num": 512,
        "audio_token_num": 1024,
        "decoder_dim": 1024, // embedding dimension of the decoder model
        "nhead": 16, // number of attention heads in the decoder layers
        "num_decoder_layers": 12, // number of decoder layers
        "norm_first": true, // pre or post Normalization
        "add_prenet": false, // whether to add a PreNet after the inputs
        "prefix_mode": 0, // mode for how to prefix the VALL-E NAR Decoder, 0: no prefix, 1: 0 to random, 2: random to random, 4: chunk of pre or post utterance
        "share_embedding": true, // share the parameters of the output projection layer with the parameters of the acoustic embedding
        "nar_scale_factor": 1, // model scale factor which will be assigned different meanings in different models
        "prepend_bos": false, // whether to prepend <BOS> to the acoustic tokens -> AR Decoder inputs
        "num_quantizers": 8, // number of the audio quantization layers
        // "scaling_xformers": false, // Apply Reworked Conformer scaling on Transformers
    },
    "train": {
        "use_dynamic_batchsize": false, // whether to use dynamic batch size
        "ddp": false,
        "train_stage": 1, // 0: train all modules; for VALL-E, supports 1: AR Decoder, 2: NAR Decoder(s)
        "max_epoch": 20,
        "optimizer": "AdamW",
        "scheduler": "cosine",
        "warmup_steps": 16000, // number of steps that affects how rapidly the learning rate decreases
        "total_training_steps": 800000,
        "base_lr": 1e-4, // base learning rate
        "valid_interval": 1000,
        "log_epoch_step": 1000,
        "save_checkpoint_stride": [1, 1]
    }
}
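The duration filter and codec settings above fix the AR decoder's sequence-length budget; with EnCodec at 24 kHz and a 320-sample hop, each second of audio costs 75 acoustic tokens per quantizer. Plain arithmetic, not an Amphion API:

```python
sample_rate, codec_hop_size = 24000, 320

tokens_per_second = sample_rate // codec_hop_size  # 75
max_duration = 14                                  # seconds, from "max_duration"
max_ar_steps = tokens_per_second * max_duration    # at most 1050 AR steps
assert (tokens_per_second, max_ar_steps) == (75, 1050)
```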
Amphion/config/vits.json
ADDED
@@ -0,0 +1,101 @@
{
    "base_config": "config/tts.json",
    "model_type": "VITS",
    "task_type": "tts",
    "preprocess": {
        "extract_phone": true,
        "extract_mel": true,
        "n_mel": 80,
        "fmin": 0,
        "fmax": null,
        "extract_linear_spec": true,
        "extract_audio": true,
        "use_linear": true,
        "use_mel": true,
        "use_audio": true,
        "use_text": false,
        "use_phone": true,
        "lexicon_path": "./text/lexicon/librispeech-lexicon.txt",
        "n_fft": 1024,
        "win_size": 1024,
        "hop_size": 256,
        "segment_size": 8192,
        "text_cleaners": ["english_cleaners"]
    },
    "model": {
        "text_token_num": 512,
        "inter_channels": 192,
        "hidden_channels": 192,
        "filter_channels": 768,
        "n_heads": 2,
        "n_layers": 6,
        "kernel_size": 3,
        "p_dropout": 0.1,
        "resblock": "1",
        "resblock_kernel_sizes": [3, 7, 11],
        "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
        "upsample_rates": [8, 8, 2, 2],
        "upsample_initial_channel": 512,
        "upsample_kernel_sizes": [16, 16, 4, 4],
        "n_layers_q": 3,
        "use_spectral_norm": false,
        "n_speakers": 0, // number of speakers, will be automatically set if n_speakers is 0 and multi_speaker_training is true
        "gin_channels": 256,
        "use_sdp": true
    },
    "train": {
        "fp16_run": true,
        "learning_rate": 2e-4,
        "betas": [0.8, 0.99],
        "eps": 1e-9,
        "batch_size": 16,
        "lr_decay": 0.999875,
        // "segment_size": 8192,
        "init_lr_ratio": 1,
        "warmup_epochs": 0,
        "c_mel": 45,
        "c_kl": 1.0,
        "AdamW": {
            "betas": [0.8, 0.99],
            "eps": 1e-9,
        }
    }
}
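A sanity check worth knowing for the decoder settings above (a general HiFi-GAN-style constraint, not an Amphion-specific API): the upsample rates must multiply out to `hop_size`, so that one mel frame expands to exactly one hop of waveform samples.

```python
import math

# From the config: "upsample_rates": [8, 8, 2, 2] and "hop_size": 256.
upsample_rates = [8, 8, 2, 2]
hop_size = 256
assert math.prod(upsample_rates) == hop_size  # 8 * 8 * 2 * 2 == 256
```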
Amphion/config/vitssvc.json
ADDED
@@ -0,0 +1,306 @@
{
    "base_config": "config/svc/base.json",
    "model_type": "VITS",
    "task_type": "svc",
    "preprocess": {
        // Config for features extraction
        "extract_mel": true,
        "extract_pitch": true,
        "pitch_extractor": "parselmouth",
        "extract_energy": true,
        "extract_uv": true,
        "extract_linear_spec": true,
        "extract_audio": true,
        "mel_min_max_norm": true,
        // Config for features usage
        "use_linear": true,
        "use_mel": true,
        "use_min_max_norm_mel": false,
        "use_audio": true,
        "use_frame_pitch": true,
        "use_uv": true,
        "use_spkid": true,
        "use_contentvec": false,
        "use_whisper": false,
        "use_wenet": false,
        "use_text": false,
        "use_phone": false,
        "fmin": 0,
        "fmax": 12000,
        "f0_min": 50,
        "f0_max": 1100,
        // f0_bin in sovits
        "pitch_bin": 256,
        // filter_length in sovits
        "n_fft": 1024,
        // hop_length in sovits
        "hop_size": 256,
        // win_length in sovits
        "win_size": 1024,
        "segment_size": 8192,
        "n_mel": 100,
        "sample_rate": 24000,
        "mel_min_max_stats_dir": "mel_min_max_stats",
        "whisper_dir": "whisper",
        "contentvec_dir": "contentvec",
        "wenet_dir": "wenet",
        "mert_dir": "mert",
        // Meta file
        "train_file": "train.json",
        "valid_file": "test.json",
        "spk2id": "singers.json",
        "utt2spk": "utt2singer"
    },
    "model": {
        "condition_encoder": {
            "merge_mode": "add",
            "input_melody_dim": 1,
            "use_log_f0": true,
            "n_bins_melody": 256,
            "output_melody_dim": 384,
            "input_loudness_dim": 1,
            "use_log_loudness": true,
            "n_bins_loudness": 256,
            "output_loudness_dim": 384,
            "use_whisper": false,
            "use_contentvec": false,
            "use_wenet": false,
            "use_mert": false,
            "whisper_dim": 1024,
            "contentvec_dim": 256,
            "mert_dim": 256,
            "wenet_dim": 512,
            "content_encoder_dim": 384,
            "singer_table_size": 512,
            "output_singer_dim": 384,
            "output_content_dim": 384,
            "use_spkid": true,
            "pitch_max": 1100.0,
            "pitch_min": 50.0,
        },
        "vits": {
            "filter_channels": 256,
            "gin_channels": 256,
            "hidden_channels": 384,
            "inter_channels": 384,
            "kernel_size": 3,
            "n_flow_layer": 4,
            "n_heads": 2,
            "n_layers": 6,
            "n_layers_q": 3,
            "n_speakers": 512,
            "p_dropout": 0.1,
            "use_spectral_norm": false,
        },
        "generator": "hifigan",
        "generator_config": {
            "hifigan": {
                "resblock": "1",
                "resblock_kernel_sizes": [3, 7, 11],
                "upsample_rates": [8, 8, 2, 2],
                "upsample_kernel_sizes": [16, 16, 4, 4],
                "upsample_initial_channel": 512,
                "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
            },
            "melgan": {
                "ratios": [8, 8, 2, 2],
                "ngf": 32,
                "n_residual_layers": 3,
                "num_D": 3,
                "ndf": 16,
                "n_layers": 4,
                "downsampling_factor": 4
            },
            "bigvgan": {
                "resblock": "1",
                "activation": "snakebeta",
                "snake_logscale": true,
                "upsample_rates": [8, 8, 2, 2],
                "upsample_kernel_sizes": [16, 16, 4, 4],
                "upsample_initial_channel": 512,
                "resblock_kernel_sizes": [3, 7, 11],
                "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
            },
            "nsfhifigan": {
                "resblock": "1",
                "harmonic_num": 8,
                "upsample_rates": [8, 8, 2, 2],
                "upsample_kernel_sizes": [16, 16, 4, 4],
                "upsample_initial_channel": 768,
                "resblock_kernel_sizes": [3, 7, 11],
                "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
            },
            "apnet": {
                "ASP_channel": 512,
                "ASP_resblock_kernel_sizes": [3, 7, 11],
                "ASP_resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
                "ASP_input_conv_kernel_size": 7,
                "ASP_output_conv_kernel_size": 7,
                "PSP_channel": 512,
                "PSP_resblock_kernel_sizes": [3, 7, 11],
                "PSP_resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
                "PSP_input_conv_kernel_size": 7,
                "PSP_output_R_conv_kernel_size": 7,
                "PSP_output_I_conv_kernel_size": 7,
            }
        },
    },
    "train": {
        "fp16_run": true,
        "learning_rate": 2e-4,
        "betas": [0.8, 0.99],
        "eps": 1e-9,
        "batch_size": 16,
        "lr_decay": 0.999875,
        // "segment_size": 8192,
        "init_lr_ratio": 1,
        "warmup_epochs": 0,
        "c_mel": 45,
        "c_kl": 1.0,
        "AdamW": {
            "betas": [0.8, 0.99],
            "eps": 1e-9,
        }
    }
}
Amphion/config/vocoder.json
ADDED
@@ -0,0 +1,84 @@
{
    "base_config": "config/base.json",
    "dataset": [
        "LJSpeech", "LibriTTS", "opencpop", "m4singer", "svcc", "svcceval",
        "pjs", "opensinger", "popbutfy", "nus48e", "popcs", "kising",
        "csd", "opera", "vctk", "lijian", "cdmusiceval"
    ],
    "task_type": "vocoder",
    "preprocess": {
        // acoustic features
        "extract_mel": true,
        "extract_pitch": false,
        "extract_uv": false,
        "extract_audio": true,
        "extract_label": false,
        "extract_one_hot": false,
        "extract_amplitude_phase": false,
        "pitch_extractor": "parselmouth",
        // Settings for data preprocessing
        "n_mel": 100,
        "win_size": 1024,
        "hop_size": 256,
        "sample_rate": 24000,
        "n_fft": 1024,
        "fmin": 0,
        "fmax": 12000,
        "f0_min": 50,
        "f0_max": 1100,
        "pitch_bin": 256,
        "pitch_max": 1100.0,
        "pitch_min": 50.0,
        "is_mu_law": false,
        "bits": 8,
        "cut_mel_frame": 32,
        // Directory names of processed data or extracted features
        "spk2id": "singers.json",
        // Features used for model training
        "use_mel": true,
        "use_frame_pitch": false,
        "use_uv": false,
        "use_audio": true,
        "use_label": false,
        "use_one_hot": false,
        "train_file": "train.json",
        "valid_file": "test.json"
    },
    "train": {
        "random_seed": 114514,
        "batch_size": 64,
        "gradient_accumulation_step": 1,
        "max_epoch": 1000000,
        "save_checkpoint_stride": [20],
        "run_eval": [true],
        "sampler": {
            "holistic_shuffle": true,
            "drop_last": true
        },
        "dataloader": {
            "num_worker": 16,
            "pin_memory": true
        },
        "tracker": ["tensorboard"],
    }
}
Amphion/egs/codec/FAcodec/README.md
ADDED
@@ -0,0 +1,51 @@
# FAcodec

PyTorch implementation for training FAcodec, which was proposed in the paper [NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models](https://arxiv.org/pdf/2403.03100).

A dedicated repository for the FAcodec model can also be found [here](https://github.com/Plachtaa/FAcodec).

This implementation makes some key improvements to the training pipeline, eliminating the need for any form of annotation, including transcripts, phoneme alignments, and speaker labels. All you need is raw speech files.
With the new training pipeline, it is possible to train the model on more languages with more diverse timbre distributions.
We release the code for training and inference, including a checkpoint pretrained on 50k hours of speech data with over 1 million speakers.

## Model storage
We provide pretrained checkpoints trained on 50k hours of speech data.

| Model type | Link |
|------------|------|
| FAcodec | [![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-FAcodec-blue)](https://huggingface.co/Plachta/FAcodec) |

## Demo
Try our model on [![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-blue)](https://huggingface.co/spaces/Plachta/FAcodecV2)!

## Training
Prepare your data and put it under one folder; the internal file structure does not matter.
Then, change the `dataset` in `./egs/codec/FAcodec/exp_custom_data.json` to the path of your data folder.
Finally, run the following command:
```bash
sh ./egs/codec/FAcodec/train.sh
```

## Inference
To reconstruct a speech file, run:
```bash
python ./bins/codec/inference.py --source <source_wav> --output_dir <output_dir> --checkpoint_path <checkpoint_path>
```
To use zero-shot voice conversion, run:
```bash
python ./bins/codec/inference.py --source <source_wav> --reference <reference_wav> --output_dir <output_dir> --checkpoint_path <checkpoint_path>
```

## Feature extraction
When running `./bins/codec/inference.py`, check the returned results of the `FAcodecInference` class: a tuple of `(quantized, codes)`.
- `quantized` is the quantized representation of the input speech file.
  - `quantized[0]` is the quantized representation of prosody
  - `quantized[1]` is the quantized representation of content

- `codes` is the discrete code representation of the input speech file.
  - `codes[0]` is the discrete code representation of prosody
  - `codes[1]` is the discrete code representation of content

For the cleanest content representation without any timbre, we suggest using `codes[1][:, 0, :]`, which is the first layer of the content codebooks.
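A minimal sketch of pulling out that slice, assuming each element of `codes` is an integer tensor shaped `(batch, n_codebooks, frames)` as described above; the dummy tensors stand in for real model outputs, and `n_c_codebooks` is 2 in the config below:

```python
import torch

# Stand-ins for real outputs: one code stream for prosody, one for content.
codes_prosody = torch.randint(0, 1024, (1, 1, 250))
codes_content = torch.randint(0, 1024, (1, 2, 250))  # n_c_codebooks = 2
codes = (codes_prosody, codes_content)

# First layer of the content codebooks: the cleanest timbre-free representation.
clean_content = codes[1][:, 0, :]
assert clean_content.shape == (1, 250)
```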
Amphion/egs/codec/FAcodec/exp_custom_data.json
ADDED
@@ -0,0 +1,80 @@
{
    "exp_name": "facodec",
    "model_type": "FAcodec",

    "log_dir": "./runs/",
    "log_interval": 10,
    "save_interval": 1000,
    "device": "cuda",
    "epochs": 1000,
    "batch_size": 4,
    "batch_length": 100,
    "max_len": 80,
    "pretrained_model": "",
    "load_only_params": false,
    "F0_path": "modules/JDC/bst.t7",
    "dataset": "/path/to/dataset",
    "preprocess_params": {
        "sr": 24000,
        "frame_rate": 80,
        "duration_range": [1.0, 25.0],
        "spect_params": {
            "n_fft": 2048,
            "win_length": 1200,
            "hop_length": 300,
            "n_mels": 80
        }
    },
    "train": {
        "gradient_accumulation_step": 1,
        "batch_size": 1,
        "save_checkpoint_stride": [20],
        "random_seed": 1234,
        "max_epoch": -1,
        "max_frame_len": 80,
        "tracker": ["tensorboard"],
        "run_eval": [false],
        "sampler": {
            "holistic_shuffle": true,
            "drop_last": true
        },
        "dataloader": {
            "num_worker": 0,
            "pin_memory": true
        }
    },
    "model_params": {
        "causal": true,
        "lstm": 2,
        "norm_f0": true,
        "use_gr_content_f0": false,
        "use_gr_prosody_phone": false,
        "use_gr_timbre_prosody": false,
        "separate_prosody_encoder": true,
        "n_c_codebooks": 2,
        "timbre_norm": true,
        "use_gr_content_global_f0": true,
        "DAC": {
            "encoder_dim": 64,
            "encoder_rates": [2, 5, 5, 6],
            "decoder_dim": 1536,
            "decoder_rates": [6, 5, 5, 2],
            "sr": 24000
        }
    },
    "loss_params": {
        "base_lr": 0.0001,
        "warmup_steps": 200,
        "discriminator_iter_start": 2000,
        "lambda_spk": 1.0,
        "lambda_mel": 45,
        "lambda_f0": 1.0,
        "lambda_uv": 1.0
    }
}
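One internal consistency worth noting in this config: the `frame_rate` of 80 follows directly from the STFT settings, since `sr / hop_length` gives frames per second of audio.

```python
# Plain arithmetic over the values in exp_custom_data.json above.
sr, hop_length, frame_rate = 24000, 300, 80
assert sr // hop_length == frame_rate  # 24000 / 300 == 80
```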
Amphion/egs/codec/FAcodec/train.sh
ADDED
@@ -0,0 +1,27 @@
export PYTHONPATH="./"

######## Build Experiment Environment ###########
exp_dir="./egs/codec/FAcodec"
echo exp_dir: $exp_dir
work_dir="./" # Amphion root folder
echo work_dir: $work_dir

export WORK_DIR=$work_dir
export PYTHONPATH=$work_dir
export PYTHONIOENCODING=UTF-8

######## Set Config File Dir ##############
if [ -z "$exp_config" ]; then
    exp_config="${exp_dir}"/exp_custom_data.json
fi
echo "Experimental Configuration File: $exp_config"

######## Set the experiment name ##########
exp_name="facodec"

port=53333 # a random number for port

######## Train Model ###########
echo "Experiment Name: $exp_name"
accelerate launch --main_process_port $port "${work_dir}"/bins/codec/train.py --config $exp_config \
--exp_name $exp_name --log_level debug $1
Amphion/egs/datasets/README.md
ADDED
@@ -0,0 +1,458 @@
1 |
+
# Datasets Format
|
2 |
+
|
3 |
+
Amphion support the following academic datasets (sort alphabetically):
|
4 |
+
|
5 |
+
- [Datasets Format](#datasets-format)
|
6 |
+
- [AudioCaps](#audiocaps)
|
7 |
+
- [CSD](#csd)
|
8 |
+
- [CustomSVCDataset](#customsvcdataset)
|
9 |
+
- [Hi-Fi TTS](#hifitts)
|
10 |
+
- [KiSing](#kising)
|
11 |
+
- [LibriLight](#librilight)
|
12 |
+
- [LibriTTS](#libritts)
|
13 |
+
- [LJSpeech](#ljspeech)
|
14 |
+
- [M4Singer](#m4singer)
|
15 |
+
- [NUS-48E](#nus-48e)
|
16 |
+
- [Opencpop](#opencpop)
|
17 |
+
- [OpenSinger](#opensinger)
|
18 |
+
- [Opera](#opera)
|
19 |
+
- [PopBuTFy](#popbutfy)
|
20 |
+
- [PopCS](#popcs)
|
21 |
+
- [PJS](#pjs)
|
22 |
+
- [SVCC](#svcc)
|
23 |
+
- [VCTK](#vctk)
|
24 |
+
|
25 |
+
The downloading link and the file structure tree of each dataset is displayed as follows.
|
26 |
+
|
27 |
+
> **Note:** When using Docker to run Amphion, mount the dataset to the container is necessary after downloading. Check [Mount dataset in Docker container](./docker.md) for more details.
|
28 |
+
|
29 |
+
## AudioCaps
|
30 |
+
|
31 |
+
AudioCaps is a dataset of around 44K audio-caption pairs, where each audio clip corresponds to a caption with rich semantic information.
|
32 |
+
|
33 |
+
Download AudioCaps dataset [here](https://github.com/cdjkim/audiocaps). The file structure looks like below:
|
34 |
+
|
35 |
+
```plaintext
|
36 |
+
[AudioCaps dataset path]
|
37 |
+
β£ AudioCpas
|
38 |
+
β β£ wav
|
39 |
+
β β β£ ---1_cCGK4M_0_10000.wav
|
40 |
+
β β β£ ---lTs1dxhU_30000_40000.wav
|
41 |
+
β β β£ ...
|
42 |
+
```
|
43 |
+
|
44 |
+
## CSD
|
45 |
+
|
46 |
+
Download the official CSD dataset [here](https://zenodo.org/records/4785016). The file structure looks like below:
|
47 |
+
|
48 |
+
```plaintext
|
49 |
+
[CSD dataset path]
|
50 |
+
β£ english
|
51 |
+
β£ korean
|
52 |
+
β£ utterances
|
53 |
+
β β£ en001a
|
54 |
+
β β β£ {UtterenceID}.wav
|
55 |
+
β β£ en001b
|
56 |
+
β β£ en002a
|
57 |
+
β β£ en002b
|
58 |
+
β β£ ...
|
59 |
+
β£ README
|
60 |
+
```
|
61 |
+
|
62 |
+
## CustomSVCDataset
|
63 |
+
|
64 |
+
We support custom dataset for Singing Voice Conversion. Organize your data in the following structure to construct your own dataset:
|
65 |
+
|
66 |
+
```plaintext
|
67 |
+
[Your Custom Dataset Path]
|
68 |
+
β£ singer1
|
69 |
+
β β£ song1
|
70 |
+
β β β£ utterance1.wav
|
71 |
+
β β β£ utterance2.wav
|
72 |
+
β β β£ ...
|
73 |
+
β β£ song2
|
74 |
+
β β£ ...
|
75 |
+
β£ singer2
|
76 |
+
β£ ...
|
77 |
+
```
|
78 |
+
|
79 |
+
|
80 |
+
## Hi-Fi TTS
|
81 |
+
|
82 |
+
Download the official Hi-Fi TTS dataset [here](https://www.openslr.org/109/). The file structure looks like below:
|
83 |
+
|
84 |
+
```plaintext
|
85 |
+
[Hi-Fi TTS dataset path]
|
86 |
+
β£ audio
|
87 |
+
β β£ 11614_other {Speaker_ID}_{SNR_subset}
|
88 |
+
β β β£ 10547 {Book_ID}
|
89 |
+
β β β β£ thousandnights8_04_anonymous_0001.flac
|
90 |
+
β β β β£ thousandnights8_04_anonymous_0003.flac
|
91 |
+
β β β β£ thousandnights8_04_anonymous_0004.flac
|
92 |
+
β β β β£ ...
|
93 |
+
β β β£ ...
|
94 |
+
β β£ ...
|
95 |
+
β£ 92_manifest_clean_dev.json
|
96 |
+
β£ 92_manifest_clean_test.json
|
97 |
+
β£ 92_manifest_clean_train.json
|
98 |
+
β£ ...
|
99 |
+
β£ {Speaker_ID}_manifest_{SNR_subset}_{dataset_split}.json
|
100 |
+
β£ ...
|
101 |
+
β£ books_bandwidth.tsv
|
102 |
+
β£ LICENSE.txt
|
103 |
+
β£ readers_books_clean.txt
|
104 |
+
β£ readers_books_other.txt
|
105 |
+
β£ README.txt
|
106 |
+
|
107 |
+
```
|
108 |
+
|
109 |
+
## KiSing
|
110 |
+
|
111 |
+
Download the official KiSing dataset [here](http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/). The file structure looks like below:
|
112 |
+
|
113 |
+
```plaintext
|
114 |
+
[KiSing dataset path]
|
115 |
+
β£ clean
|
116 |
+
β β£ 421
|
117 |
+
β β£ 422
|
118 |
+
β β£ ...
|
119 |
+
```
|
120 |
+
|
121 |
+
## LibriLight
|
122 |
+
|
123 |
+
Download the official LibriLight dataset [here](https://github.com/facebookresearch/libri-light). The file structure looks like below:
|
124 |
+
|
125 |
+
```plaintext
|
126 |
+
[LibriTTS dataset path]
|
127 |
+
β£ small (Subset)
|
128 |
+
β β£ 100 {Speaker_ID}
|
129 |
+
β β β£ sea_fairies_0812_librivox_64kb_mp3 {Chapter_ID}
|
130 |
+
β β β β£ 01_baum_sea_fairies_64kb.flac
|
131 |
+
β β β β£ 02_baum_sea_fairies_64kb.flac
|
132 |
+
β β β β£ 03_baum_sea_fairies_64kb.flac
|
133 |
+
β β β β£ 22_baum_sea_fairies_64kb.flac
|
134 |
+
β β β β£ 01_baum_sea_fairies_64kb.json
|
135 |
+
β β β β£ 02_baum_sea_fairies_64kb.json
|
136 |
+
β β β β£ 03_baum_sea_fairies_64kb.json
|
137 |
+
β β β β£ 22_baum_sea_fairies_64kb.json
|
138 |
+
β β β β£ ...
|
139 |
+
β β β£ ...
|
140 |
+
β β£ ...
|
141 |
+
β£ medium (Subset)
|
142 |
+
β£ ...
|
143 |
+
```
|
144 |
+
|

## LibriTTS

Download the official LibriTTS dataset [here](https://www.openslr.org/60/). The file structure is as follows:

```plaintext
[LibriTTS dataset path]
 ┣ BOOKS.txt
 ┣ CHAPTERS.txt
 ┣ eval_sentences10.tsv
 ┣ LICENSE.txt
 ┣ NOTE.txt
 ┣ reader_book.tsv
 ┣ README_librispeech.txt
 ┣ README_libritts.txt
 ┣ speakers.tsv
 ┣ SPEAKERS.txt
 ┣ dev-clean (Subset)
 ┃ ┣ 1272 {Speaker_ID}
 ┃ ┃ ┣ 128104 {Chapter_ID}
 ┃ ┃ ┃ ┣ 1272_128104_000001_000000.normalized.txt
 ┃ ┃ ┃ ┣ 1272_128104_000001_000000.original.txt
 ┃ ┃ ┃ ┣ 1272_128104_000001_000000.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┃ ┣ 1272_128104.book.tsv
 ┃ ┃ ┃ ┣ 1272_128104.trans.tsv
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┣ dev-other (Subset)
 ┃ ┣ 116 {Speaker_ID}
 ┃ ┃ ┣ 288045 {Chapter_ID}
 ┃ ┃ ┃ ┣ 116_288045_000003_000000.normalized.txt
 ┃ ┃ ┃ ┣ 116_288045_000003_000000.original.txt
 ┃ ┃ ┃ ┣ 116_288045_000003_000000.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┃ ┣ 116_288045.book.tsv
 ┃ ┃ ┃ ┣ 116_288045.trans.tsv
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┣ test-clean (Subset)
 ┃ ┣ {Speaker_ID}
 ┃ ┃ ┣ {Chapter_ID}
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┣ test-other (Subset)
 ┃ ┣ {Speaker_ID}
 ┃ ┃ ┣ {Chapter_ID}
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┣ train-clean-100 (Subset)
 ┃ ┣ {Speaker_ID}
 ┃ ┃ ┣ {Chapter_ID}
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┣ train-clean-360 (Subset)
 ┃ ┣ {Speaker_ID}
 ┃ ┃ ┣ {Chapter_ID}
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┣ train-other-500 (Subset)
 ┃ ┣ {Speaker_ID}
 ┃ ┃ ┣ {Chapter_ID}
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
 ┃ ┃ ┣ ...
 ┃ ┣ ...
```
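
Audio and transcript files share the `{Speaker_ID}_{Chapter_ID}_{Utterance_ID}` stem, so the IDs can be recovered from a filename alone. A minimal sketch (the `parse_libritts_name` helper and the returned key names are illustrative):

```python
from pathlib import Path


def parse_libritts_name(wav_path: str):
    """Split a LibriTTS stem like "1272_128104_000001_000000" into
    speaker, chapter, and utterance IDs."""
    stem = Path(wav_path).stem
    speaker_id, chapter_id, *rest = stem.split("_")
    return {"speaker": speaker_id, "chapter": chapter_id,
            "utterance": "_".join(rest)}


# parse_libritts_name("dev-clean/1272/128104/1272_128104_000001_000000.wav")
# -> {"speaker": "1272", "chapter": "128104", "utterance": "000001_000000"}
```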

## LJSpeech

Download the official LJSpeech dataset [here](https://keithito.com/LJ-Speech-Dataset/). The file structure is as follows:

```plaintext
[LJSpeech dataset path]
 ┣ metadata.csv
 ┣ wavs
 ┃ ┣ LJ001-0001.wav
 ┃ ┣ LJ001-0002.wav
 ┃ ┣ ...
 ┣ README
```
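
`metadata.csv` is pipe-delimited, one utterance per line: the file ID, the raw transcription, and the normalized transcription. A minimal loader sketch (the `load_ljspeech_metadata` helper is illustrative):

```python
import csv


def load_ljspeech_metadata(metadata_path: str):
    """Parse metadata.csv lines of the form <id>|<raw text>|<normalized text>
    and map each ID to its texts and wav path."""
    entries = {}
    with open(metadata_path, "r", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            utt_id = row[0]
            entries[utt_id] = {
                "text": row[1],
                "normalized_text": row[-1],
                "wav": f"wavs/{utt_id}.wav",
            }
    return entries
```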

## M4Singer

Download the official M4Singer dataset [here](https://drive.google.com/file/d/1xC37E59EWRRFFLdG3aJkVqwtLDgtFNqW/view). The file structure is as follows:

```plaintext
[M4Singer dataset path]
 ┣ {Singer_1}#{Song_1}
 ┃ ┣ 0000.mid
 ┃ ┣ 0000.TextGrid
 ┃ ┣ 0000.wav
 ┃ ┣ ...
 ┣ {Singer_1}#{Song_2}
 ┣ ...
 ┣ {Singer_2}#{Song_1}
 ┣ {Singer_2}#{Song_2}
 ┣ ...
 ┗ meta.json
```
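
Each top-level folder encodes singer and song in its name, separated by `#`, and the segment files inside share four-digit stems across `.wav`, `.mid`, and `.TextGrid`. A minimal indexing sketch (the `index_m4singer` helper is illustrative):

```python
from collections import defaultdict
from pathlib import Path


def index_m4singer(dataset_root: str):
    """Group segment stems (0000, 0001, ...) by (singer, song), following
    the "{Singer}#{Song}" directory naming shown above."""
    index = defaultdict(list)
    for song_dir in Path(dataset_root).iterdir():
        if not song_dir.is_dir() or "#" not in song_dir.name:
            continue  # skips meta.json and anything unexpected
        singer, song = song_dir.name.split("#", 1)
        for wav in sorted(song_dir.glob("*.wav")):
            # The matching .mid and .TextGrid files share this stem.
            index[(singer, song)].append(wav.stem)
    return index
```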

## NUS-48E

Download the official NUS-48E dataset [here](https://drive.google.com/drive/folders/12pP9uUl0HTVANU3IPLnumTJiRjPtVUMx). The file structure is as follows:

```plaintext
[NUS-48E dataset path]
 ┣ {SpeakerID}
 ┃ ┣ read
 ┃ ┃ ┣ {SongID}.txt
 ┃ ┃ ┣ {SongID}.wav
 ┃ ┃ ┣ ...
 ┃ ┣ sing
 ┃ ┃ ┣ {SongID}.txt
 ┃ ┃ ┣ {SongID}.wav
 ┃ ┃ ┣ ...
 ┣ ...
 ┗ README.txt
```

## Opencpop

Download the official Opencpop dataset [here](https://wenet.org.cn/opencpop/). The file structure is as follows:

```plaintext
[Opencpop dataset path]
 ┣ midis
 ┃ ┣ 2001.midi
 ┃ ┣ 2002.midi
 ┃ ┣ 2003.midi
 ┃ ┣ ...
 ┣ segments
 ┃ ┣ wavs
 ┃ ┃ ┣ 2001000001.wav
 ┃ ┃ ┣ 2001000002.wav
 ┃ ┃ ┣ 2001000003.wav
 ┃ ┃ ┣ ...
 ┃ ┣ test.txt
 ┃ ┣ train.txt
 ┃ ┗ transcriptions.txt
 ┣ textgrids
 ┃ ┣ 2001.TextGrid
 ┃ ┣ 2002.TextGrid
 ┃ ┣ 2003.TextGrid
 ┃ ┣ ...
 ┣ wavs
 ┃ ┣ 2001.wav
 ┃ ┣ 2002.wav
 ┃ ┣ 2003.wav
 ┃ ┣ ...
 ┣ TERMS_OF_ACCESS
 ┗ readme.md
```
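
`segments/transcriptions.txt` holds one `|`-separated line per segment. A parsing sketch follows; the field order shown (ID, text, phonemes, notes, note durations, phoneme durations, slur flags) is our reading of the dataset's `readme.md`, so verify it against your copy:

```python
def parse_opencpop_transcriptions(path: str):
    """Parse segments/transcriptions.txt, one '|'-separated line per segment.
    Field order is an assumption; check the dataset's readme.md."""
    utterances = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            fields = line.strip().split("|")
            utt_id, text, phones, notes, note_durs, phone_durs, slurs = fields
            utterances.append({
                "id": utt_id,
                "text": text,
                "phones": phones.split(),
                "notes": notes.split(),
                "note_durations": [float(d) for d in note_durs.split()],
                "phone_durations": [float(d) for d in phone_durs.split()],
                "slurs": [int(s) for s in slurs.split()],
            })
    return utterances
```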

## OpenSinger

Download the official OpenSinger dataset [here](https://drive.google.com/file/d/1EofoZxvalgMjZqzUEuEdleHIZ6SHtNuK/view). The file structure is as follows:

```plaintext
[OpenSinger dataset path]
 ┣ ManRaw
 ┃ ┣ {Singer_1}_{Song_1}
 ┃ ┃ ┣ {Singer_1}_{Song_1}_0.lab
 ┃ ┃ ┣ {Singer_1}_{Song_1}_0.txt
 ┃ ┃ ┣ {Singer_1}_{Song_1}_0.wav
 ┃ ┃ ┣ ...
 ┃ ┣ {Singer_1}_{Song_2}
 ┃ ┣ ...
 ┣ WomanRaw
 ┣ LICENSE
 ┗ README.md
```

## Opera

Download the official Opera dataset [here](http://isophonics.net/SingingVoiceDataset). The file structure is as follows:

```plaintext
[Opera dataset path]
 ┣ monophonic
 ┃ ┣ chinese
 ┃ ┃ ┣ {Gender}_{SingerID}
 ┃ ┃ ┃ ┣ {Emotion}_{SongID}.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┣ ...
 ┃ ┣ western
 ┣ polyphonic
 ┃ ┣ chinese
 ┃ ┣ western
 ┣ CrossculturalDataSet.xlsx
```

## PopBuTFy

Download the official PopBuTFy dataset [here](https://github.com/MoonInTheRiver/NeuralSVB). The file structure is as follows:

```plaintext
[PopBuTFy dataset path]
 ┣ data
 ┃ ┣ {SingerID}#singing#{SongName}_Amateur
 ┃ ┃ ┣ {SingerID}#singing#{SongName}_Amateur_{UtteranceID}.mp3
 ┃ ┃ ┣ ...
 ┃ ┣ {SingerID}#singing#{SongName}_Professional
 ┃ ┃ ┣ {SingerID}#singing#{SongName}_Professional_{UtteranceID}.mp3
 ┃ ┃ ┣ ...
 ┣ text_labels
 ┗ TERMS_OF_ACCESS
```

## PopCS

Download the official PopCS dataset [here](https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md). The file structure is as follows:

```plaintext
[PopCS dataset path]
 ┣ popcs
 ┃ ┣ popcs-{SongName}
 ┃ ┃ ┣ {UtteranceID}_ph.txt
 ┃ ┃ ┣ {UtteranceID}_wf0.wav
 ┃ ┃ ┣ {UtteranceID}.TextGrid
 ┃ ┃ ┣ {UtteranceID}.txt
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┗ TERMS_OF_ACCESS
```

## PJS

Download the official PJS dataset [here](https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus). The file structure is as follows:

```plaintext
[PJS dataset path]
 ┣ PJS_corpus_ver1.1
 ┃ ┣ background_noise
 ┃ ┣ pjs{SongID}
 ┃ ┃ ┣ pjs{SongID}_song.wav
 ┃ ┃ ┣ pjs{SongID}_speech.wav
 ┃ ┃ ┣ pjs{SongID}.lab
 ┃ ┃ ┣ pjs{SongID}.mid
 ┃ ┃ ┣ pjs{SongID}.musicxml
 ┃ ┃ ┣ pjs{SongID}.txt
 ┃ ┣ ...
```

## SVCC

Download the official SVCC dataset [here](https://github.com/lesterphillip/SVCC23_FastSVC/tree/main/egs/generate_dataset). The file structure is as follows:

```plaintext
[SVCC dataset path]
 ┣ Data
 ┃ ┣ CDF1
 ┃ ┃ ┣ 10001.wav
 ┃ ┃ ┣ 10002.wav
 ┃ ┃ ┣ ...
 ┃ ┣ CDM1
 ┃ ┣ IDF1
 ┃ ┣ IDM1
 ┗ README.md
```

## VCTK

Download the official VCTK dataset [here](https://datashare.ed.ac.uk/handle/10283/3443). The file structure is as follows:

```plaintext
[VCTK dataset path]
 ┣ txt
 ┃ ┣ {Speaker_1}
 ┃ ┃ ┣ {Speaker_1}_001.txt
 ┃ ┃ ┣ {Speaker_1}_002.txt
 ┃ ┃ ┣ ...
 ┃ ┣ {Speaker_2}
 ┃ ┣ ...
 ┣ wav48_silence_trimmed
 ┃ ┣ {Speaker_1}
 ┃ ┃ ┣ {Speaker_1}_001_mic1.flac
 ┃ ┃ ┣ {Speaker_1}_001_mic2.flac
 ┃ ┃ ┣ {Speaker_1}_002_mic1.flac
 ┃ ┃ ┣ ...
 ┃ ┣ {Speaker_2}
 ┃ ┣ ...
 ┣ speaker-info.txt
 ┗ update.txt
```
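
Most utterances were recorded with two microphones (`mic1`, `mic2`), and the transcripts under `txt/` mirror the audio folder layout. A minimal sketch for pairing each transcript with one microphone channel (the `vctk_pairs` helper is illustrative):

```python
from pathlib import Path


def vctk_pairs(dataset_root: str, mic: str = "mic1"):
    """Yield (txt_path, flac_path) pairs for one microphone channel,
    matching txt/{speaker}/{utt}.txt to
    wav48_silence_trimmed/{speaker}/{utt}_{mic}.flac."""
    root = Path(dataset_root)
    for txt_path in sorted(root.glob("txt/*/*.txt")):
        speaker = txt_path.parent.name
        flac_path = (root / "wav48_silence_trimmed" / speaker /
                     f"{txt_path.stem}_{mic}.flac")
        if flac_path.exists():  # some utterances lack one of the two mics
            yield txt_path, flac_path
```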