As part of some ongoing work, I'm releasing what is currently the largest collection of Docker containers for state-of-the-art voice cloning TTS systems: https://github.com/ttsds/datasets
Alongside it, there is also an overview of all systems (see below).
Takeaways:
- TorToiSe does very well, taking second place behind StyleTTS 2, which is also ranked first in the human evaluation at TTS-AGI/TTS-Arena.
- MetaVoice-1B's overall score is dragged down by its Intelligibility Score (probably due to utterances being cut short), but it achieves #3 in Speaker Score, which indicates good voice cloning ability.
- HierSpeech++ lands in the middle of the pack overall, but excels at the Environment Score, achieving #2. This means the model is especially good at modeling recording conditions such as microphone and background noise.
- The Amphion models achieve relatively low scores, possibly because they were not trained for as long as in the papers. However, they seem to struggle for different reasons: the autoregressive VALLE models have low Intelligibility Scores (possibly due to "babbling" or early stop tokens), while NaturalSpeech2 has low Speaker and Prosody scores.
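To make the point about MetaVoice-1B concrete (a weak overall rank alongside a strong per-factor rank), here is a minimal sketch. The system names and numbers are entirely made up and are not benchmark results, and treating the overall score as the unweighted mean of the factor scores is an assumption for illustration only.

```python
from statistics import mean

# Hypothetical factor scores for made-up systems; NOT real benchmark results.
scores = {
    "system_a": {"intelligibility": 90.0, "speaker": 70.0, "prosody": 80.0, "environment": 60.0},
    "system_b": {"intelligibility": 40.0, "speaker": 85.0, "prosody": 75.0, "environment": 65.0},
    "system_c": {"intelligibility": 85.0, "speaker": 60.0, "prosody": 70.0, "environment": 90.0},
}


def rank_by(key_fn):
    """Return system names sorted from best to worst by key_fn(factor_scores)."""
    return sorted(scores, key=lambda name: key_fn(scores[name]), reverse=True)


def overall(factors):
    # Assumption: the overall score is the unweighted mean of the factor scores.
    return mean(factors.values())


overall_ranking = rank_by(overall)
speaker_ranking = rank_by(lambda factors: factors["speaker"])

print("Overall:", overall_ranking)
print("Speaker:", speaker_ranking)
```

In this toy data, `system_b` comes last overall because of its low intelligibility value, yet it ranks first on the speaker factor, which is the same pattern described in the takeaway above.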
Since new TTS (Text-to-Speech) systems are coming out what feels like every day, and it's currently hard to compare them, my latest project has focused on doing just that.
I was inspired by the TTS-AGI/TTS-Arena (definitely check it out if you haven't), which compares recent TTS systems using crowdsourced A/B testing.
I wanted to see if we can also do a similar evaluation with objective metrics, and it's now available here: ttsds/benchmark. Anyone can submit a new TTS model, and I hope this can provide a way to see which areas models perform well or poorly in.