Emotional Profile
This page documents a phenomenon: StyleTTS2 produces affective TTS voices when it is given 4x speed synthetic speech as style.
- The demo makes use of this phenomenon.
- StyleTTS2 sounds better when given 4x speed audio as style than when given natural speech at 1x speed.
Audio examples: each style audio and the StyleTTS2 output that uses it as style.
Style audio | StyleTTS2 output |
---|---|
Mimic-3 English - (Harvard) | StyleTTS2 - (Mimic-3 English) |
Mimic-3 English 4x - (Harvard) | StyleTTS2 - (Mimic-3 English 4x) |
Human - (EmoDB) | StyleTTS2 - (EmoDB) |
Human - (LibriSpeech) | StyleTTS2 - (LibriSpeech) |
Mimic-3 Foreign - (Harvard) | StyleTTS2 - (Mimic-3 Foreign) |
Mimic-3 Foreign 4x - (Harvard) | StyleTTS2 - (Mimic-3 Foreign 4x) |
Achieving high naturalness in speech synthesis by cloning synthetic (artificial) speakers might seem counterintuitive. However, natural speech contains environmental noise that degrades voice cloning.
The prosody of TTS can be controlled via an audio file (speaker) or via a text description. Audio files allow precise indication of prosody, but they are polluted by background noise. Text is inherently limited, because words cannot fully describe non-verbal paralinguistics. SHIFT TTS focuses on the zone in between: what if we use synthetic speakers (TTS audio) instead of natural speech to drive a second TTS algorithm (StyleTTS2)?
Our intuition is that voice cloning of synthetic speech is an interesting form of TTS that enables pleasant narration of long texts.
Character Error Rate
Naturalness MOS and CER (%) for StyleTTS2 having human speech (EmoDB) as style vs having synthetic speech (Mimic-3) as style.
Style Audio | Style audio MOS | Style audio CER (%) | StyleTTS2 MOS | StyleTTS2 CER (%) |
---|---|---|---|---|
Mimic-3 English 1x speed | 2.9 | 0.92 | 3.6 | 0.72 |
Mimic-3 English 4x speed | 3.7 | 12.90 | 4.1 | 0.59 |
Mimic-3 Foreign 1x speed | 1.7 | 62.63 | 3.4 | 0.85 |
Mimic-3 Foreign 4x speed | 2.7 | 82.67 | 4.0 | 1.15 |
EmoDB | 4.7 | 77.81 | 4.0 | 1.15 |
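The CER values in the table can be understood through a minimal sketch of the metric. The source does not state which ASR system or which text normalization was used, so the code below only assumes that a transcript of each synthesized sentence is already available; the helper names `normalize`, `edit_distance`, and `cer_percent`, and the lowercase/whitespace normalization, are illustrative assumptions.

```python
def normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer_percent(references, hypotheses) -> float:
    """Corpus-level CER in percent: total edits / total reference characters."""
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(references, hypotheses):
        ref, hyp = normalize(ref), normalize(hyp)
        total_edits += edit_distance(ref, hyp)
        total_chars += len(ref)
    return 100.0 * total_edits / total_chars
```

Here `references` would be the Harvard sentences and `hypotheses` the transcripts of the corresponding audio. Pooling edits over the whole corpus, rather than averaging per-sentence rates, is a design choice of this sketch and not something the source specifies.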
Emotional Profile of StyleTTS2 using TTS styles
We apply Speech Emotion Recognition (SER) to observe the emotional profile of StyleTTS2 when synthesizing the 720 Harvard sentences. We use two publicly available SER detectors: wav2small for Arousal, Dominance, and Valence (A/D/V), and WavLM MSP for categorical emotions. WavLM outputs probabilities for the emotion categories Happy, Anger, Sad, Fear, and Disgust, among others. Arousal indicates voice excitement, Dominance shows how imposing a voice is, and Valence reveals negativity / positivity.
As the generator of synthetic speech styles, we use the Mimic-3 TTS system, which provides 134 English voices and 204 Foreign voices. Our Artificial StyleTTS2, along with pre-generated styles, is also available at https://audeering.github.io/shift/.
Visualizations
We synthesize the 720 Harvard sentences (in standard order) via StyleTTS2, using five different choices of style audio:
Mimic-3 English Style of 1x or 4x speed
We use styles of 1x or 4x speed to see their effect on prosody manipulation. The speed-up is produced natively by Mimic-3, which avoids the artefacts of post-processing. We synthesize a different style for each Harvard sentence, using a different Mimic-3 voice each time. Audio examples are given above.
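A hedged sketch of how such a style file might be generated with the `mimic3` command-line tool follows. It assumes that speed is controlled through the `--length-scale` option and that a 4x speed corresponds to a length scale of 1/4; the source does not state the exact setting used. The helper names and the example voice are illustrative only.

```python
import subprocess


def mimic3_command(text: str, voice: str, speed: float = 1.0) -> list[str]:
    """Build a mimic3 CLI call; speed 4.0 is mapped to length scale 0.25 (assumption)."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    length_scale = 1.0 / speed  # smaller length scale means faster speech
    return ["mimic3", "--voice", voice,
            "--length-scale", f"{length_scale:.4g}", text]


def synthesize_style(text: str, voice: str, speed: float, wav_path: str) -> None:
    """Run mimic3 (must be installed) and save the WAV it writes to stdout."""
    result = subprocess.run(mimic3_command(text, voice, speed),
                            check=True, capture_output=True)
    with open(wav_path, "wb") as f:
        f.write(result.stdout)
```

For example, `synthesize_style("The birch canoe slid on the smooth planks.", "en_US/vctk_low", 4.0, "style_4x.wav")` would write one 4x speed style file, assuming that voice is installed.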
Mimic-3 Foreign Style of 1x or 4x speed
The Foreign voices of Mimic-3 can pronounce the Harvard sentences, although with an accent. We use them as styles from diverse languages not seen during the training of StyleTTS2. Again, we generate foreign styles at 1x and 4x speed.
Natural Speech style
The grey shadow shows the emotion levels of StyleTTS2 when using natural speech styles taken from EmoDB. EmoDB is a corpus of (acted) highly expressive, noise-free natural speech.
Figures
Notice the increase in emotion probabilities when StyleTTS2 is fed a 4x speed style .wav, generated by speeding up Mimic-3 TTS.
Figures above show the probability of emotions as well as the level of Arousal/Dominance/Valence detected at the output of StyleTTS2 over the course of 720 Harvard sentences, with different styles.
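Curves like these can be compared with a small amount of post-processing. The sketch below assumes that an SER detector has already produced one value per sentence (for example valence) for each style; `moving_average` and `fraction_higher` are hypothetical helper names, and the smoothing window is an arbitrary choice, not one taken from the source.

```python
def moving_average(values, window):
    """Trailing moving average; early points average over fewer samples."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the sample leaving the window
        out.append(running / min(i + 1, window))
    return out


def fraction_higher(a, b):
    """Fraction of sentences for which style `a` scores strictly above style `b`."""
    if len(a) != len(b):
        raise ValueError("both styles must cover the same sentences")
    return sum(x > y for x, y in zip(a, b)) / len(a)
```

With per-sentence valence lists for two styles, `moving_average(valence, 20)` gives a smoothed curve for plotting, and `fraction_higher(valence_a, valence_b)` gives the share of sentences where the first style scores higher.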
The emotion probabilities appear similar across styles because the text is the same: text sentiment overwhelms the SER detectors. This raises the question: does emotion look at text or voice?
Higher MOS via TTS than natural speech: StyleTTS2 achieves MOS = 4.1 and a lower CER = 0.59% by using Mimic-3 English 4x speed TTS audio as style rather than natural speech; the natural speech styles EmoDB or LibriSpeech show worse MOS. Subtle differences between the left and right figures, such as the rise of valence, show the tonality variation brought by the 4x speed style. Different styles yield slightly different durations, which causes a misalignment of the grey and blue lines. The high MOS goes together with high Valence and Happy probability (blue line).
StyleTTS2 is not affected by the intelligibility of the style: it achieves a very low CER = 1.15% for natural speech and an even lower CER = 0.59% for Mimic-3 English 4x speed. Disgust / Anger is diminished when using Mimic-3 4x speed styles. Valence and Happiness for StyleTTS2 via Mimic-3 (blue line at 4x speed) are almost always higher than for StyleTTS2 via EmoDB (natural speech), irrespective of the language.
The actual words in the style audio do not affect the valence of StyleTTS2. However, extra punctuation placed in the text of the style, such as "...!!!;", makes Mimic-3 produce unnatural noises such as "hah/hiss/scratch" sounds. When those noises are fed to StyleTTS2, they trigger it to generate audible backchannels, such as sighs and breaths, that are pleasant to hear. Listen to the audio examples above! We run StyleTTS2 using the default style embedding and pitch curve calculation. Mimic-3 styles are also synthesized using the default settings of Mimic-3, except for the variation of speed to 1x / 4x.
Conclusion
We discovered that synthetic speech styles amplify valence and the happy emotion in the output of StyleTTS2. We also found that an accelerated Mimic-3 synthetic speech style increases MOS to 4.1, compared with the MOS = 4.0 achieved with a human speech style.