Finetuned Voices Not Working for Me - Docker
First of all you are a legend and actually bringing a dream of mine to life so thanks for your work!
The default voice works great for me but I am struggling to get the fine tuned models you have available to work. I tried pasting the link as well as downloading and adding in the files manually. But it either gets stuck at the final stage (never produces an output) or just uses the default voice. Should I be uploading a WAV file to the target voice file section as well as adding the appropriate files to the custom model section?
Thanks!
Docker - Windows 10 - RTX 3090
Yes, you should be uploading a WAV file to the target voice file section as well as adding the model files to the custom model section.
Use the included ref.wav as the target voice file:
https://huggingface.co/drewThomasson/Xtts-Finetune-Bryan-Cranston/blob/main/V2_Xtts-Finetune-Bryan-Cranston/ref_audio_for_v2.wav
The ref.wav is needed because XTTS is a voice-cloning model, and fine-tuning it on a voice just makes it a LOT better at cloning that specific voice.
When using this model, paste the direct download link to Finished_model_files.zip into the link field of ebook2audiobookxtts.
For example, for the V2 model this is the link to paste:
https://huggingface.co/drewThomasson/Xtts-Finetune-Bryan-Cranston/resolve/main/V2_Xtts-Finetune-Bryan-Cranston/Finished_model_files.zip?download=true
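If you only have the regular file page link (the one with /blob/ in it), here is a rough sketch of how it maps to the direct download form shown above. The helper name to_direct_download is made up for illustration and is not part of ebook2audiobookxtts; it just swaps /blob/ for /resolve/ and adds the ?download=true query seen in the link above.

```python
from urllib.parse import urlsplit, urlunsplit


def to_direct_download(url: str) -> str:
    """Turn a Hugging Face file page URL (/blob/) into a direct download URL (/resolve/).

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # Only the first "/blob/" segment is the page marker; leave the rest of the path alone.
    path = parts.path.replace("/blob/", "/resolve/", 1)
    return urlunsplit((parts.scheme, parts.netloc, path, "download=true", ""))


page = (
    "https://huggingface.co/drewThomasson/Xtts-Finetune-Bryan-Cranston/"
    "blob/main/V2_Xtts-Finetune-Bryan-Cranston/Finished_model_files.zip"
)
print(to_direct_download(page))
```

A link that already uses /resolve/ passes through with the same path, so it should be safe to run on either form.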
Thank you!!
Also, what did you mean by "But it either gets stuck at the final stage (never produces an output)"?
Like, does it get stuck at the final combining audiobook stage?
Do you not see the output in the web GUI when you click the "Download Audiobook Files" button?
If you're trying to generate the same book again without re-launching the Docker image to reset it, then:
You might need to check the terminal to see what it's outputting, because it might be asking whether you want to overwrite the old audiobook with the new one.
You'll have to respond with y/n in the terminal if that's the case lol
If you don't tell it what to do then it'll just keep waiting for your response forever lol
That's correct, the "90%" portion, I forget exactly what the terminal was saying at that point but it was consistent. No output in the web gui
lol It wasn't giving the over-write prompt (I got that one before). I will let you know if it happens again.
Hm, well if you get the "getting stuck" issue again then send a screenshot of what the web GUI looks like and what the terminal looks like, or what it's saying is going on lol
Well I can confirm that the model works in it on my end at least lol
Just used it to make Bryan Cranston read the short story "The Tell-Tale Heart"
Working now!
Any reason why some voices are more consistent than others? Bryan Cranston seems to have fewer defects than David Walliams, and the default female speaker has the fewest hallucinations of all three.
Is there a speaker you know of that has the most reliable output? I am using the same text as input btw (1 paragraph)
HMMMMM I mostly forgot tbh
But it also depends on the input ref audio, idk, noise and stuff in it I guess?
It's mostly an art of just messing around with them
Perhaps in the future I'll find some automated way to determine how much each model hallucinates, using Whisper and such, and put those ratings on the model README
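A rough sketch of what that rating could look like, assuming you already have a transcript of the generated audio (for example from Whisper; the transcription step itself is not shown here). It scores the transcript against the input text with a plain word error rate, so a higher number would suggest more dropped, added, or garbled words. The function names are made up for illustration.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase and split into words so punctuation and casing don't count as errors."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    # Standard dynamic-programming edit distance, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from the transcript
                            curr[j - 1] + 1,      # extra word in the transcript
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


book_text = "True! Nervous, very, very dreadfully nervous I had been and am."
transcript = "true nervous very dreadfully nervous i had been and am"  # placeholder transcript
print(word_error_rate(book_text, transcript))
```

Running the same paragraph through each voice and comparing the scores would give a crude ranking. It won't catch everything (a clean-sounding wrong word and a loud artifact both count as one error), so treat it as a starting point.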
The David Attenborough one is pretty good also lol
Example: David reading "The Tell-Tale Heart"
Also:
If you go into the audio_generation_settings in the Gradio interface you can turn the temperature all the way down, and that should make it hallucinate less.
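For anyone wondering why temperature would matter: in general, sampling temperature rescales the model's scores before a token is picked, so a lower value concentrates probability on the most likely choice and leaves less room for odd picks. The sketch below is a generic illustration of that idea with made-up scores, not the actual XTTS sampling code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(scores, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, nearly all the probability ends up on the first candidate, which is why output tends to get more stable (and a bit less varied).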
Nice, yeah from the ones I have tested, Attenborough and the original voice seem to generate the best results.
Any correlation between model size and quality I wonder?
Thanks for the tip on adjusting the temperature, my next step is to mess around with all of those settings!
What do you mean by model size?
All of these are just fine-tuned XTTS v2 models, the parameter count never changes lol
Oh my bad, I meant the dataset size. It seemed like there was a relationship between the dataset size and the quality of the output, but that could be completely coincidental. The quality of the data might be more important than the size. But the issues I was running into seemed different from having "bad" audio for training. I would much rather have subpar audio than glitched audio (hallucinations/artifacts), which was the issue I was running into.
Is the base model (the default one) xtts v2 without any fine tuning? Or are you applying a fine tune to that one?
The base model isn't fine-tuned, it's just normal XTTS v2.
The quality of a fine-tuned model appears to be directly associated with:
Dataset size: a max of 40 minutes is what I have found to be good. Once you have training datasets larger than that for a single voice, the model becomes "overfit", leading to more weird hallucinations.
Dataset quality: make sure the audio is clean, with no background sounds and no noise, so denoise it.
Dataset quality volume-wise: it seems that normalizing the training input also helps, because if the volume fluctuates too much on the voice you're training it on then the model gets issues from that.
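A quick way to sanity-check a dataset against those points, as a minimal sketch: it assumes the training clips are plain PCM .wav files in one folder (Python's built-in wave module can't read other formats), and the 40-minute cap is just the rule of thumb from above. The function names are made up for illustration, and the normalization shown is simple peak normalization on samples already loaded as floats, which is only one of several ways to even out volume.

```python
import wave
from pathlib import Path

MAX_MINUTES = 40  # rule of thumb from this thread, not a hard limit


def wav_seconds(path) -> float:
    """Length of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()


def dataset_minutes(folder) -> float:
    """Total length of every .wav file directly inside the folder, in minutes."""
    return sum(wav_seconds(p) for p in sorted(Path(folder).glob("*.wav"))) / 60


def peak_normalize(samples, target_peak=0.9):
    """Scale float samples so the loudest one sits at target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return list(samples)  # silence (or empty input): nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


if __name__ == "__main__":
    folder = "my_voice_dataset"  # hypothetical folder name
    if Path(folder).is_dir():
        minutes = dataset_minutes(folder)
        status = "over" if minutes > MAX_MINUTES else "within"
        print(f"{minutes:.1f} min of audio, {status} the {MAX_MINUTES} min rule of thumb")
```

Denoising is left out here since it needs a dedicated tool; this only covers the size and volume checks.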