Using the Coqui-AI TTS repo (https://github.com/coqui-ai/TTS), the default Spanish VITS model is distributed in .tar format, so I used the fine-tuning functionality but with only 1 epoch and all learning rates set to 0.0:

- Clone the repo and follow the installation instructions, currently: cd /repo/path && pip install -e .[all,dev,notebooks]
- Generate any text with the Spanish VITS model so that it gets downloaded, currently: tts --model_name tts_models/es/css10/vits --text "hola hola hola" --out_path /anywhere/
- Go to the directory of the downloaded model, normally under ~/.local/share/tts/, and copy config.json and model.pth.tar
- Customize config.json to train only 1 epoch with all learning rates at 0.0 (including lr_gen and lr_disc), and fill in the settings for your dataset. Even though you don't actually want to train, you need some dataset to act as if it were going to train.
- Fine-tune, currently: CUDA_VISIBLE_DEVICES="0" python /path/to/repo/TTS/TTS/bin/train_tts.py --config_path /path/to/custom/config/config.json --restore_path /path/to/model/model_file.pth.tar --use_cuda True

Some troubles I found and how I solved them:

- If training throws an error like "Vits has no disc", set initialize_disc to true in config.json.
- Training always needs at least one file for evaluation, so set eval_split_size so that it yields at least one file; for me it crashed when using no evaluation set at all.
- If it throws an error about running out of data, check the data filters in config.json, such as the min and max length settings. It could be that all your audio and/or text data is being filtered out.
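To make the config edits concrete, here is a minimal Python sketch that patches a downloaded config.json in place. The field names used (epochs, lr, lr_gen, lr_disc, eval_split_size) are assumptions based on the Coqui VITS config; check your own config.json for the exact keys your version uses.

```python
import json

def patch_config(path):
    """Patch a Coqui VITS config.json for a zero-learning, 1-epoch 'fine-tune'.

    Key names are assumptions based on the Coqui VITS config; verify them
    against the actual config.json shipped with your model.
    """
    with open(path) as f:
        cfg = json.load(f)

    cfg["epochs"] = 1           # run a single epoch only
    cfg["lr"] = 0.0             # base learning rate: no weight updates
    cfg["lr_gen"] = 0.0         # generator learning rate
    cfg["lr_disc"] = 0.0        # discriminator learning rate
    # Ensure at least one evaluation sample, since training crashes with an
    # empty eval split; check whether your TTS version treats this value as
    # a fraction of the dataset or an absolute sample count.
    cfg["eval_split_size"] = 1

    with open(path, "w") as f:
        json.dump(cfg, f, indent=4)
    return cfg
```

Run it as patch_config("/path/to/custom/config/config.json") before launching train_tts.py; the dataset-specific fields still have to be filled in by hand.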