|
--- |
|
license: cc-by-nc-4.0 |
|
datasets: |
|
- facebook/multilingual_librispeech |
|
- parler-tts/libritts_r_filtered |
|
- amphion/Emilia-Dataset |
|
language: |
|
- en |
|
- zh |
|
- ja |
|
- ko |
|
pipeline_tag: text-to-speech |
|
--- |
|
<style> |
|
table { |
|
border-collapse: collapse; |
|
width: 100%; |
|
margin-bottom: 20px; |
|
} |
|
th, td { |
|
border: 1px solid #ddd; |
|
padding: 8px; |
|
text-align: center; |
|
} |
|
.best { |
|
font-weight: bold; |
|
text-decoration: underline; |
|
} |
|
.box { |
|
text-align: center; |
|
margin: 20px auto; |
|
padding: 30px; |
|
box-shadow: 0px 0px 20px 10px rgba(0, 0, 0, 0.05), 0px 1px 3px 10px rgba(255, 255, 255, 0.05); |
|
border-radius: 10px; |
|
} |
|
.badges { |
|
display: flex; |
|
justify-content: center; |
|
gap: 10px; |
|
flex-wrap: wrap; |
|
margin-top: 10px; |
|
} |
|
.badge { |
|
text-decoration: none; |
|
display: inline-block; |
|
padding: 4px 8px; |
|
border-radius: 5px; |
|
color: #fff; |
|
font-size: 12px; |
|
font-weight: bold; |
|
width: 250px; |
|
} |
|
.badge-hf-blue { |
|
background-color: #767b81; |
|
} |
|
.badge-hf-pink { |
|
background-color: #7b768a; |
|
} |
|
.badge-github { |
|
background-color: #2c2b2b; |
|
} |
|
</style> |
|
|
|
<div class="box"> |
|
<div style="margin-bottom: 20px;"> |
|
<h2 style="margin-bottom: 4px; margin-top: 0px;">OuteAI</h2> |
|
<a href="https://www.outeai.com/" target="_blank" style="margin-right: 10px;">π OuteAI.com</a> |
|
<a href="https://discord.gg/vyBM87kAmf" target="_blank" style="margin-right: 10px;">π€ Join our Discord</a> |
|
<a href="https://x.com/OuteAI" target="_blank">π @OuteAI</a> |
|
</div> |
|
<div class="badges"> |
|
<a href="https://huggingface.co/OuteAI/OuteTTS-0.2-500M" target="_blank" class="badge badge-hf-blue">π€ Hugging Face - OuteTTS 0.2 500M</a> |
|
<a href="https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF" target="_blank" class="badge badge-hf-blue">π€ Hugging Face - OuteTTS 0.2 500M GGUF</a> |
|
<a href="https://huggingface.co/spaces/OuteAI/OuteTTS-0.2-500M-Demo" target="_blank" class="badge badge-hf-pink">π€ Hugging Face - Demo Space</a> |
|
<a href="https://github.com/edwko/OuteTTS" target="_blank" class="badge badge-github">GitHub - OuteTTS</a> |
|
</div> |
|
</div> |
|
|
|
## Model Description |
|
|
|
OuteTTS-0.2-500M is the improved successor to our v0.1 release.

The model retains the same audio-prompt approach, with no architectural changes to the foundation model itself.

Built on Qwen-2.5-0.5B, this version was trained on larger and more diverse datasets, yielding significant improvements across all aspects of performance.
|
|
|
## Key Improvements |
|
|
|
- **Enhanced Accuracy**: Significantly improved prompt following and output coherence compared to the previous version |
|
- **Natural Speech**: Produces more natural and fluid speech synthesis |
|
- **Expanded Vocabulary**: Trained on over 5 billion audio prompt tokens |
|
- **Voice Cloning**: Improved voice cloning capabilities with greater diversity and accuracy |
|
- **Multilingual Support**: New experimental support for Chinese, Japanese, and Korean languages |
|
|
|
## Speech Demo |
|
|
|
<video width="1280" height="720" controls> |
|
<source src="https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF/resolve/main/media/demo.mp4" type="video/mp4"> |
|
Your browser does not support the video tag. |
|
</video> |
|
|
|
## Usage |
|
|
|
### Installation |
|
|
|
[![GitHub](https://img.shields.io/badge/GitHub-OuteTTS-181717?logo=github)](https://github.com/edwko/OuteTTS) |
|
|
|
```bash |
|
pip install outetts |
|
``` |
|
|
|
### Interface Usage |
|
|
|
```python |
|
import outetts |
|
|
|
# Configure the model |
|
model_config = outetts.HFModelConfig_v1( |
|
model_path="OuteAI/OuteTTS-0.2-500M", |
|
language="en", # Supported languages in v0.2: en, zh, ja, ko |
|
) |
|
|
|
# Initialize the interface |
|
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config) |
|
|
|
# Optional: Create a speaker profile (use a 10-15 second audio clip) |
|
# speaker = interface.create_speaker( |
|
# audio_path="path/to/audio/file", |
|
# transcript="Transcription of the audio file." |
|
# ) |
|
|
|
# Optional: Save and load speaker profiles |
|
# interface.save_speaker(speaker, "speaker.pkl") |
|
# speaker = interface.load_speaker("speaker.pkl") |
|
|
|
# Optional: Load speaker from default presets |
|
interface.print_default_speakers() |
|
speaker = interface.load_default_speaker(name="male_1") |
|
|
|
output = interface.generate( |
|
text="Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and it can be implemented in software or hardware products.", |
|
# Lower temperature values may result in a more stable tone, |
|
# while higher values can introduce varied and expressive speech |
|
temperature=0.1, |
|
repetition_penalty=1.1, |
|
max_length=4096, |
|
|
|
# Optional: Use a speaker profile for consistent voice characteristics |
|
# Without a speaker profile, the model will generate a voice with random characteristics |
|
speaker=speaker, |
|
) |
|
|
|
# Save the synthesized speech to a file |
|
output.save("output.wav") |
|
|
|
# Optional: Play the synthesized speech |
|
# output.play() |
|
``` |
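Because `max_length` is capped at 4096 tokens, very long inputs may need to be split and synthesized in pieces. The sketch below is a hypothetical helper (not part of the `outetts` API) that splits text on sentence boundaries into size-bounded chunks; each chunk can then be passed to `interface.generate()` as shown above.

```python
import re

def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Split text into sentence-aligned chunks of at most max_chars characters.

    Hypothetical helper for batching long inputs; not part of the outetts API.
    """
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk if appending this sentence would exceed the limit.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Each chunk would then be synthesized separately, e.g.:
# for i, chunk in enumerate(chunk_text(long_text)):
#     interface.generate(text=chunk, temperature=0.1, speaker=speaker).save(f"part_{i}.wav")
```

Reusing the same speaker profile across chunks keeps the voice consistent from one segment to the next.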
|
|
|
## Using the GGUF Model
|
|
|
```python |
import outetts

|
# Configure the GGUF model |
|
model_config = outetts.GGUFModelConfig_v1( |
|
model_path="local/path/to/model.gguf", |
|
language="en", # Supported languages in v0.2: en, zh, ja, ko |
|
n_gpu_layers=0, |
|
) |
|
|
|
# Initialize the GGUF interface |
|
interface = outetts.InterfaceGGUF(model_version="0.2", cfg=model_config) |
|
``` |
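Assuming generation with the GGUF interface follows the same pattern as the HF interface above (the `generate()` and `save()` calls shown earlier), a minimal continuation might look like this:

```python
# Load a bundled speaker preset, as with the HF interface.
speaker = interface.load_default_speaker(name="male_1")

output = interface.generate(
    text="Hello, this is a test of the GGUF model.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
)

# Save the synthesized speech to a file.
output.save("output_gguf.wav")
```

Setting `n_gpu_layers` above 0 in the config offloads that many layers to the GPU, which speeds up generation when a supported GPU build of the GGUF backend is available.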
|
|
|
## Model Specifications |
|
- **Base Model**: Qwen-2.5-0.5B |
|
- **Parameter Count**: 500M |
|
- **Language Support**: |
|
- Primary: English |
|
- Experimental: Chinese, Japanese, Korean |
|
- **License**: CC BY-NC 4.0
|
|
|
## Training Datasets |
|
- Emilia-Dataset (CC BY-NC 4.0)
|
- LibriTTS-R (CC BY 4.0) |
|
- Multilingual LibriSpeech (MLS) (CC BY 4.0) |
|
|
|
## Credits & References |
|
- [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) |
|
- [CTC Forced Alignment](https://pytorch.org/audio/stable/tutorials/ctc_forced_alignment_api_tutorial.html) |
|
- [Qwen-2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) |