|
--- |
|
license: cc-by-nc-4.0 |
|
datasets: |
|
- facebook/multilingual_librispeech |
|
- parler-tts/libritts_r_filtered |
|
- amphion/Emilia-Dataset |
|
language: |
|
- en |
|
- zh |
|
- ja |
|
- ko |
|
pipeline_tag: text-to-speech |
|
--- |
|
<style> |
|
table { |
|
border-collapse: collapse; |
|
width: 100%; |
|
margin-bottom: 20px; |
|
} |
|
th, td { |
|
border: 1px solid #ddd; |
|
padding: 8px; |
|
text-align: center; |
|
} |
|
.best { |
|
font-weight: bold; |
|
text-decoration: underline; |
|
} |
|
.box { |
|
text-align: center; |
|
margin: 20px auto; |
|
padding: 30px; |
|
box-shadow: 0px 0px 20px 10px rgba(0, 0, 0, 0.05), 0px 1px 3px 10px rgba(255, 255, 255, 0.05); |
|
border-radius: 10px; |
|
} |
|
.badges { |
|
display: flex; |
|
justify-content: center; |
|
gap: 10px; |
|
flex-wrap: wrap; |
|
margin-top: 10px; |
|
} |
|
.badge { |
|
text-decoration: none; |
|
display: inline-block; |
|
padding: 4px 8px; |
|
border-radius: 5px; |
|
color: #fff; |
|
font-size: 12px; |
|
font-weight: bold; |
|
width: 250px; |
|
} |
|
.badge-hf-blue { |
|
background-color: #767b81; |
|
} |
|
.badge-hf-pink { |
|
background-color: #7b768a; |
|
} |
|
.badge-github { |
|
background-color: #2c2b2b; |
|
} |
|
</style> |
|
|
|
<div class="box"> |
|
<div style="margin-bottom: 20px;"> |
|
<h2 style="margin-bottom: 4px; margin-top: 0px;">OuteAI</h2> |
|
<a href="https://www.outeai.com/" target="_blank" style="margin-right: 10px;">π OuteAI.com</a> |
|
<a href="https://discord.gg/vyBM87kAmf" target="_blank" style="margin-right: 10px;">π€ Join our Discord</a> |
|
<a href="https://x.com/OuteAI" target="_blank">π @OuteAI</a> |
|
</div> |
|
<div class="badges"> |
|
<a href="https://huggingface.co/OuteAI/OuteTTS-0.2-500M" target="_blank" class="badge badge-hf-blue">π€ Hugging Face - OuteTTS 0.2 500M</a> |
|
<a href="https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF" target="_blank" class="badge badge-hf-blue">π€ Hugging Face - OuteTTS 0.2 500M GGUF</a> |
|
<a href="https://huggingface.co/spaces/OuteAI/OuteTTS-0.2-500M-Demo" target="_blank" class="badge badge-hf-pink">π€ Hugging Face - Demo Space</a> |
|
<a href="https://github.com/edwko/OuteTTS" target="_blank" class="badge badge-github">GitHub - OuteTTS</a> |
|
</div> |
|
</div> |
|
|
|
## Model Description |
|
|
|
OuteTTS-0.2-500M is the improved successor to our v0.1 release.

The model retains the same audio-prompt approach, with no architectural changes to the foundation model itself.

Built on Qwen-2.5-0.5B, this version was trained on larger and more diverse datasets, yielding significant improvements across all aspects of performance.
|
|
|
## Key Improvements |
|
|
|
- **Enhanced Accuracy**: Significantly improved prompt following and output coherence compared to the previous version |
|
- **Natural Speech**: Produces more natural and fluid speech synthesis |
|
- **Expanded Vocabulary**: Trained on over 5 billion audio prompt tokens |
|
- **Voice Cloning**: Improved voice cloning capabilities with greater diversity and accuracy |
|
- **Multilingual Support**: New experimental support for Chinese, Japanese, and Korean languages |
|
|
|
## Speech Demo |
|
|
|
<video width="1280" height="720" controls> |
|
<source src="https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF/resolve/main/media/demo.mp4" type="video/mp4"> |
|
Your browser does not support the video tag. |
|
</video> |
|
|
|
## Usage |
|
|
|
### Installation |
|
|
|
[![GitHub](https://img.shields.io/badge/GitHub-OuteTTS-181717?logo=github)](https://github.com/edwko/OuteTTS) |
|
|
|
```bash |
|
pip install outetts |
|
``` |
|
|
|
### Interface Usage |
|
|
|
```python |
|
import outetts |
|
|
|
# Configure the model |
|
model_config = outetts.HFModelConfig_v1( |
|
model_path="OuteAI/OuteTTS-0.2-500M", |
|
language="en", # Supported languages in v0.2: en, zh, ja, ko |
|
) |
|
|
|
# Initialize the interface |
|
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config) |
|
|
|
# Optional: Create a speaker profile (use a 10-15 second audio clip) |
|
# speaker = interface.create_speaker( |
|
# audio_path="path/to/audio/file", |
|
# transcript="Transcription of the audio file." |
|
# ) |
|
|
|
# Optional: Save and load speaker profiles |
|
# interface.save_speaker(speaker, "speaker.pkl") |
|
# speaker = interface.load_speaker("speaker.pkl") |
|
|
|
# Optional: Load speaker from default presets |
|
interface.print_default_speakers() |
|
speaker = interface.load_default_speaker(name="male_1") |
|
|
|
output = interface.generate( |
|
text="Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and it can be implemented in software or hardware products.", |
|
# Lower temperature values may result in a more stable tone, |
|
# while higher values can introduce varied and expressive speech |
|
temperature=0.1, |
|
repetition_penalty=1.1, |
|
max_length=4096, |
|
|
|
# Optional: Use a speaker profile for consistent voice characteristics |
|
# Without a speaker profile, the model will generate a voice with random characteristics |
|
speaker=speaker, |
|
) |
|
|
|
# Save the synthesized speech to a file |
|
output.save("output.wav") |
|
|
|
# Optional: Play the synthesized speech |
|
# output.play() |
|
``` |
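Because `max_length` is capped at 4096 tokens, very long inputs may need to be split and synthesized in pieces. The sketch below is a hypothetical helper (not part of the `outetts` API) that splits text on sentence boundaries into size-bounded chunks; each chunk can then be passed to `interface.generate()` as shown above.

```python
import re

def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Split text into sentence-aligned chunks of at most max_chars characters.

    Hypothetical helper for batching long inputs; not part of the outetts API.
    """
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk if appending this sentence would exceed the limit.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Each chunk would then be synthesized separately, e.g.:
# for i, chunk in enumerate(chunk_text(long_text)):
#     interface.generate(text=chunk, temperature=0.1, speaker=speaker).save(f"part_{i}.wav")
```

Reusing the same speaker profile across chunks keeps the voice consistent from one segment to the next.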
|
|
|
## Using the GGUF Model
|
|
|
```python |
import outetts

|
# Configure the GGUF model |
|
model_config = outetts.GGUFModelConfig_v1( |
|
model_path="local/path/to/model.gguf", |
|
language="en", # Supported languages in v0.2: en, zh, ja, ko |
|
n_gpu_layers=0, |
|
) |
|
|
|
# Initialize the GGUF interface |
|
interface = outetts.InterfaceGGUF(model_version="0.2", cfg=model_config) |
|
``` |
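Assuming generation with the GGUF interface follows the same pattern as the HF interface above (the `generate()` and `save()` calls shown earlier), a minimal continuation might look like this:

```python
# Load a bundled speaker preset, as with the HF interface.
speaker = interface.load_default_speaker(name="male_1")

output = interface.generate(
    text="Hello, this is a test of the GGUF model.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
)

# Save the synthesized speech to a file.
output.save("output_gguf.wav")
```

Setting `n_gpu_layers` above 0 in the config offloads that many layers to the GPU, which speeds up generation when a supported GPU build of the GGUF backend is available.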
|
|
|
## Model Specifications |
|
- **Base Model**: Qwen-2.5-0.5B |
|
- **Parameter Count**: 500M |
|
- **Language Support**: |
|
- Primary: English |
|
- Experimental: Chinese, Japanese, Korean |
|
- **License**: CC BY-NC 4.0
|
|
|
## Training Datasets |
|
- Emilia-Dataset (CC BY-NC 4.0)
|
- LibriTTS-R (CC BY 4.0) |
|
- Multilingual LibriSpeech (MLS) (CC BY 4.0) |
|
|
|
## Credits & References |
|
- [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) |
|
- [CTC Forced Alignment](https://pytorch.org/audio/stable/tutorials/ctc_forced_alignment_api_tutorial.html) |
|
- [Qwen-2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) |