Yuvraj Sharma committed

Commit b2fcaf7 · 1 Parent(s): e78b910

Initial commit of Gradio demo + supporting files

Files changed (4)
  1. README.md +11 -160
  2. app.py +272 -0
  3. packages.txt +2 -0
  4. requirements.txt +5 -0
README.md CHANGED
@@ -1,163 +1,14 @@
  ---
- license: apache-2.0
- language:
- - en
- base_model:
- - yl4579/StyleTTS2-LJSpeech
- pipeline_tag: text-to-speech
  ---
- 📣 Jan 12 Status: Intent to improve the base model https://hf.co/hexgrad/Kokoro-82M/discussions/36
-
- ❤️ Kokoro Discord Server: https://discord.gg/QuGxSWBfQy
-
- 🚨 Got Synthetic Data? Want Trained Voicepacks? See https://hf.co/posts/hexgrad/418806998707773
-
- <audio controls><source src="https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/demo/HEARME.wav" type="audio/wav"></audio>
-
- **Kokoro** is a frontier TTS model for its size of **82 million parameters** (text in/audio out).
-
- On 25 Dec 2024, Kokoro v0.19 weights were permissively released in full fp32 precision under an Apache 2.0 license. As of 2 Jan 2025, 10 unique Voicepacks have been released, and a `.onnx` version of v0.19 is available.
-
- In the weeks leading up to its release, Kokoro v0.19 was the #1🥇 ranked model in [TTS Spaces Arena](https://huggingface.co/hexgrad/Kokoro-82M#evaluation). Kokoro achieved higher Elo in this single-voice Arena setting than other models, using fewer parameters and less data:
- 1. **Kokoro v0.19: 82M params, Apache, trained on <100 hours of audio**
- 2. XTTS v2: 467M, CPML, >10k hours
- 3. Edge TTS: Microsoft, proprietary
- 4. MetaVoice: 1.2B, Apache, 100k hours
- 5. Parler Mini: 880M, Apache, 45k hours
- 6. Fish Speech: ~500M, CC-BY-NC-SA, 1M hours
-
- Kokoro's ability to top this Elo ladder suggests that the scaling law (Elo vs compute/data/params) for traditional TTS models might have a steeper slope than previously expected.
-
- You can find a hosted demo at [hf.co/spaces/hexgrad/Kokoro-TTS](https://huggingface.co/spaces/hexgrad/Kokoro-TTS).
-
- ### Usage
-
- The following can be run in a single cell on [Google Colab](https://colab.research.google.com/).
- ```py
- # 1️⃣ Install dependencies silently
- !git lfs install
- !git clone https://huggingface.co/hexgrad/Kokoro-82M
- %cd Kokoro-82M
- !apt-get -qq -y install espeak-ng > /dev/null 2>&1
- !pip install -q phonemizer torch transformers scipy munch
-
- # 2️⃣ Build the model and load the default voicepack
- from models import build_model
- import torch
- device = 'cuda' if torch.cuda.is_available() else 'cpu'
- MODEL = build_model('kokoro-v0_19.pth', device)
- VOICE_NAME = [
-     'af', # Default voice is a 50-50 mix of Bella & Sarah
-     'af_bella', 'af_sarah', 'am_adam', 'am_michael',
-     'bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis',
-     'af_nicole', 'af_sky',
- ][0]
- VOICEPACK = torch.load(f'voices/{VOICE_NAME}.pt', weights_only=True).to(device)
- print(f'Loaded voice: {VOICE_NAME}')
-
- # 3️⃣ Call generate, which returns 24 kHz audio and the phonemes used
- from kokoro import generate
- text = "How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born."
- audio, out_ps = generate(MODEL, text, VOICEPACK, lang=VOICE_NAME[0])
- # Language is determined by the first letter of the VOICE_NAME:
- # 🇺🇸 'a' => American English => en-us
- # 🇬🇧 'b' => British English => en-gb
-
- # 4️⃣ Display the 24 kHz audio and print the output phonemes
- from IPython.display import display, Audio
- display(Audio(data=audio, rate=24000, autoplay=True))
- print(out_ps)
- ```
- If you have trouble with `espeak-ng`, see this [GitHub issue](https://github.com/bootphon/phonemizer/issues/44#issuecomment-1540885186). [Mac users also see this](https://huggingface.co/hexgrad/Kokoro-82M/discussions/12#677435d3d8ace1de46071489), and [Windows users see this](https://huggingface.co/hexgrad/Kokoro-82M/discussions/12#67742594fdeebf74f001ecfc).
-
- For ONNX usage, see [#14](https://huggingface.co/hexgrad/Kokoro-82M/discussions/14).
-
- ### Model Facts
-
- No affiliation can be assumed between parties on different lines.
-
- **Architecture:**
- - StyleTTS 2: https://arxiv.org/abs/2306.07691
- - ISTFTNet: https://arxiv.org/abs/2203.02395
- - Decoder only: no diffusion, no encoder release
-
- **Architected by:** Li et al @ https://github.com/yl4579/StyleTTS2
-
- **Trained by:** `@rzvzn` on Discord
-
- **Supported Languages:** American English, British English
-
- **Model SHA256 Hash:** `3b0c392f87508da38fad3a2f9d94c359f1b657ebd2ef79f9d56d69503e470b0a`
-
- ### Releases
- - 25 Dec 2024: Model v0.19, `af_bella`, `af_sarah`
- - 26 Dec 2024: `am_adam`, `am_michael`
- - 28 Dec 2024: `bf_emma`, `bf_isabella`, `bm_george`, `bm_lewis`
- - 30 Dec 2024: `af_nicole`
- - 31 Dec 2024: `af_sky`
- - 2 Jan 2025: ONNX v0.19 `ebef4245`
-
- ### Licenses
- - Apache 2.0 weights in this repository
- - MIT inference code in [spaces/hexgrad/Kokoro-TTS](https://huggingface.co/spaces/hexgrad/Kokoro-TTS) adapted from [yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
- - GPLv3 dependency in [espeak-ng](https://github.com/espeak-ng/espeak-ng)
-
- The inference code was originally MIT licensed by the paper author. Note that this card applies only to this model, Kokoro. Original models published by the paper author can be found at [hf.co/yl4579](https://huggingface.co/yl4579).
-
- ### Evaluation
-
- **Metric:** Elo rating
-
- **Leaderboard:** [hf.co/spaces/Pendrokar/TTS-Spaces-Arena](https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena)
-
- ![TTS-Spaces-Arena-25-Dec-2024](demo/TTS-Spaces-Arena-25-Dec-2024.png)
-
- The voice ranked in the Arena is a 50-50 mix of Bella and Sarah. For your convenience, this mix is included in this repository as `af.pt`, but you can trivially reproduce it like this:
-
- ```py
- import torch
- bella = torch.load('voices/af_bella.pt', weights_only=True)
- sarah = torch.load('voices/af_sarah.pt', weights_only=True)
- af = torch.mean(torch.stack([bella, sarah]), dim=0)
- assert torch.equal(af, torch.load('voices/af.pt', weights_only=True))
- ```
-
- ### Training Details
-
- **Compute:** Kokoro v0.19 was trained on A100 80GB vRAM instances for approximately 500 total GPU hours. The average cost per GPU hour was around $0.80, so the total cost was around $400.
-
- **Data:** Kokoro was trained exclusively on **permissive/non-copyrighted audio data** and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
- - Public domain audio
- - Audio licensed under Apache, MIT, etc.
- - Synthetic audio<sup>[1]</sup> generated by closed<sup>[2]</sup> TTS models from large providers<br/>
- [1] https://copyright.gov/ai/ai_policy_guidance.pdf<br/>
- [2] No synthetic audio from open TTS models or "custom voice clones"
-
- **Epochs:** Less than **20 epochs**
-
- **Total Dataset Size:** Less than **100 hours** of audio
-
- ### Limitations
-
- Kokoro v0.19 is limited in some specific ways, due to its training set and/or architecture:
- - [Data] Lacks voice cloning capability, likely due to the small <100h training set
- - [Arch] Relies on external g2p (espeak-ng), which introduces a class of g2p failure modes
- - [Data] Training dataset is mostly long-form reading and narration, not conversation
- - [Arch] At 82M params, Kokoro almost certainly loses to a well-trained 1B+ param diffusion transformer, or a many-billion-param MLLM like GPT-4o / Gemini 2.0 Flash
- - [Data] Multilingual capability is architecturally feasible, but training data is mostly English
-
- Refer to the [Philosophy discussion](https://huggingface.co/hexgrad/Kokoro-82M/discussions/5) to better understand these limitations.
-
- **Will the other voicepacks be released?** There is currently no release date scheduled for the other voicepacks, but in the meantime you can try them in the hosted demo at [hf.co/spaces/hexgrad/Kokoro-TTS](https://huggingface.co/spaces/hexgrad/Kokoro-TTS).
-
- ### Acknowledgements
- - [@yl4579](https://huggingface.co/yl4579) for architecting StyleTTS 2
- - [@Pendrokar](https://huggingface.co/Pendrokar) for adding Kokoro as a contender in the TTS Spaces Arena
-
- ### Model Card Contact
-
- `@rzvzn` on Discord. Server invite: https://discord.gg/QuGxSWBfQy
-
- <img src="https://static0.gamerantimages.com/wordpress/wp-content/uploads/2024/08/terminator-zero-41-1.jpg" width="400" alt="kokoro" />
-
- https://terminator.fandom.com/wiki/Kokoro
  ---
+ title: Make Custom Voices With KokoroTTS
+ emoji:
+ colorFrom: blue
+ colorTo: yellow
+ sdk: gradio
+ sdk_version: 5.12.0
+ app_file: app.py
+ pinned: false
+ license: mit
+ short_description: Make Custom Voices With KokoroTTS
  ---
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

app.py ADDED
@@ -0,0 +1,272 @@
+ import gradio as gr
+ import torch
+ import os
+ from kokoro import generate
+ from models import build_model
+
+ # Initialize model and device
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+ MODEL = build_model('kokoro-v0_19.pth', device)
+
+ # Load the voice models
+ voices = {
+     'af': torch.load("voices/af.pt", weights_only=True),
+     'af_bella': torch.load("voices/af_bella.pt", weights_only=True),
+     'af_sarah': torch.load("voices/af_sarah.pt", weights_only=True),
+     'am_adam': torch.load("voices/am_adam.pt", weights_only=True),
+     'am_michael': torch.load("voices/am_michael.pt", weights_only=True),
+     'bf_emma': torch.load("voices/bf_emma.pt", weights_only=True),
+     'bf_isabella': torch.load("voices/bf_isabella.pt", weights_only=True),
+     'bm_george': torch.load("voices/bm_george.pt", weights_only=True),
+     'bm_lewis': torch.load("voices/bm_lewis.pt", weights_only=True),
+     'af_nicole': torch.load("voices/af_nicole.pt", weights_only=True),
+     'af_sky': torch.load("voices/af_sky.pt", weights_only=True)
+ }
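+
+ # Each voicepack is a style tensor of identical shape, which is what makes the
+ # weighted blending below possible, e.g. (illustrative values only):
+ #   mixed = 0.5 * voices['af_bella'] + 0.5 * voices['af_sarah']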
+
+ custom_css = """
+ .container-wrap {
+     display: flex !important;
+     gap: 5px !important;
+ }
+
+ .vert-group {
+     min-width: 80px !important;
+     width: 90px !important;
+     flex: 0 0 auto !important;
+ }
+
+ .vert-group label {
+     white-space: nowrap !important;
+     overflow: visible !important;
+     width: auto !important;
+     font-size: 0.8em !important;
+     transform-origin: left center !important;
+     transform: rotate(0deg) translateX(-50%) !important;
+     position: relative !important;
+     left: 50% !important;
+     display: inline-block !important;
+     text-align: center !important;
+     margin-bottom: 5px !important;
+ }
+
+ .vert-group .wrap label {
+     text-align: center !important;
+     width: 100% !important;
+     display: block !important;
+ }
+
+ .slider_input_container {
+     height: 200px !important;
+     position: relative !important;
+     width: 40px !important;
+     margin: 0 auto !important;
+     overflow: hidden !important;
+ }
+
+ ::-webkit-scrollbar {
+     display: none !important;
+ }
+
+ * {
+     -ms-overflow-style: none !important;
+     scrollbar-width: none !important;
+ }
+
+ .slider_input_container input[type="range"] {
+     position: absolute !important;
+     width: 200px !important;
+     left: -80px !important;
+     top: 100px !important;
+     transform: rotate(90deg) !important;
+ }
+
+ .min_value {
+     position: absolute !important;
+     bottom: 0 !important;
+     left: 10px !important;
+ }
+
+ .max_value {
+     position: absolute !important;
+     top: 0 !important;
+     left: 10px !important;
+ }
+
+ .tab-like-container {
+     transform: scale(0.8) !important;
+ }
+
+ .gradio-row, .gradio-column {
+     background: none !important;
+     border: none !important;
+     min-width: unset !important;
+ }
+ """
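+
+ # The CSS above rotates each slider input 90 degrees so the voice mixer renders
+ # as a bank of vertical faders, and hides scrollbars globally so the rotated
+ # inputs don't spawn overflow scrollbars.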
+
+
+ def parse_voice_formula(formula):
+     """Parse the voice formula string and return the combined voice tensor."""
+     if not formula.strip():
+         raise ValueError("Empty voice formula")
+
+     # Initialize the weighted sum
+     weighted_sum = None
+
+     # Split the formula into terms
+     terms = formula.split('+')
+
+     for term in terms:
+         # Parse each term (format: "0.333 * voice_name")
+         weight, voice_name = term.strip().split('*')
+         weight = float(weight.strip())
+         voice_name = voice_name.strip()
+
+         # Get the voice tensor
+         if voice_name not in voices:
+             raise ValueError(f"Unknown voice: {voice_name}")
+
+         voice_tensor = voices[voice_name]
+
+         # Add to weighted sum
+         if weighted_sum is None:
+             weighted_sum = weight * voice_tensor
+         else:
+             weighted_sum += weight * voice_tensor
+
+     return weighted_sum
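+
+ # Example (illustrative): parse_voice_formula("0.667 * af_bella + 0.333 * af_sky")
+ # returns the single tensor 0.667 * voices['af_bella'] + 0.333 * voices['af_sky'].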
+
+ def get_new_voice(formula):
+     try:
+         # Parse the formula and get the combined voice tensor
+         weighted_voices = parse_voice_formula(formula)
+
+         # Save and load the combined voice
+         torch.save(weighted_voices, "weighted_normalised_voices.pt")
+         VOICEPACK = torch.load("weighted_normalised_voices.pt", weights_only=False).to(device)
+         return VOICEPACK
+     except Exception as e:
+         raise gr.Error(f"Failed to create voice: {str(e)}")
+
+ def text_to_speech(text, formula):
+     try:
+         if not text.strip():
+             raise gr.Error("Please enter some text")
+
+         if not formula.strip():
+             raise gr.Error("Please select at least one voice")
+
+         # Get the combined voice
+         VOICEPACK = get_new_voice(formula)
+
+         # Generate audio
+         audio, phonemes = generate(MODEL, text, VOICEPACK, lang='a')
+         return (24000, audio)
+     except Exception as e:
+         raise gr.Error(f"Failed to generate speech: {str(e)}")
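+
+ # Note: lang='a' pins g2p to American English even when British ('b') voices
+ # are mixed into the formula; the (24000, audio) return value is the
+ # (sample_rate, numpy_array) tuple that gr.Audio(type="numpy") expects.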
+
+
+ with gr.Blocks(css=custom_css, theme="ocean") as demo:
+     with gr.Row(variant="default", equal_height=True, elem_classes="container-wrap"):
+         checkboxes = []
+         sliders = []
+
+         # Define slider configurations
+         slider_configs = [
+             ("af", "af"), ("af_bella", "af_bella"), ("af_sarah", "af_sarah"),
+             ("af_nicole", "af_nicole"), ("af_sky", "af_sky"), ("am_adam", "am_adam"),
+             ("am_michael", "am_michael"), ("bf_emma", "bf_emma"),
+             ("bf_isabella", "bf_isabella"), ("bm_george", "bm_george"),
+             ("bm_lewis", "bm_lewis")
+         ]
+
+         # Create columns for each slider
+         for label, name in slider_configs:
+             with gr.Column(min_width=70, scale=1, variant="default", elem_classes="vert-group"):
+                 checkbox = gr.Checkbox(label='')
+                 slider = gr.Slider(label=name, minimum=0, maximum=1, interactive=False, value=0, step=0.01)
+                 checkboxes.append(checkbox)
+                 sliders.append(slider)
+
+     # Add voice combination formula display
+     with gr.Row(equal_height=True):
+         formula_display = gr.Textbox(label="Voice Combination Formula", value="", lines=2, scale=4)
+         input_text = gr.Textbox(label="Input Text", placeholder="Enter text to convert to speech", lines=2, scale=4)
+         button_tts = gr.Button("Generate Voice", scale=2, min_width=100)
+
+     # Generate speech from the selected custom voice
+     with gr.Row(equal_height=True):
+         kokoro_tts = gr.Audio(label="Generated Speech", type="numpy")
+
+     def generate_voice_formula(*values):
+         """
+         Generate a formatted string showing the normalized voice combination.
+         Returns: String like "0.6 * voice1 + 0.4 * voice2"
+         """
+         n = len(values) // 2
+         checkbox_values = values[:n]
+         slider_values = list(values[n:])
+
+         # Get active sliders and their names
+         active_pairs = [(slider_values[i], slider_configs[i][1])
+                         for i in range(len(slider_configs))
+                         if checkbox_values[i] and slider_values[i] > 0]
+
+         if not active_pairs:
+             return ""
+
+         # Calculate sum for normalization
+         total_sum = sum(value for value, _ in active_pairs)
+
+         if total_sum == 0:
+             return ""
+
+         # Generate normalized formula
+         terms = []
+         for value, name in active_pairs:
+             normalized_value = value / total_sum
+             terms.append(f"{normalized_value:.3f} * {name}")
+
+         return " + ".join(terms)
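+
+     # Example (illustrative): checking af_bella and af_sky with sliders at 0.5
+     # and 0.25 yields "0.667 * af_bella + 0.333 * af_sky", since weights are
+     # normalized to sum to 1 before being shown in the formula box.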
+
+     def check_box(checkbox):
+         """Handle checkbox changes."""
+         if checkbox:
+             return gr.Slider(interactive=True, value=0.5)
+         else:
+             return gr.Slider(interactive=False, value=0)
+
+     # Connect all checkboxes and sliders
+     all_inputs = checkboxes + sliders
+
+     # Update on checkbox changes
+     for checkbox, slider in zip(checkboxes, sliders):
+         checkbox.change(
+             fn=check_box,
+             inputs=[checkbox],
+             outputs=[slider]
+         )
+
+         # Update formula on checkbox changes
+         checkbox.change(
+             fn=generate_voice_formula,
+             inputs=all_inputs,
+             outputs=[formula_display]
+         )
+
+     # Update formula on slider changes
+     for slider in sliders:
+         slider.change(
+             fn=generate_voice_formula,
+             inputs=all_inputs,
+             outputs=[formula_display]
+         )
+
+     button_tts.click(
+         fn=text_to_speech,
+         inputs=[input_text, formula_display],
+         outputs=[kokoro_tts]
+     )
+
+ if __name__ == "__main__":
+     demo.launch()
packages.txt ADDED
@@ -0,0 +1,2 @@
+ espeak-ng
+ git-lfs
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ torch
+ transformers
+ scipy
+ munch
+ phonemizer