SOUND_EFFECT_GENERATION = """
You should help me to make an audiobook with realistic emotion sound using TTS.
You are tasked with generating a description of sound effects
that matches the atmosphere, actions, and tone of a given sentence or text from a book.
The description should be tailored to create a sound effect using ElevenLabs' sound generation API.
The generated sound description must evoke the scene
or emotions from the text (e.g., footsteps, wind, tense silence, etc.),
and it should be succinct and fit the mood of the text.
Additionally, you should include the following parameters in your response:
Text: A generated description of the sound that matches the text provided.
Keep the description simple and effective to capture the soundscape.
This text will be converted into a sound effect.
Duration_seconds: The appropriate duration of the sound effect,
which should be calculated based on the length and nature of the scene.
Cap this duration at 22 seconds. Be careful to scale it with the input: use a long sound effect for a long input text
and a short one for a short text. The duration should be similar to the duration of the input text.
Prompt_influence: A value between 0 and 1, where a higher value makes the sound generation follow
the sound description more closely. For general sound effects (e.g., footsteps, background ambiance),
use a value around 0.3. For more specific or detailed sound scenes
(e.g., thunderstorm, battle sounds), use a higher value like 0.5 to 0.7.
Your output should be in the following JSON format:
{
"text": "A soft breeze rustling through leaves, distant birds chirping.",
"duration_seconds": 4.0,
"prompt_influence": 0.4
}
"""
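# A minimal sketch (not part of the original module) of how a reply to
# SOUND_EFFECT_GENERATION might be parsed and sanity-checked before use.
# The helper name `parse_sound_effect_response` and the clamping policy are
# assumptions; only the three JSON keys, the 0..1 range and the 22-second cap
# come from the prompt above.
import json

MAX_SOUND_EFFECT_SECONDS = 22.0  # cap stated in the prompt above


def parse_sound_effect_response(raw: str) -> dict:
    """Parse the model's JSON reply and clamp numeric fields to the prompt's ranges."""
    data = json.loads(raw)
    text = str(data["text"]).strip()
    if not text:
        raise ValueError("empty sound description")
    # Clamp rather than reject, so a slightly out-of-range reply stays usable.
    duration = min(max(float(data["duration_seconds"]), 0.0), MAX_SOUND_EFFECT_SECONDS)
    influence = min(max(float(data["prompt_influence"]), 0.0), 1.0)
    return {"text": text, "duration_seconds": duration, "prompt_influence": influence}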
SOUND_EFFECT_GENERATION_WITHOUT_DURATION_PREDICTION = """
You should help me to make an audiobook with realistic emotion sound using TTS.
You are tasked with generating a description of sound effects
that matches the atmosphere, actions, and tone of a given sentence or text from a book.
The description should be tailored to create a sound effect using ElevenLabs' sound generation API.
The generated sound description must evoke the scene
or emotions from the text (e.g., footsteps, wind, tense silence, etc.),
and it should be succinct and fit the mood of the text.
Additionally, you should include the following parameters in your response:
Text: A generated description of the sound that matches the text provided.
Keep the description simple and effective to capture the soundscape.
This text will be converted into a sound effect.
Prompt_influence: A value between 0 and 1, where a higher value makes the sound generation follow
the sound description more closely. For general sound effects (e.g., footsteps, background ambiance),
use a value around 0.3. For more specific or detailed sound scenes
(e.g., thunderstorm, battle sounds), use a higher value like 0.5 to 0.7.
Your output should be in the following JSON format:
{
"text": "A soft breeze rustling through leaves, distant birds chirping.",
"prompt_influence": 0.4
}
"""
TEXT_MODIFICATION = """
You should help me to make an audiobook with realistic emotion-based voice using TTS.
You are tasked with adjusting the emotional tone of a given text
by modifying the text with special characters such as "!", "...", "-", "~",
and uppercase words to add emphasis or convey emotion. To add more emotion, you can
repeat special characters, for example "!!!".
Do not remove or add any different words.
Only alter the presentation of the existing words.
You can also add a pause to the output text if it is needed.
The most consistent way is to use the syntax <break time="1.5s" />, with any duration in seconds that fits the text.
This will create an exact and natural pause in the speech.
It is not just added silence between words,
but the AI has an actual understanding of this syntax and will add a natural pause.
After modifying the text, adjust the "stability", "similarity_boost" and "style" parameters
according to the level of emotional intensity in the modified text.
Higher emotional intensity should lower the "stability" and raise the "similarity_boost".
Your output should be in the following JSON format:
{
"modified_text": "Modified text with emotional adjustments.",
"params": {
"stability": 0.7,
"similarity_boost": 0.5,
"style": 0.3
}
}
The "stability" parameter should range from 0 to 1,
with lower values indicating a more expressive, less stable voice.
The "similarity_boost" parameter should also range from 0 to 1,
with higher values indicating more emphasis on the voice similarity.
The "style" parameter should also range from 0 to 1,
where lower values indicate a neutral tone and higher values reflect more stylized or emotional delivery.
Adjust all three parameters according to the emotional intensity of the text.
Example of text that could be passed:
Text: "I can't believe this is happening."
"""
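# A minimal sketch (not part of the original module) of checking a reply to
# TEXT_MODIFICATION against its rule "Do not remove or add any different words".
# The helper name `words_preserved` and the tokenization are assumptions: words
# are runs of word characters with optional ASCII apostrophes, so emphasis
# inserted inside a word (e.g. "so-o") or a curly apostrophe counts as a change.
import re

_BREAK_TAG = re.compile(r"<break\s+time=\"[^\"]*\"\s*/>")
_WORD = re.compile(r"\w+(?:'\w+)*")


def words_preserved(original: str, modified: str) -> bool:
    """Return True if both texts contain the same words in the same order,
    ignoring case, punctuation and <break time="..." /> tags."""
    def words(text: str) -> list:
        # Drop pause tags first so "break", "time" and "s" are not counted as words.
        return [w.lower() for w in _WORD.findall(_BREAK_TAG.sub(" ", text))]
    return words(original) == words(modified)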
TEXT_MODIFICATION_WITH_SSML = """
You should help me to make an audiobook with overabundant emotion-based voice using TTS.
You are tasked with transforming the text provided into a sophisticated SSML script
that is optimized for emotionally rich, dramatic and breathtaking audiobook narration.
Analyze the text for underlying emotions, detect nuances in intonation, and discern the intended impact.
Apply suitable SSML enhancements to ensure that the final TTS output delivers
a powerful, engaging, dramatic and breathtaking listening experience appropriate for an audiobook context
(more effects/emotions are better than fewer).
Please use only the SSML tags listed below and do not generate any other tags.
Key SSML Tags to Utilize:
<speak>: This is the root element. All SSML content to be synthesized must be enclosed within this tag.
<prosody>: Manipulates pitch, rate, and volume to convey various emotions and emphases. Use this tag to adjust the voice to match the mood and tone of different parts of the narrative.
<break>: Inserts pauses of specified durations. Use this to create natural breaks in speech, aiding in dramatic effect and better comprehension for listeners.
<emphasis>: Adds stress to words or phrases to highlight key points or emotions, similar to vocal emphasis in natural speech.
<p> and <s>: Structural tags that denote paragraphs and sentences, respectively. They help to manage the flow and pacing of the narrative appropriately.
Input Text Example: "He stood there, gazing into the endless horizon. As the sun slowly sank, painting the sky with hues of orange and red, he felt a sense of deep melancholy mixed with awe."
Modified text should be in the XML format. Expected SSML-enriched Output:
<speak>
<p>
<s>
He stood there, <prosody rate="slow" volume="soft">gazing into the endless horizon.</prosody>
</s>
<s>
As the sun slowly <prosody rate="medium" pitch="-2st">sank,</prosody>
<prosody volume="medium" pitch="+1st">painting the sky with hues of orange and red,</prosody>
he felt a sense of deep <prosody volume="soft" pitch="-1st">melancholy</prosody> mixed with <emphasis level="moderate">awe.</emphasis>
</s>
</p>
</speak>
After modifying the text, adjust the "stability", "similarity_boost" and "style" parameters
according to the level of emotional intensity in the modified text.
Higher emotional intensity should lower the "stability" and raise the "similarity_boost".
Your output should be in the following JSON format:
{
"modified_text": "Modified text in xml format with SSML tags.",
"params": {
"stability": 0.7,
"similarity_boost": 0.5,
"style": 0.3
}
}
The "stability" parameter should range from 0 to 1,
with lower values indicating a more expressive, less stable voice.
The "similarity_boost" parameter should also range from 0 to 1,
with higher values indicating more emphasis on the voice similarity.
The "style" parameter should also range from 0 to 1,
where lower values indicate a neutral tone and higher values reflect more stylized or emotional delivery.
Adjust all three parameters according to the emotional intensity of the text.
"""
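# A minimal sketch (not part of the original module) of enforcing the
# "use only provided SSML tags" rule from TEXT_MODIFICATION_WITH_SSML.
# The helper name `find_disallowed_ssml_tags` is an assumption; the allowed
# set mirrors the tags listed in the prompt above.
import xml.etree.ElementTree as ET

ALLOWED_SSML_TAGS = {"speak", "prosody", "break", "emphasis", "p", "s"}


def find_disallowed_ssml_tags(ssml: str) -> set:
    """Return the tag names used in `ssml` that the prompt does not list.

    Raises xml.etree.ElementTree.ParseError if `ssml` is not well-formed XML.
    """
    root = ET.fromstring(ssml)
    # root.iter() walks the root element and all of its descendants.
    return {element.tag for element in root.iter()} - ALLOWED_SSML_TAGS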