
ai-audio-books/src/emotions/prompts.py

bl4dylion

add text_mod and audio normalization

3ee8f12 3 months ago


	SOUND_EFFECT_GENERATION = """
	You should help me to make an audiobook with realistic emotion sound using TTS.
	You are tasked with generating a description of sound effects
	that matches the atmosphere, actions, and tone of a given sentence or text from a book.
	The description should be tailored to create a sound effect using ElevenLabs' sound generation API.
	The generated sound description must evoke the scene
	or emotions from the text (e.g., footsteps, wind, tense silence, etc.),
	and it should be succinct and fit the mood of the text.

	Additionally, you should include the following parameters in your response:

	Text: A generated description of the sound that matches the text provided.
	Keep the description simple and effective to capture the soundscape.
	This text will be converted into a sound effect.
	Duration_seconds: The appropriate duration of the sound effect,
	which should be calculated based on the length and nature of the scene.
	Cap this duration at 22 seconds. Be careful, though: a long input text calls for a long sound effect,
	and a short input text for a short one. The duration should be close to the duration of the input text.
	Prompt_influence: A value between 0 and 1, where a higher value makes the sound generation follow
	the sound description more closely. For general sound effects (e.g., footsteps, background ambiance),
	use a value around 0.3. For more specific or detailed sound scenes
	(e.g., thunderstorm, battle sounds), use a higher value like 0.5 to 0.7.

	Your output should be in the following JSON format:

	{
	"text": "A soft breeze rustling through leaves, distant birds chirping.",
	"duration_seconds": 4.0,
	"prompt_influence": 0.4
	}

	"""

	SOUND_EFFECT_GENERATION_WITHOUT_DURATION_PREDICTION = """
	You should help me to make an audiobook with realistic emotion sound using TTS.
	You are tasked with generating a description of sound effects
	that matches the atmosphere, actions, and tone of a given sentence or text from a book.
	The description should be tailored to create a sound effect using ElevenLabs' sound generation API.
	The generated sound description must evoke the scene
	or emotions from the text (e.g., footsteps, wind, tense silence, etc.),
	and it should be succinct and fit the mood of the text.

	Additionally, you should include the following parameters in your response:

	Text: A generated description of the sound that matches the text provided.
	Keep the description simple and effective to capture the soundscape.
	This text will be converted into a sound effect.
	Prompt_influence: A value between 0 and 1, where a higher value makes the sound generation follow
	the sound description more closely. For general sound effects (e.g., footsteps, background ambiance),
	use a value around 0.3. For more specific or detailed sound scenes
	(e.g., thunderstorm, battle sounds), use a higher value like 0.5 to 0.7.

	Your output should be in the following JSON format:

	{
	"text": "A soft breeze rustling through leaves, distant birds chirping.",
	"prompt_influence": 0.4
	}

	"""

	TEXT_MODIFICATION = """
	You should help me to make an audiobook with realistic emotion-based voice using TTS.
	You are tasked with adjusting the emotional tone of a given text
	by modifying the text with special characters such as "!", "...", "-", "~",
	and uppercase words to add emphasis or convey emotion. To add more emotion, you can
	repeat special characters, for example "!!!".
	Do not remove any words or add new ones.
	Only alter the presentation of the existing words.

	You can also add a pause to the output text where it is needed.
	The most consistent way is to use the syntax <break time="1.5s" />, with any duration in seconds that fits the text.
	This creates an exact and natural pause in the speech.
	It is not just silence added between words:
	the AI actually understands this syntax and will add a natural pause.

	After modifying the text, adjust the "stability", "similarity_boost" and "style" parameters
	according to the level of emotional intensity in the modified text.
	Higher emotional intensity should lower the "stability" and raise the "similarity_boost".
	Your output should be in the following JSON format:
	{
	"modified_text": "Modified text with emotional adjustments.",
	"params": {
	"stability": 0.7,
	"similarity_boost": 0.5,
	"style": 0.3
	}
	}

	The "stability" parameter should range from 0 to 1,
	with lower values indicating a more expressive, less stable voice.
	The "similarity_boost" parameter should also range from 0 to 1,
	with higher values indicating more emphasis on the voice similarity.
	The "style" parameter should also range from 0 to 1,
	where lower values indicate a neutral tone and higher values reflect more stylized or emotional delivery.
	Adjust all three according to the emotional intensity of the text.

	Example of text that could be passed:

	Text: "I can't believe this is happening."
	"""

	TEXT_MODIFICATION_WITH_SSML = """
	You should help me to make an audiobook with overabundant emotion-based voice using TTS.
	You are tasked with transforming the text provided into a sophisticated SSML script
	that is optimized for emotionally rich, dramatic and breathtaking audiobook narration.
	Analyze the text for underlying emotions, detect nuances in intonation, and discern the intended impact.
	Apply suitable SSML enhancements to ensure that the final TTS output delivers
	a powerful, engaging, dramatic and breathtaking listening experience appropriate for an audiobook context
	(more effects/emotions are better than fewer).

	Please use only the SSML tags provided and do not generate any other tags.
	Key SSML Tags to Utilize:
	<speak>: This is the root element. All SSML content to be synthesized must be enclosed within this tag.
	<prosody>: Manipulates pitch, rate, and volume to convey various emotions and emphases. Use this tag to adjust the voice to match the mood and tone of different parts of the narrative.
	<break>: Inserts pauses of specified durations. Use this to create natural breaks in speech, aiding in dramatic effect and better comprehension for listeners.
	<emphasis>: Adds stress to words or phrases to highlight key points or emotions, similar to vocal emphasis in natural speech.
	<p> and <s>: Structural tags that denote paragraphs and sentences, respectively. They help to manage the flow and pacing of the narrative appropriately.

	Input Text Example: "He stood there, gazing into the endless horizon. As the sun slowly sank, painting the sky with hues of orange and red, he felt a sense of deep melancholy mixed with awe."

	The modified text should be in XML format. Expected SSML-enriched output:

	<speak>
	<p>
	<s>
	He stood there, <prosody rate="slow" volume="soft">gazing into the endless horizon.</prosody>
	</s>
	<s>
	As the sun slowly <prosody rate="medium" pitch="-2st">sank,</prosody>
	<prosody volume="medium" pitch="+1st">painting the sky with hues of orange and red,</prosody>
	he felt a sense of deep <prosody volume="soft" pitch="-1st">melancholy</prosody> mixed with <emphasis level="moderate">awe.</emphasis>
	</s>
	</p>
	</speak>

	After modifying the text, adjust the "stability", "similarity_boost" and "style" parameters
	according to the level of emotional intensity in the modified text.
	Higher emotional intensity should lower the "stability" and raise the "similarity_boost".
	Your output should be in the following JSON format:
	{
	"modified_text": "Modified text in xml format with SSML tags.",
	"params": {
	"stability": 0.7,
	"similarity_boost": 0.5,
	"style": 0.3
	}
	}

	The "stability" parameter should range from 0 to 1,
	with lower values indicating a more expressive, less stable voice.
	The "similarity_boost" parameter should also range from 0 to 1,
	with higher values indicating more emphasis on the voice similarity.
	The "style" parameter should also range from 0 to 1,
	where lower values indicate a neutral tone and higher values reflect more stylized or emotional delivery.
	Adjust all three according to the emotional intensity of the text.
	"""