Update README.md
#### Speech Conversation

<details> <summary> Model initialization </summary>

```python
import torch
import librosa
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2; eager is not supported
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

model.init_tts()
model.tts.float()
```

</details>
##### Mimick

The `Mimick` task reflects a model's end-to-end speech modeling capability: the model takes audio input, outputs an ASR transcription, and then reconstructs the original audio with high similarity. The higher the similarity between the reconstructed audio and the original, the stronger the model's foundational capability in end-to-end speech modeling.

<details> <summary>Click here to demonstrate the capability of end-to-end audio understanding and generation.</summary>

```python
mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)
```
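
The call returns the ASR transcription in `res` and writes the reconstructed audio to `output_audio_path`. As a rough sanity check (a sketch, not part of the original example; file names follow the placeholders above), you can compare the durations of the input and the reconstruction:

```python
# Sketch: verify the reconstruction roughly matches the input duration.
# 'xxx.wav' and 'result.wav' are the placeholder paths used above.
original, sr = librosa.load('xxx.wav', sr=16000, mono=True)
reconstructed, _ = librosa.load('result.wav', sr=16000, mono=True)
print(res)  # the ASR transcription returned by model.chat
print(f"input: {len(original) / sr:.2f}s, reconstruction: {len(reconstructed) / sr:.2f}s")
```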

</details>

##### General Speech Conversation with Configurable Voices

A common usage scenario for MiniCPM-o 2.6 is role-playing a specific character based on an audio prompt: the model mimics the character's voice to some extent and acts like the character in text, including language style. In this mode, MiniCPM-o 2.6 sounds **more natural and human-like**. Self-defined audio prompts can be used to customize the character's voice in an end-to-end manner.

<details> <summary>Click to view the Python code for enabling MiniCPM-o 2.6 to interact with you in a specified voice.</summary>

```python
ref_audio, _ = librosa.load('./assets/voice_01.wav', sr=16000, mono=True)  # load the reference audio
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en')

# round one
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)

# round two
msgs.append({'role': 'assistant', 'content': res})  # append the reply to the history (list.append returns None, so don't assign its result)
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs.append(user_question)
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_round_2.wav',
)
print(res)
```
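
For longer conversations, the same append-and-chat pattern generalizes to a loop. This is a sketch, not from the original README; `question_files` is a hypothetical list of recorded user turns:

```python
# Sketch: an N-round conversation following the same pattern as above.
question_files = ['question_1.wav', 'question_2.wav']  # hypothetical inputs
msgs = [sys_prompt]
for i, path in enumerate(question_files):
    audio, _ = librosa.load(path, sr=16000, mono=True)
    msgs.append({'role': 'user', 'content': [audio]})
    res = model.chat(
        msgs=msgs,
        tokenizer=tokenizer,
        sampling=True,
        max_new_tokens=128,
        use_tts_template=True,
        generate_audio=True,
        temperature=0.3,
        output_audio_path=f'result_round_{i + 1}.wav',
    )
    msgs.append({'role': 'assistant', 'content': res})  # keep the history for the next round
```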

</details>

##### Speech Conversation as an AI Assistant

An enhanced feature of MiniCPM-o 2.6 is acting as an AI assistant, though with a limited choice of voices. In this mode, MiniCPM-o 2.6 is **less human-like and more like a voice assistant**, but it follows instructions more closely.

<details> <summary>Click to view the Python code for enabling MiniCPM-o 2.6 to act as an AI assistant.</summary>

```python
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}

# round one
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)
print(res)
```
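
Subsequent rounds can reuse the append-and-chat pattern from the role-play example above. In a notebook, the saved spoken reply can be played back directly with IPython's standard audio widget:

```python
# Play the assistant's spoken reply in a Jupyter notebook.
from IPython.display import Audio

Audio('result.wav')  # path set via output_audio_path above
```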

</details>

##### Instruction-to-Speech

MiniCPM-o 2.6 can also perform Instruction-to-Speech, aka **Voice Creation**: describe a voice in detail, and the model will generate a voice that matches the description. For more sample instructions, refer to https://voxinstruct.github.io/VoxInstruct/.

<details>
<summary> Click to view Python code running MiniCPM-o 2.6 with Instruction-to-Speech. </summary>

```python
# English example
instruction = 'Delighting in a surprised tone, an adult male with low pitch and low volume comments: "One even gave my little dog a biscuit." This dialogue takes place at a leisurely pace, delivering a sense of excitement and surprise in the context.'

# Chinese example (overwrites the English one; pick whichever you want to try).
# Translation: "In a news report, a young man enthusiastically says: 'Blessings to our dear motherland, may she be beautiful and prosperous!' He delivers the line slowly, with a low pitch and low volume."
instruction = '在新闻中,一个年轻男性兴致勃勃地说:“祝福亲爱的祖国母亲美丽富强!”他用低音调和低音量,慢慢地说出了这句话。'

msgs = [{'role': 'user', 'content': [instruction]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)
```
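
A small helper makes it easier to render several descriptions to separate files instead of overwriting `instruction`. This is a sketch; `create_voice` is a hypothetical name, and the call arguments mirror the block above:

```python
# Sketch: wrap voice creation so each description gets its own output file.
def create_voice(description: str, out_path: str) -> str:
    msgs = [{'role': 'user', 'content': [description]}]
    return model.chat(
        msgs=msgs,
        tokenizer=tokenizer,
        sampling=True,
        max_new_tokens=128,
        use_tts_template=True,
        generate_audio=True,
        temperature=0.3,
        output_audio_path=out_path,
    )

create_voice(instruction, 'result_zh.wav')  # reuse the description defined above
```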

</details>

##### Voice Cloning

MiniCPM-o 2.6 can also perform zero-shot text-to-speech, aka **Voice Cloning**. In this mode, the model acts like a TTS model.

<details>
<summary> Click to show Python code running MiniCPM-o 2.6 with voice cloning. </summary>

```python
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
text_prompt = "Please read the text below."
user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]}

msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)
```
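
The same mode also supports voice conversion: pass a recording instead of plain text, and the model reads that recording's content in the reference voice. A sketch using the same placeholder file name:

```python
# Voice Conversion variant: read the content of 'xxx.wav' in the voice set via sys_prompt.
user_question = {'role': 'user', 'content': [text_prompt, librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)
```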

</details>

##### Addressing Various Audio Understanding Tasks

MiniCPM-o 2.6 can also be used to address various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging.

<details>
<summary> Click to show Python code running MiniCPM-o 2.6 on a specific audio QA task. </summary>

For audio-to-text tasks, you can use the following prompts:

- ASR with ZH (same as AST en2zh): `请仔细听这段音频片段,并将其内容逐字记录。` ("Please listen to this audio clip carefully and transcribe its content verbatim.")
- ASR with EN (same as AST zh2en): `Please listen to the audio snippet carefully and transcribe the content.`
- Speaker Analysis: `Based on the speaker's content, speculate on their gender, condition, age range, and health status.`
- General Audio Caption: `Summarize the main content of the audio.`
- General Sound Scene Tagging: `Utilize one keyword to convey the audio's content or the associated scene.`

```python
task_prompt = "Please listen to the audio snippet carefully and transcribe the content." + "\n"  # can be changed to any of the prompts above
audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)

msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)
print(res)
```
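
To compare several tasks on the same clip, the prompts above can be looped over. A sketch, reusing `audio_input` and the call parameters from the block above:

```python
# Sketch: run each audio-to-text prompt from the list above on the same clip.
task_prompts = [
    "Please listen to the audio snippet carefully and transcribe the content.",
    "Based on the speaker's content, speculate on their gender, condition, age range, and health status.",
    "Summarize the main content of the audio.",
    "Utilize one keyword to convey the audio's content or the associated scene.",
]
for i, prompt in enumerate(task_prompts):
    msgs = [{'role': 'user', 'content': [prompt + "\n", audio_input]}]
    res = model.chat(
        msgs=msgs,
        tokenizer=tokenizer,
        sampling=True,
        max_new_tokens=128,
        use_tts_template=True,
        generate_audio=True,
        temperature=0.3,
        output_audio_path=f'result_{i}.wav',
    )
    print(prompt, '->', res)
```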

</details>

</details>