bokesyo committed ca9a10b (verified) · 1 Parent(s): 21e853e

Update README.md

Files changed (1): README.md (+140 -51)
README.md CHANGED
 
#### Speech Conversation

<details> <summary> Model initialization </summary>

```python
import torch
import librosa
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

model.init_tts()
model.tts.float()
```

</details>

##### Mimick

The `Mimick` task reflects a model's end-to-end speech modeling capability. The model takes audio as input, outputs an ASR transcription, and then reconstructs the original audio with high similarity. The higher the similarity between the reconstructed audio and the original, the stronger the model's foundational capability in end-to-end speech modeling.

<details> <summary>Click to view the Python code demonstrating end-to-end audio understanding and generation.</summary>

```python
mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)  # load the audio clip to mimic
msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)
```

</details>
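
To get a quick, rough sense of how close the reconstruction in `result.wav` is to the original input, one option (an illustrative choice, not an official metric) is cosine similarity between averaged MFCC vectors computed with `librosa`:

<details> <summary>Click to view an optional Python sketch for scoring Mimick reconstructions.</summary>

```python
import numpy as np
import librosa

def mfcc_cosine_similarity(path_a, path_b, sr=16000):
    # Collapse each file's MFCC frames into one timbre vector, then compare directions
    y_a, _ = librosa.load(path_a, sr=sr, mono=True)
    y_b, _ = librosa.load(path_b, sr=sr, mono=True)
    vec_a = librosa.feature.mfcc(y=y_a, sr=sr, n_mfcc=20).mean(axis=1)
    vec_b = librosa.feature.mfcc(y=y_b, sr=sr, n_mfcc=20).mean(axis=1)
    return float(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

print(mfcc_cosine_similarity('xxx.wav', 'result.wav'))  # closer to 1.0 = more similar
```

</details>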

##### General Speech Conversation with Configurable Voices

A common usage scenario for MiniCPM-o 2.6 is role-playing a specific character based on an audio prompt. The model mimics the character's voice to some extent and acts like the character in text, including language style. In this mode, MiniCPM-o 2.6 sounds **more natural and human-like**. Self-defined audio prompts can be used to customize the voice of the character in an end-to-end manner.

<details> <summary>Click to view the Python code for enabling MiniCPM-o 2.6 to interact with you in a specified voice.</summary>

```python
ref_audio, _ = librosa.load('./assets/voice_01.wav', sr=16000, mono=True)  # load the reference audio
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en')

# round one
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)

# round two
history = msgs + [{'role': 'assistant', 'content': res}]  # list.append() returns None, so extend the history by concatenation
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs = history + [user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_round_2.wav',
)
print(res)
```

</details>
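
Longer conversations follow the same pattern: keep appending `user`/`assistant` turns to `msgs` and call `model.chat` again each round. Below is a minimal sketch built only on the API shown above; the `turn_*.wav` recordings and reply file names are hypothetical placeholders:

<details> <summary>Click to view an optional Python sketch for a multi-round conversation loop.</summary>

```python
def chat_round(msgs, wav_path):
    # One turn: query the model, then record its reply in the running history
    res = model.chat(
        msgs=msgs,
        tokenizer=tokenizer,
        sampling=True,
        max_new_tokens=128,
        use_tts_template=True,
        generate_audio=True,
        temperature=0.3,
        output_audio_path=wav_path,
    )
    msgs.append({'role': 'assistant', 'content': res})
    return res

msgs = [sys_prompt]  # sys_prompt from the role-play example above
for i, turn_wav in enumerate(['turn_1.wav', 'turn_2.wav']):  # hypothetical user recordings
    audio, _ = librosa.load(turn_wav, sr=16000, mono=True)
    msgs.append({'role': 'user', 'content': [audio]})
    print(chat_round(msgs, f'reply_{i}.wav'))
```

</details>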

##### Speech Conversation as an AI Assistant

An enhanced feature of MiniCPM-o 2.6 is acting as an AI assistant, though only with a limited choice of voices. In this mode, MiniCPM-o 2.6 sounds **less human-like and more like a voice assistant**, but it follows instructions more reliably.

<details> <summary>Click to view the Python code for enabling MiniCPM-o 2.6 to act as an AI assistant.</summary>

```python
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')  # reuses the reference audio loaded above
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}

# round one
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)
print(res)

# further rounds follow the same history-passing pattern as the role-play example above
```

</details>
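
The examples above read each user turn from `'xxx.wav'`. To record that file live, one option (an assumption on top of this repo, using the `sounddevice` and `soundfile` packages) is:

<details> <summary>Click to view an optional Python sketch for recording a question from the microphone.</summary>

```python
import sounddevice as sd
import soundfile as sf

sr = 16000     # match the sample rate used by librosa.load in the examples
duration = 5   # seconds to record
print('Recording...')
audio = sd.rec(int(duration * sr), samplerate=sr, channels=1)
sd.wait()      # block until the recording finishes
sf.write('xxx.wav', audio, sr)
```

</details>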

##### Instruction-to-Speech

MiniCPM-o 2.6 can also perform Instruction-to-Speech, a.k.a. **Voice Creation**. Describe a voice in detail, and the model will generate speech in a voice that matches the description. For more Instruction-to-Speech samples, see https://voxinstruct.github.io/VoxInstruct/.

<details>
<summary> Click to view Python code running MiniCPM-o 2.6 with Instruction-to-Speech. </summary>

```python
# Pick one of the example instructions below:
instruction = 'Delighting in a surprised tone, an adult male with low pitch and low volume comments: "One even gave my little dog a biscuit." This dialogue takes place at a leisurely pace, delivering a sense of excitement and surprise in the context.'

# Chinese example: a young male news voice delivers "祝福亲爱的祖国母亲美丽富强!" slowly, with low pitch and low volume
# instruction = '在新闻中,一个年轻男性兴致勃勃地说:“祝福亲爱的祖国母亲美丽富强!”他用低音调和低音量,慢慢地说出了这句话。'

msgs = [{'role': 'user', 'content': [instruction]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)
```

</details>

##### Voice Cloning

MiniCPM-o 2.6 can also perform zero-shot text-to-speech, a.k.a. **Voice Cloning**. In this mode, the model acts like a TTS model, reading the given text in the voice of the reference audio.

<details>
<summary> Click to show Python code running MiniCPM-o 2.6 with voice cloning. </summary>

```python
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')  # clone the voice of the reference audio loaded above
text_prompt = "Please read the text below."
user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]}
# user_question = {'role': 'user', 'content': [text_prompt, librosa.load('xxx.wav', sr=16000, mono=True)[0]]}  # Voice Conversion: read the content of 'xxx.wav' in the cloned voice

msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)
```

</details>
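
For text longer than one short passage, you can clone chunk by chunk and write one wav per chunk. This is only a sketch on top of the `model.chat` call above; the naive sentence split and output naming are arbitrary choices:

<details> <summary>Click to view an optional Python sketch for reading long text in the cloned voice.</summary>

```python
long_text = "First sentence. Second sentence. Third sentence."  # placeholder text to read
chunks = [s.strip() + '.' for s in long_text.split('.') if s.strip()]  # naive sentence split

for i, chunk in enumerate(chunks):
    msgs = [sys_prompt, {'role': 'user', 'content': [text_prompt, chunk]}]  # sys_prompt / text_prompt from above
    model.chat(
        msgs=msgs,
        tokenizer=tokenizer,
        sampling=True,
        max_new_tokens=128,
        use_tts_template=True,
        generate_audio=True,
        temperature=0.3,
        output_audio_path=f'clone_{i:03d}.wav',  # one clip per chunk
    )
```

</details>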

##### Addressing Various Audio Understanding Tasks

MiniCPM-o 2.6 can also address various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging.

<details>
<summary> Click to show Python code running MiniCPM-o 2.6 on a specific audio QA task. </summary>

For audio-to-text tasks, you can use the following prompts:

- ASR in Chinese (same as AST en2zh): `请仔细听这段音频片段,并将其内容逐字记录。`
- ASR in English (same as AST zh2en): `Please listen to the audio snippet carefully and transcribe the content.`
- Speaker Analysis: `Based on the speaker's content, speculate on their gender, condition, age range, and health status.`
- General Audio Caption: `Summarize the main content of the audio.`
- General Sound Scene Tagging: `Utilize one keyword to convey the audio's content or the associated scene.`

```python
task_prompt = "Please listen to the audio snippet carefully and transcribe the content." + "\n"  # swap in any of the prompts above
audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)

msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)
print(res)
```

</details>
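
To probe several of these tasks on the same clip, a small loop over the prompts above suffices. A minimal sketch; only the per-task output file names are invented here:

<details> <summary>Click to view an optional Python sketch that runs all the prompts above on one clip.</summary>

```python
tasks = [
    "Please listen to the audio snippet carefully and transcribe the content.",  # ASR (EN)
    "Based on the speaker's content, speculate on their gender, condition, age range, and health status.",
    "Summarize the main content of the audio.",
    "Utilize one keyword to convey the audio's content or the associated scene.",
]

audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
for i, task_prompt in enumerate(tasks):
    res = model.chat(
        msgs=[{'role': 'user', 'content': [task_prompt + "\n", audio_input]}],
        tokenizer=tokenizer,
        sampling=True,
        max_new_tokens=128,
        use_tts_template=True,
        generate_audio=True,
        temperature=0.3,
        output_audio_path=f'result_{i}.wav',
    )
    print(i, '->', res)
```

</details>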

</details>
1356