3v324v23 committed
Commit 7063fbc · 1 Parent(s): 8670477

update readme, add audio input examples
README.md CHANGED

@@ -1000,7 +1000,7 @@ model.tts.float()
 ```
 
 ### Omni mode
-we provide two inference modes: chat and streaming
+We provide two inference modes: chat and streaming.
 
 #### Chat inference
 ```python

@@ -1153,12 +1153,10 @@ model.tts.float()
 
 The `Mimick` task reflects a model's end-to-end speech modeling capability. The model takes audio input, outputs an ASR transcription, and then reconstructs the original audio with high similarity. The higher the similarity between the reconstructed audio and the original audio, the stronger the model's foundational capability in end-to-end speech modeling.
 
-<details> <summary>Click here to demonstrate the capability of end-to-end audio understanding and generation. </summary>
-
 ```python
 mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
 audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
-msgs = [{'role': 'user', 'content': [mimick_prompt,audio_input]}]
+msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}]
 res = model.chat(
     msgs=msgs,
     tokenizer=tokenizer,

@@ -1171,13 +1169,11 @@ res = model.chat(
 )
 ```
 
-</details>
-
 <hr/>
 
 #### General Speech Conversation with Configurable Voices
 
-A general usage scenario of MiniCPM-o 2.6 is role-playing a specific character based on the audio prompt. It will mimic the voice of the character to some extent and act like the character in text, including language style. In this mode, MiniCPM-o-2.6 will sounds **more natural and human-like**. Self-defined audio prompts can be used to customize the voice of the character in an end-to-end manner.
+A general usage scenario of `MiniCPM-o-2.6` is role-playing a specific character based on an audio prompt. The model mimics the character's voice to some extent and acts like the character in text, including language style. In this mode, `MiniCPM-o-2.6` sounds **more natural and human-like**. Self-defined audio prompts can be used to customize the character's voice in an end-to-end manner.
 
 
 ```python

@@ -1219,7 +1215,7 @@ print(res)
 
 #### Speech Conversation as an AI Assistant
 
-An enhanced feature of MiniCPM-o-2.6 is to act as an AI assistant, but only with limited choice of voices. In this mode, MiniCPM-o-2.6 is **less human-like and more like a voice assistant**. But it is more instruction-following.
+An enhanced feature of `MiniCPM-o-2.6` is acting as an AI assistant, with a limited choice of voices. In this mode, `MiniCPM-o-2.6` is **less human-like and more like a voice assistant**, but it follows instructions more reliably.
 
 ```python
 sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')

@@ -1259,10 +1255,7 @@ print(res)
 
 #### Instruction-to-Speech
 
-MiniCPM-o-2.6 can also do Instruction-to-Speech, aka **Voice Creation**. You can describe a voice in detail, and the model will generate a voice that matches the description. For more Instruction-to-Speech sample instructions, you can refer to https://voxinstruct.github.io/VoxInstruct/.
-
-<details>
-<summary> Click to view Python code running MiniCPM-o 2.6 with Instruction-to-Speech. </summary>
+`MiniCPM-o-2.6` can also perform Instruction-to-Speech, a.k.a. **Voice Creation**. You can describe a voice in detail, and the model will generate a voice matching the description. For more sample instructions, see https://voxinstruct.github.io/VoxInstruct/.
 
 ```python
 instruction = 'Delighting in a surprised tone, an adult male with low pitch and low volume comments:"One even gave my little dog a biscuit" This dialogue takes place at a leisurely pace, delivering a sense of excitement and surprise in the context. '

@@ -1282,16 +1275,13 @@ res = model.chat(
     output_audio_path='result.wav',
 )
 ```
-</details>
 
 <hr/>
 
 #### Voice Cloning
 
-MiniCPM-o-2.6 can also do zero-shot text-to-speech, aka **Voice Cloning**. With this mode, model will act like a TTS model.
+`MiniCPM-o-2.6` can also perform zero-shot text-to-speech, a.k.a. **Voice Cloning**. In this mode, the model acts like a TTS model.
 
-<details>
-<summary> Click to show Python code running MiniCPM-o 2.6 with voice cloning. </summary>
 
 ```python
 sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')

@@ -1311,13 +1301,12 @@ res = model.chat(
 )
 
 ```
-</details>
 
 <hr/>
 
 #### Addressing Various Audio Understanding Tasks
 
-MiniCPM-o-2.6 can also be used to address various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging.
+`MiniCPM-o-2.6` can also address various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging.
 
 
 For audio-to-text tasks, you can use the following prompts:
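Every snippet in this diff loads audio with `librosa.load('xxx.wav', sr=16000, mono=True)`, i.e. the input is downmixed to mono and resampled to 16 kHz before being passed to `model.chat`. A minimal sketch of what that preprocessing implies (hypothetical helper names; librosa's actual resampler is far higher quality than this length arithmetic suggests):

```python
def downmix_to_mono(stereo):
    """Average left/right channels of [(left, right), ...] sample pairs."""
    return [(left + right) / 2.0 for left, right in stereo]

def resampled_length(n_samples, orig_sr, target_sr=16000):
    """Number of samples after resampling from orig_sr to target_sr."""
    return round(n_samples * target_sr / orig_sr)

# A 10-second clip recorded at 44.1 kHz (441,000 samples) ends up as
# 160,000 samples once librosa resamples it to the model's 16 kHz rate.
n_out = resampled_length(441_000, orig_sr=44_100)
```

This is why the input examples added in this commit can be any sample rate or channel count: the `sr=16000, mono=True` arguments normalize them all to the same representation.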
assets/input_examples/Trump_WEF_2018_10s.mp3 ADDED
Binary file (161 kB)

assets/{audio_understanding.mp3 → input_examples/audio_understanding.mp3} RENAMED
File without changes

assets/input_examples/chi-english-1.wav ADDED
Binary file (492 kB)

assets/input_examples/cxk_original.wav ADDED
Binary file (384 kB)

assets/input_examples/exciting-emotion.wav ADDED
Binary file (696 kB)

assets/input_examples/fast-pace.wav ADDED
Binary file (986 kB)
assets/input_examples/indian-accent.wav ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:716533af9ec8ca4586035af7dce4a65ce93e851d3f2474242d401b50bf3c0cdc
+size 1408844
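The `.wav` above is tracked with Git LFS, so the repository stores only the small text pointer shown in the hunk (a `version` line, the object's SHA-256 `oid`, and its `size` in bytes) rather than the 1.4 MB audio itself. A minimal sketch of reading such a pointer, assuming the simple `key value`-per-line layout shown above:

```python
def parse_lfs_pointer(text):
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

# The exact pointer content added for indian-accent.wav in this commit.
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:716533af9ec8ca4586035af7dce4a65ce93e851d3f2474242d401b50bf3c0cdc
size 1408844
"""
info = parse_lfs_pointer(pointer)
```

Cloning without LFS installed yields these pointer files instead of playable audio, so the input examples require `git lfs pull` (or an LFS-aware download) before they can be loaded with `librosa`.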