update readme, add audio input examples
- README.md +7 -18
- assets/input_examples/Trump_WEF_2018_10s.mp3 +0 -0
- assets/{audio_understanding.mp3 → input_examples/audio_understanding.mp3} +0 -0
- assets/input_examples/chi-english-1.wav +0 -0
- assets/input_examples/cxk_original.wav +0 -0
- assets/input_examples/exciting-emotion.wav +0 -0
- assets/input_examples/fast-pace.wav +0 -0
- assets/input_examples/indian-accent.wav +3 -0
README.md
CHANGED
@@ -1000,7 +1000,7 @@ model.tts.float()
 ```

 ### Omni mode
-
+We provide two inference modes: chat and streaming.

 #### Chat inference
 ```python
@@ -1153,12 +1153,10 @@ model.tts.float()

 `Mimick` task reflects a model's end-to-end speech modeling capability. The model takes audio input, and outputs an ASR transcription and subsequently reconstructs the original audio with high similarity. The higher the similarity between the reconstructed audio and the original audio, the stronger the model's foundational capability in end-to-end speech modeling.

-<details> <summary>Click here to demonstrate the capability of end-to-end audio understanding and generation. </summary>
-
 ```python
 mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
 audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
-msgs = [{'role': 'user', 'content': [mimick_prompt,audio_input]}]
+msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}]
 res = model.chat(
     msgs=msgs,
     tokenizer=tokenizer,
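Editor's note on the `Mimick` hunk above: the README measures foundational capability by the similarity between the reconstructed audio and the original. As an illustration only, here is one hypothetical way to quantify that similarity with a plain waveform cosine score; `waveform_similarity` is not part of the README or the model's API, and real evaluations would more likely compare speaker embeddings or spectrogram distances.

```python
import numpy as np

def waveform_similarity(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """Cosine similarity between two mono waveforms (truncated to equal length), in [-1, 1]."""
    n = min(len(original), len(reconstructed))
    a = original[:n].astype(np.float64)
    b = reconstructed[:n].astype(np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

# A perfect reconstruction scores 1.0; an inverted copy scores -1.0.
t = np.linspace(0, 1, 16000)             # one second at 16 kHz, matching the sr above
tone = np.sin(2 * np.pi * 440 * t)       # synthetic 440 Hz reference tone
print(round(waveform_similarity(tone, tone), 6))  # 1.0
```

This is deliberately crude (zero-lag, amplitude-sensitive), but it makes the "higher similarity means stronger modeling" claim concrete.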
@@ -1171,13 +1169,11 @@ res = model.chat(
 )
 ```

-</details>
-
 <hr/>

 #### General Speech Conversation with Configurable Voices

-A general usage scenario of MiniCPM-o
+A general usage scenario of `MiniCPM-o-2.6` is role-playing a specific character based on the audio prompt. It will mimic the voice of the character to some extent and act like the character in text, including language style. In this mode, `MiniCPM-o-2.6` sounds **more natural and human-like**. Self-defined audio prompts can be used to customize the voice of the character in an end-to-end manner.


 ```python
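Editor's note: the configurable-voice examples in this README all share one message-payload shape, where a reference-audio system prompt is prepended to a mixed text-and-audio user turn. The sketch below shows only that shape with plain dicts; `build_roleplay_msgs` and the placeholder prompt string are hypothetical, inferred from the snippets in this diff rather than taken from an official API.

```python
import numpy as np

def build_roleplay_msgs(sys_prompt: dict, user_text: str, user_audio: np.ndarray) -> list:
    """Prepend the audio-derived system prompt to a mixed text+audio user turn."""
    return [sys_prompt, {'role': 'user', 'content': [user_text, user_audio]}]

# Placeholders standing in for a real 16 kHz mono clip loaded via librosa.load.
ref_audio = np.zeros(16000, dtype=np.float32)
sys_prompt = {'role': 'user', 'content': ['<reference voice prompt>', ref_audio]}  # shape only

msgs = build_roleplay_msgs(sys_prompt, 'Introduce yourself.', np.zeros(8000, dtype=np.float32))
print(len(msgs), msgs[1]['content'][0])  # 2 Introduce yourself.
```

In the README itself the system prompt comes from `model.get_sys_prompt(...)` rather than being built by hand.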
@@ -1219,7 +1215,7 @@ print(res)

 #### Speech Conversation as an AI Assistant

-An enhanced feature of MiniCPM-o-2.6 is to act as an AI assistant, but only with limited choice of voices. In this mode, MiniCPM-o-2.6 is **less human-like and more like a voice assistant**. But it is more instruction-following.
+An enhanced feature of `MiniCPM-o-2.6` is to act as an AI assistant, but only with a limited choice of voices. In this mode, `MiniCPM-o-2.6` is **less human-like and more like a voice assistant**, but it follows instructions more closely.

 ```python
 sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')
@@ -1259,10 +1255,7 @@ print(res)

 #### Instruction-to-Speech

-MiniCPM-o-2.6 can also do Instruction-to-Speech, aka **Voice Creation**. You can describe a voice in detail, and the model will generate a voice that matches the description. For more Instruction-to-Speech sample instructions, you can refer to https://voxinstruct.github.io/VoxInstruct/.
-
-<details>
-<summary> Click to view Python code running MiniCPM-o 2.6 with Instruction-to-Speech. </summary>
+`MiniCPM-o-2.6` can also do Instruction-to-Speech, aka **Voice Creation**. You can describe a voice in detail, and the model will generate a voice that matches the description. For more Instruction-to-Speech sample instructions, you can refer to https://voxinstruct.github.io/VoxInstruct/.

 ```python
 instruction = 'Delighting in a surprised tone, an adult male with low pitch and low volume comments:"One even gave my little dog a biscuit" This dialogue takes place at a leisurely pace, delivering a sense of excitement and surprise in the context. '
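Editor's note: the instruction string above packs several voice attributes (tone, speaker, pitch, volume, pace) around the quoted target text. As a purely illustrative sketch, one could assemble such VoxInstruct-style descriptions from structured fields; `build_voice_instruction` is a hypothetical helper, not part of the README.

```python
def build_voice_instruction(tone: str, speaker: str, pitch: str, volume: str,
                            text: str, pace: str) -> str:
    """Compose a voice-creation instruction in the style of the example above."""
    return (f'Delighting in a {tone} tone, {speaker} with {pitch} pitch and '
            f'{volume} volume comments:"{text}" This dialogue takes place '
            f'at a {pace} pace.')

instruction = build_voice_instruction(
    tone='surprised', speaker='an adult male', pitch='low', volume='low',
    text='One even gave my little dog a biscuit', pace='leisurely',
)
print(instruction.startswith('Delighting in a surprised tone'))  # True
```

Templating the attributes this way makes it easy to sweep pitch, volume, or pace while keeping the spoken text fixed.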
@@ -1282,16 +1275,13 @@ res = model.chat(
     output_audio_path='result.wav',
 )
 ```
-</details>

 <hr/>

 #### Voice Cloning

-MiniCPM-o-2.6 can also do zero-shot text-to-speech, aka **Voice Cloning**. With this mode, model will act like a TTS model.
+`MiniCPM-o-2.6` can also do zero-shot text-to-speech, aka **Voice Cloning**. In this mode, the model acts like a TTS model.

-<details>
-<summary> Click to show Python code running MiniCPM-o 2.6 with voice cloning. </summary>

 ```python
 sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
@@ -1311,13 +1301,12 @@ res = model.chat(
 )

 ```
-</details>

 <hr/>

 #### Addressing Various Audio Understanding Tasks

-MiniCPM-o-2.6 can also be used to address various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging.
+`MiniCPM-o-2.6` can also be used to address various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging.


 For audio-to-text tasks, you can use the following prompts:
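Editor's note: the README's actual task prompts follow the line above but fall outside this hunk. As an illustration only, a task-to-prompt mapping might be organized like this; every string here is a placeholder invented for the sketch, not one of the README's real prompts.

```python
# Placeholder prompts for the four task families named in the hunk above.
AUDIO_TASK_PROMPTS = {
    'asr': 'Please transcribe the speech.',
    'speaker_analysis': 'Describe the speaker of this audio.',
    'audio_caption': 'Describe the audio in general.',
    'scene_tagging': 'Tag the sound scene of this audio.',
}

def prompt_for(task: str) -> str:
    """Look up the text prompt to pair with an audio input for a given task."""
    return AUDIO_TASK_PROMPTS[task]

print(prompt_for('asr'))  # Please transcribe the speech.
```

The selected prompt would then be placed alongside the audio array in the same `{'role': 'user', 'content': [prompt, audio]}` payload used throughout this README.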
assets/input_examples/Trump_WEF_2018_10s.mp3
ADDED
Binary file (161 kB)

assets/{audio_understanding.mp3 → input_examples/audio_understanding.mp3}
RENAMED
File without changes

assets/input_examples/chi-english-1.wav
ADDED
Binary file (492 kB)

assets/input_examples/cxk_original.wav
ADDED
Binary file (384 kB)

assets/input_examples/exciting-emotion.wav
ADDED
Binary file (696 kB)

assets/input_examples/fast-pace.wav
ADDED
Binary file (986 kB)
assets/input_examples/indian-accent.wav
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:716533af9ec8ca4586035af7dce4a65ce93e851d3f2474242d401b50bf3c0cdc
+size 1408844