File size: 2,872 Bytes
49babd2 7eacacf 49babd2 7eacacf 49babd2 d9290bd 49babd2 d9290bd 7eacacf 1b27420 64a7de7 183a43c 60d6884 183a43c 1b27420 477312d 1b27420 477312d 1b27420 74cd139 9c1f1a8 21c6ff5 9c1f1a8 74cd139 1b27420 3e30f71 1b27420 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 |
---
language:
- en
license: apache-2.0
library_name: transformers
datasets:
- mozilla-foundation/common_voice_16_1
- openslr/librispeech_asr
metrics:
- wer
model-index:
- name: SpeechLLM
results:
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: LibriSpeech (clean)
type: librispeech_asr
config: clean
split: test
args:
language: en
metrics:
- type: wer
value: 12.3
name: Test WER
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: LibriSpeech (other)
type: librispeech_asr
config: other
split: test
args:
language: en
metrics:
- type: wer
value: 18.9
name: Test WER
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: Common Voice 16.1
type: common_voice_16_1
split: test
args:
language: en
metrics:
- type: wer
value: 25.01
name: Test WER
---
# SpeechLLM
[The model is still training, we will be releasing the latest checkpoints soon...]
SpeechLLM is a multi-modal LLM trained to predict the metadata of the speaker's turn in a conversation. SpeechLLM model is based on HubertX acoustic encoder and TinyLlama LLM. The model predicts the following:
1. **SpeechActivity** : if the audio signal contains speech (True/False)
2. **Transcript** : ASR transcript of the audio
3. **Gender** of the speaker (Female/Male)
4. **Age** of the speaker (Young/Middle-Age/Senior)
5. **Accent** of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
6. **Emotion** of the speaker (Happy/Sad/Anger/Neutral/Frustrated)
## Usage
```python
# Load model directly from huggingface
from transformers import AutoModel
model = AutoModel.from_pretrained("skit-ai/SpeechLLM", trust_remote_code=True)
model.generate_meta(
audio_path="path-to-audio.wav",
instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
max_new_tokens=500,
return_special_tokens=False
)
# Model Generation
'''
{
"SpeechActivity" : "True",
"Transcript": "Yes, I got it. I'll make the payment now.",
"Gender": "Female",
"Emotion": "Neutral",
"Age": "Young",
"Accent" : "America",
}
'''
```
## Model Details
- Model Size : 2.1 B
- Checkpoint : 2000 k steps (bs=1)
- Adapters : r=4, alpha=8
- lr = 1e-4
- gradient accumulation steps : 8
## Checkpoint Result
| Dataset | Word Error Rate | Gender Acc|
|:----------------------:|:------------------:|:---------:|
| librispeech-test-clean | 0.1230 | 0.8778 |
| librispeech-test-other | 0.1890 | 0.8908 |
| CommonVoice test | 0.2501 | 0.8753 | |