|
--- |
|
datasets:
- librispeech_asr
- declare-lab/MELD
- PolyAI/minds14
- google/fleurs
language:
- en
metrics:
- accuracy
- f1
- mae
- pearsonr
- exact_match
tags:
- audio
- speech
- pre-training
- spoken language understanding
- music
license: apache-2.0
|
--- |
|
|
|
**Repository:** https://github.com/declare-lab/segue |
|
|
|
**Paper:** https://arxiv.org/abs/2305.12301 |
|
|
|
SEGUE is a pre-training approach for sequence-level spoken language understanding (SLU) tasks. |
|
We use knowledge distillation on a parallel speech-text corpus (e.g. an ASR corpus) to distil |
|
language understanding knowledge from a textual sentence embedder to a pre-trained speech encoder. |
|
SEGUE applied to Wav2Vec 2.0 improves performance on many SLU tasks, including
intent classification, slot filling, spoken sentiment analysis, and spoken emotion classification.
These improvements were observed in fine-tuned, frozen-encoder, and few-shot settings.
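
As a rough illustration of the idea (not the actual training code: the teacher/student callables are hypothetical stand-ins, and the MSE objective is a simplified assumption; see the paper for the exact objective):

```python3
import torch
import torch.nn.functional as F

def distillation_loss(speech_encoder, sentence_embedder, audio, transcript):
    """One distillation step in the spirit of SEGUE: push the student's
    speech embedding towards the frozen text teacher's sentence embedding.
    Both models are assumed to output embeddings of the same dimension;
    MSE is one possible choice of distillation loss."""
    with torch.no_grad():
        target = sentence_embedder(transcript)   # frozen teacher embedding
    prediction = speech_encoder(audio)           # trainable student embedding
    return F.mse_loss(prediction, target)
```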
|
|
|
## How to Get Started with the Model |
|
|
|
To use this model checkpoint, you need the model classes from [our GitHub repository](https://github.com/declare-lab/segue).
|
|
|
```python3 |
|
import soundfile

from segue.modeling_segue import SegueModel

# assuming example.wav is 16 kHz mono audio (the rate Wav2Vec 2.0 expects)
raw_audio_array, sampling_rate = soundfile.read('example.wav')

model = SegueModel.from_pretrained('declare-lab/segue-w2v2-base')
inputs = model.processor(audio=raw_audio_array, sampling_rate=sampling_rate)
outputs = model(**inputs)
|
``` |
|
|
|
You do not need to create the `Processor` yourself; it is already available as `model.processor`.
|
|
|
`SegueForRegression` and `SegueForClassification` are also available. For classification,
the number of classes can be specified through the `n_classes` field in the model config,
e.g. `SegueForClassification.from_pretrained('declare-lab/segue-w2v2-base', n_classes=7)`.
Multi-label classification is also supported, e.g. `n_classes=[3, 7]` for two labels with 3 and 7 classes respectively.
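
As a minimal sketch (assuming the classification head returns a Hugging Face-style output with a `logits` field; `example.wav` is a placeholder path):

```python3
import soundfile
import torch

from segue.modeling_segue import SegueForClassification

# assuming example.wav is 16 kHz mono audio
raw_audio_array, sampling_rate = soundfile.read('example.wav')

# 7-way single-label classification head on top of the SEGUE encoder
model = SegueForClassification.from_pretrained('declare-lab/segue-w2v2-base', n_classes=7)
inputs = model.processor(audio=raw_audio_array, sampling_rate=sampling_rate)

with torch.no_grad():
    outputs = model(**inputs)
predicted_class = outputs.logits.argmax(dim=-1)  # assumes an HF-style `logits` field
```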
|
|
|
Pre-training and downstream task training scripts are available on [our GitHub repository](https://github.com/declare-lab/segue). |
|
|
|
## Results |
|
|
|
We show only simplified MInDS-14 and MELD results for brevity. |
|
Please refer to the paper for full results. |
|
|
|
### MInDS-14 (intent classification) |
|
|
|
*Note: we used only the en-US subset of MInDS-14.* |
|
|
|
#### Fine-tuning |
|
|
|
|Model|Accuracy (%)|
|-|-|
|w2v 2.0|89.4±2.3|
|SEGUE|**97.6±0.5**|
|
|
|
*Note: Wav2Vec 2.0 fine-tuning was unstable. Only 3 out of 6 runs converged; the results shown are taken from the converged runs only.*
|
|
|
#### Frozen encoder |
|
|
|
|Model|Accuracy (%)|
|-|-|
|w2v 2.0|54.0|
|SEGUE|**77.9**|
|
|
|
### MELD (sentiment and emotion classification) |
|
|
|
#### Fine-tuning |
|
|
|
|Model|Sentiment F1 (%)|Emotion F1 (%)|
|-|-|-|
|w2v 2.0|47.3|39.3|
|SEGUE|53.2|41.1|
|SEGUE (higher LR)|**54.1**|**47.2**|
|
|
|
*Note: Wav2Vec 2.0 fine-tuning was unstable at the higher LR.* |
|
|
|
#### Frozen encoder |
|
|
|
|Model|Sentiment F1 (%)|Emotion F1 (%)|
|-|-|-|
|w2v 2.0|45.0±0.7|34.3±1.2|
|SEGUE|**45.8±0.1**|**35.7±0.3**|
|
|
|
## Limitations |
|
|
|
In the paper, we hypothesized that SEGUE may perform worse on tasks that rely less on
understanding and more on word detection. This may explain why SEGUE failed to
improve upon Wav2Vec 2.0 on the Fluent Speech Commands (FSC) task. To demonstrate this
further, we also experimented with an ASR task (FLEURS), which relies heavily on word detection.
|
|
|
However, this does not mean that SEGUE performs worse on intent classification tasks
in general. Despite also being an intent classification task, MInDS-14 benefited
greatly from SEGUE, as its more free-form utterances may rely more on
understanding.
|
|
|
## Citation |
|
|
|
```bibtex |
|
@inproceedings{segue2023,
  title={Sentence Embedder Guided Utterance Encoder (SEGUE) for Spoken Language Understanding},
  author={Tan, Yi Xuan and Majumder, Navonil and Poria, Soujanya},
  booktitle={Interspeech},
  year={2023}
}
|
``` |