---
license: cc-by-nc-4.0
library_name: nemo
datasets:
- fisher_english
- NIST_SRE_2004-2010
- librispeech
- ami_meeting_corpus
- voxconverse_v0.3
- icsi
- aishell4
- dihard_challenge-3
- NIST_SRE_2000-Disc8_split1
thumbnail: null
tags:
- speaker-diarization
- speaker-recognition
- speech
- audio
- Transformer
- FastConformer
- Conformer
- NEST
- pytorch
- NeMo
widget:
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: diar_sortformer_4spk-v1
  results:
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: DIHARD3-eval 
      type: dihard3-eval-1to4spks
      config: with_overlap_collar_0.0s
      split: eval
    metrics:
    - name: Test DER
      type: der
      value: 14.76
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: CALLHOME (NIST-SRE-2000 Disc8)
      type: CALLHOME-part2-2spk
      config: with_overlap_collar_0.25s
      split: part2-2spk
    metrics:
    - name: Test DER
      type: der
      value: 5.85
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: CALLHOME (NIST-SRE-2000 Disc8)
      type: CALLHOME-part2-3spk
      config: with_overlap_collar_0.25s
      split: part2-3spk
    metrics:
    - name: Test DER
      type: der
      value: 8.46
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: CALLHOME (NIST-SRE-2000 Disc8)
      type: CALLHOME-part2-4spk
      config: with_overlap_collar_0.25s
      split: part2-4spk
    metrics:
    - name: Test DER
      type: der
      value: 12.59
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: call_home_american_english_speech
      type: CHAES_2spk_109sessions
      config: with_overlap_collar_0.25s
      split: ch109
    metrics:
    - name: Test DER
      type: der
      value: 6.86
metrics:
- der
pipeline_tag: audio-classification
---


# Sortformer Diarizer 4spk v1

<style>
img {
 display: inline;
}
</style>

[![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transformer-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-123M-lightgrey#model-badge)](#model-architecture)
<!-- | [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets) -->

[Sortformer](https://arxiv.org/abs/2409.06656)[1] is a novel end-to-end neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models.

<div align="center">
    <img src="sortformer_intro.png" width="750" />
</div>

Sortformer resolves the permutation problem in diarization by ordering speakers according to the arrival time of each speaker's speech segments.

## Model Architecture

Sortformer consists of an L-size (18-layer) [NeMo Encoder for Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[2], which is based on the [Fast-Conformer](https://arxiv.org/abs/2305.05084)[3] encoder. It is followed by an 18-layer Transformer[4] encoder with a hidden size of 192 and by two feedforward layers that produce 4 sigmoid outputs for each input frame at the top layer. More information can be found in the [Sortformer paper](https://arxiv.org/abs/2409.06656)[1].

<div align="center">
    <img src="sortformer-v1-model.png" width="450" />
</div>

## NVIDIA NeMo

To train, fine-tune, or perform diarization with Sortformer, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)[5]. We recommend installing it after you have installed Cython and the latest version of PyTorch.
```
pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]
```

## How to Use this Model

The model is available for use in the NeMo Framework[5], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Loading the Model

```python
import torch
from nemo.collections.asr.models import SortformerEncLabelModel

# Option 1: load the model from a downloaded .nemo file
diar_model = SortformerEncLabelModel.restore_from(restore_path="/path/to/diar_sortformer_4spk-v1.nemo", map_location=torch.device('cuda'), strict=False)
# Option 2: load the model directly from the Hugging Face model card (requires a Hugging Face token)
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_sortformer_4spk-v1")
```

### Input Format
Input to Sortformer can be an individual audio file:
```python
audio_input="/path/to/multispeaker_audio1.wav"
```
or a list of paths to audio files:
```python
audio_input=["/path/to/multispeaker_audio1.wav", "/path/to/multispeaker_audio2.wav"]
```
or a JSONL manifest file:
```python
audio_input="/path/to/multispeaker_manifest.json"
```
where each line is a dictionary containing the following fields:
```yaml
# Example of a line in `multispeaker_manifest.json`
{
    "audio_filepath": "/path/to/multispeaker_audio1.wav",  # path to the input audio file 
    "offset": 0, # offset (start) time of the input audio
    "duration": 600,  # duration of the audio, can be set to `null` if using NeMo main branch
}
{
    "audio_filepath": "/path/to/multispeaker_audio2.wav",  
    "offset": 900,
    "duration": 580,  
}
```
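
A manifest with these fields can be generated programmatically. The following is a minimal sketch, assuming one JSON object per line; the helper name `write_manifest` and the temporary output location are hypothetical and only serve the illustration.

```python
import json
import os
import tempfile


def write_manifest(entries, manifest_path):
    """Write one JSON object per line (JSONL layout)."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")


# Same fields as in the manifest example: path, offset and duration in seconds.
entries = [
    {"audio_filepath": "/path/to/multispeaker_audio1.wav", "offset": 0, "duration": 600},
    {"audio_filepath": "/path/to/multispeaker_audio2.wav", "offset": 900, "duration": 580},
]

# Hypothetical output location; replace with your own path.
manifest_path = os.path.join(tempfile.mkdtemp(), "multispeaker_manifest.json")
write_manifest(entries, manifest_path)
```

The resulting path can then be passed as `audio_input`.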

### Getting Diarization Results
To perform speaker diarization and get a list of speaker-marked speech segments in the format 'begin_seconds, end_seconds, speaker_index', simply use:
```python3
predicted_segments = diar_model.diarize(audio=audio_input, batch_size=1)
```
To obtain tensors of speaker activity probabilities, use:
```python3
predicted_segments, predicted_probs = diar_model.diarize(audio=audio_input, batch_size=1, include_tensor_outputs=True)
```
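
For downstream processing it can help to turn the segments into typed tuples. The sketch below assumes each segment is a string with the three fields 'begin_seconds, end_seconds, speaker_index' separated by whitespace or commas; the exact return structure of `diarize` may differ between NeMo versions, and the example strings are made up for illustration.

```python
from collections import defaultdict


def parse_segment(segment):
    """Split one 'begin end speaker' string into (float, float, str)."""
    begin, end, speaker = segment.replace(",", " ").split()
    return float(begin), float(end), speaker


# Hypothetical segments for a single audio file (not real model output).
example_segments = ["0.00 3.20 speaker_0", "2.88 7.44 speaker_1"]
parsed = [parse_segment(s) for s in example_segments]

# Total speaking time per speaker, in seconds.
talk_time = defaultdict(float)
for begin, end, speaker in parsed:
    talk_time[speaker] += end - begin
```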

### Input

This model accepts single-channel (mono) audio sampled at 16,000 Hz.
- The actual input tensor is a Ns x 1 matrix for each audio clip, where Ns is the number of samples in the time-series signal. 
- For instance, a 10-second audio clip sampled at 16,000 Hz (mono-channel WAV file) will form a 160,000 x 1 matrix.
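
Before running inference, it may be worth checking that a file matches these requirements. The following is a minimal sketch using Python's standard `wave` module, which only handles uncompressed PCM WAV files; the helper name `check_wav` and the generated demo file are assumptions for illustration.

```python
import os
import tempfile
import wave

EXPECTED_SAMPLE_RATE = 16000  # Hz, as required by the model


def check_wav(path):
    """Return (is_compatible, num_samples) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        is_mono = wav.getnchannels() == 1
        is_16k = wav.getframerate() == EXPECTED_SAMPLE_RATE
        return is_mono and is_16k, wav.getnframes()


# Create a 10-second silent mono clip to demonstrate the check.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 16-bit PCM
    wav.setframerate(EXPECTED_SAMPLE_RATE)
    wav.writeframes(b"\x00\x00" * 10 * EXPECTED_SAMPLE_RATE)

compatible, num_samples = check_wav(demo_path)  # num_samples is Ns
```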

### Output

The output of the model is a T x S matrix, where:  
- S is the maximum number of speakers (in this model, S = 4).  
- T is the total number of frames, including zero-padding. Each frame corresponds to a segment of 0.08 seconds of audio.  
- Each element of the T x S matrix represents the speaker activity probability in the [0, 1] range.  For example, a matrix element a(150, 2) = 0.95 indicates a 95% probability of activity for the second speaker during the time range [12.00, 12.08] seconds.
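
The mapping between frames and time can be sketched as follows. This is a simplified illustration, not the post-processing used by NeMo: it assumes zero-based frame indices, a plain threshold of 0.5 (an arbitrary choice), and a toy probability matrix with made-up values.

```python
FRAME_SECONDS = 0.08  # duration of one output frame


def frame_to_time_range(frame_index):
    """Time span in seconds covered by a zero-based frame index."""
    start = frame_index * FRAME_SECONDS
    return round(start, 2), round(start + FRAME_SECONDS, 2)


def active_segments(probs, speaker, threshold=0.5):
    """Merge consecutive frames above `threshold` into (start, end) segments."""
    segments, start = [], None
    for t, row in enumerate(probs):
        active = row[speaker] > threshold
        if active and start is None:
            start = t  # a new active run begins
        elif not active and start is not None:
            segments.append((round(start * FRAME_SECONDS, 2), round(t * FRAME_SECONDS, 2)))
            start = None
    if start is not None:  # run continues to the last frame
        segments.append((round(start * FRAME_SECONDS, 2), round(len(probs) * FRAME_SECONDS, 2)))
    return segments


# Toy T x S matrix with T = 5 frames and S = 4 speakers (values are made up).
toy_probs = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.7, 0.0, 0.0],
    [0.2, 0.9, 0.0, 0.0],
    [0.1, 0.8, 0.0, 0.0],
    [0.1, 0.2, 0.0, 0.0],
]
speaker0_segments = active_segments(toy_probs, speaker=0)
speaker1_segments = active_segments(toy_probs, speaker=1)
```

In this toy matrix the two speakers overlap in the second frame, which is why both segment lists cover the range from 0.08 to 0.16 seconds.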


## Train and evaluate Sortformer diarizer using NeMo
### Training

Sortformer diarizer models were trained on 8 nodes, each with 8 NVIDIA Tesla V100 GPUs. We use 90-second training samples and a batch size of 4.
The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/sortformer_diar_train.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/conf/neural_diarizer/sortformer_diarizer_hybrid_loss_4spk-v1.yaml).

### Evaluation

To evaluate Sortformer diarizer and save diarization results in RTTM format, use the inference [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py):
```bash
python ${NEMO_GIT_FOLDER}/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py \
 model_path="/path/to/diar_sortformer_4spk-v1.nemo" \
 manifest_filepath="/path/to/multispeaker_manifest_with_reference_rttms.json" \
 collar=COLLAR \
 out_rttm_dir="/path/to/output_rttms"
```

You can provide the post-processing YAML configs from the [`post_processing` folder](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing) to reproduce the post-processing parameters optimized for each development dataset:
```bash
python ${NEMO_GIT_FOLDER}/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py \
 model_path="/path/to/diar_sortformer_4spk-v1.nemo" \
 manifest_filepath="/path/to/multispeaker_manifest_with_reference_rttms.json" \
 collar=COLLAR \
 bypass_postprocessing=False \
 postprocessing_yaml="/path/to/postprocessing_config.yaml" \
 out_rttm_dir="/path/to/output_rttms"
```
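
For reference, an RTTM file stores one speaker turn per line as ten space-separated fields, with `<NA>` marking unused fields. The sketch below formats and parses such `SPEAKER` records; the helper names and the example values are hypothetical, and the inference script writes these files itself, so this is only meant to show the layout.

```python
def to_rttm_line(file_id, begin, end, speaker):
    """Format one speaker turn as an RTTM SPEAKER record (onset, duration)."""
    return (
        f"SPEAKER {file_id} 1 {begin:.3f} {end - begin:.3f} "
        f"<NA> <NA> {speaker} <NA> <NA>"
    )


def parse_rttm_line(line):
    """Return (file_id, begin, end, speaker) from an RTTM SPEAKER record."""
    fields = line.split()
    onset, duration = float(fields[3]), float(fields[4])
    return fields[1], onset, onset + duration, fields[7]


# Hypothetical turn: speaker_1 talks from 12.0 s to 15.5 s.
line = to_rttm_line("multispeaker_audio1", 12.0, 15.5, "speaker_1")
turn = parse_rttm_line(line)
```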

### Technical Limitations

- The model operates in a non-streaming mode (offline mode).
- It can detect a maximum of 4 speakers; performance degrades on recordings with 5 or more speakers.
- The maximum duration of a test recording depends on available GPU memory. On an RTX A6000 GPU with 48 GB of memory, the limit is around 12 minutes.
- The model was trained on publicly available speech datasets, primarily in English. As a result:
    * Performance may degrade on non-English speech.
    * Performance may also degrade on out-of-domain data, such as recordings in noisy conditions.


## Datasets

Sortformer was trained on a combination of 2030 hours of real conversations and 5150 hours of simulated audio mixtures generated by the [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)[6].
All the datasets listed in this section use the same labeling method, the [RTTM](https://web.archive.org/web/20100606092041if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf) format. A subset of the RTTM files used for model training is processed for speaker diarization model training purposes.
Data collection methods vary across individual datasets. For example, these datasets include phone calls, interviews, web videos, and audiobook recordings. Please refer to the [Linguistic Data Consortium (LDC) website](https://www.ldc.upenn.edu/) or the dataset webpage for detailed data collection methods.


### Training Datasets (Real conversations)
- Fisher English (LDC)
- 2004-2010 NIST Speaker Recognition Evaluation (LDC)
- Librispeech 
- AMI Meeting Corpus
- VoxConverse-v0.3
- ICSI
- AISHELL-4
- Third DIHARD Challenge Development (LDC)
- 2000 NIST Speaker Recognition Evaluation, split1 (LDC)

### Training Datasets (Used to simulate audio mixtures)
- 2004-2010 NIST Speaker Recognition Evaluation (LDC)
- Librispeech

## Performance


### Evaluation dataset specifications

| **Dataset**                   | **DIHARD3-Eval**   | **CALLHOME-part2**  | **CALLHOME-part2**  | **CALLHOME-part2**  | **CH109**          |
|:------------------------------|:------------------:|:-------------------:|:-------------------:|:-------------------:|:------------------:|
| **Number of Speakers**        | ≤ 4 speakers       | 2 speakers          | 3 speakers          | 4 speakers          | 2 speakers         |
| **Collar (sec)**              | 0.0s               | 0.25s               | 0.25s               | 0.25s               | 0.25s              |
| **Mean Audio Duration (sec)** | 453.0s             | 73.0s               | 135.7s              | 329.8s              | 552.9s             |

### Diarization Error Rate (DER)

* All evaluations include overlapping speech.  
* Bolded and italicized numbers represent the best-performing Sortformer evaluations.
* Post-Processing (PP) is optimized on two different held-out dataset splits. 
    - [YAML file for DH3-dev Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/sortformer_diar_4spk-v1_dihard3-dev.yaml)    
    - [YAML file for CallHome-part1 Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/sortformer_diar_4spk-v1_callhome-part1.yaml)    


| **Dataset**                                               | **DIHARD3-Eval**   | **CALLHOME-part2**  | **CALLHOME-part2**  | **CALLHOME-part2**  | **CH109**          |
|:----------------------------------------------------------|:------------------:|:-------------------:|:-------------------:|:-------------------:|:------------------:|
| DER **diar_sortformer_4spk-v1**                           | 16.28              | 6.49                | 10.01               | 14.14               | **_6.27_**         |
| DER **diar_sortformer_4spk-v1 + DH3-dev Opt. PP**         | **_14.76_**        | -                   | -                   | -                   | -                  |
| DER **diar_sortformer_4spk-v1 + CallHome-part1 Opt. PP**  | -                  | **_5.85_**          | **_8.46_**          | **_12.59_**         | 6.86               |

### Real Time Factor (RTFx)

All tests were measured on RTX A6000 48GB with batch size of 1. Post-processing is not included in RTFx calculations.

| **Datasets**                      |  **DIHARD3-Eval**   | **CALLHOME-part2**  | **CALLHOME-part2**  | **CALLHOME-part2**  | **CH109**          |
|:----------------------------------|:-------------------:|:-------------------:|:-------------------:|:-------------------:|:------------------:|
| RTFx **diar_sortformer_4spk-v1**  |  437                | 1053                | 915                 | 545                 | 415                |
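
To put these numbers in perspective, the sketch below estimates the mean processing time per recording. It assumes RTFx is the inverse real-time factor, that is, audio duration divided by processing time; this definition is not stated in this card. The durations and RTFx values are copied from the tables in this section.

```python
# Mean audio duration in seconds, from "Evaluation dataset specifications".
mean_duration_s = {
    "DIHARD3-Eval": 453.0,
    "CALLHOME-part2 (2 spk)": 73.0,
    "CALLHOME-part2 (3 spk)": 135.7,
    "CALLHOME-part2 (4 spk)": 329.8,
    "CH109": 552.9,
}

# RTFx values from the table above.
rtfx = {
    "DIHARD3-Eval": 437,
    "CALLHOME-part2 (2 spk)": 1053,
    "CALLHOME-part2 (3 spk)": 915,
    "CALLHOME-part2 (4 spk)": 545,
    "CH109": 415,
}

# Assumed relation: processing_time = audio_duration / RTFx.
processing_time_s = {name: mean_duration_s[name] / rtfx[name] for name in rtfx}

for name, seconds in processing_time_s.items():
    print(f"{name}: {seconds:.2f} s per recording")
```

Under this assumption, every evaluation set averages less than 1.5 seconds of model inference per recording, excluding post-processing.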


## NVIDIA Riva: Deployment

[NVIDIA Riva](https://developer.nvidia.com/riva) is an accelerated speech AI SDK deployable on-prem, in all clouds, multi-cloud, hybrid, on edge, and embedded.
Additionally, Riva provides: 

* World-class out-of-the-box accuracy for the most common languages with model checkpoints trained on proprietary data with hundreds of thousands of GPU-compute hours 
* Best in class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization 
* Streaming speech recognition, Kubernetes compatible scaling, and enterprise-grade support 

This model is not yet supported by Riva; see the [list of supported models](https://huggingface.co/models?other=Riva).  
Check out [Riva live demo](https://developer.nvidia.com/riva#demos). 


## References
[1] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)

[2] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106)

[3] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)

[4] [Attention is all you need](https://arxiv.org/abs/1706.03762)

[5] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)

[6] [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)

## License

License to use this model is covered by the [CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/legalcode). By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-NC-4.0 license.