yangwang825 committed on
Commit
e585064
·
1 Parent(s): d8e5505

Update README.md

Files changed (1)
  1. README.md +108 -1
README.md CHANGED
@@ -1,3 +1,110 @@
  ---
- license: apache-2.0
  ---
  ---
+ language: "en"
+ thumbnail:
+ tags:
+ - speechbrain
+ - embeddings
+ - Speaker
+ - Verification
+ - Identification
+ - pytorch
+ - E-TDNN
+ license: "apache-2.0"
+ datasets:
+ - voxceleb
+ metrics:
+ - EER
+ - Accuracy
+ inference: true
+ widget:
+ - example_title: VoxCeleb Speaker id10003
+   src: https://cdn-media.huggingface.co/speech_samples/VoxCeleb1_00003.wav
+ - example_title: VoxCeleb Speaker id10004
+   src: https://cdn-media.huggingface.co/speech_samples/VoxCeleb_00004.wav
  ---
+
+ # Speaker Identification with E-TDNN embeddings on VoxCeleb
+
+ This repository provides a pretrained E-TDNN (x-vector) model built with SpeechBrain. The system can also be used to extract speaker embeddings. Because we could not find any SpeechBrain- or HuggingFace-compatible checkpoint trained only on the VoxCeleb2 development data, we decided to pre-train an E-TDNN system from scratch.
+
+ # Pipeline description
+
+ This system is built around an E-TDNN (x-vector) model, which combines convolutional and residual blocks. Embeddings are extracted with temporal statistics pooling, and the system is trained with the Additive Margin Softmax loss.
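
The two ideas named above, statistics pooling and the additive margin softmax, can be illustrated with a minimal pure-Python sketch. This is not the repository's training code: the function names `statistics_pooling` and `am_softmax_loss` are hypothetical, the use of the population standard deviation is an assumption, and `margin=0.2` and `scale=30.0` are illustrative values that the model card does not state.

```python
import math


def statistics_pooling(frames):
    """Collapse a (time x channels) list of frame vectors into one vector.

    Returns the per-channel mean followed by the per-channel standard
    deviation (population form, an assumption), so the output length is
    twice the channel count.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    means = [sum(f[c] for f in frames) / num_frames for c in range(num_channels)]
    stds = [
        math.sqrt(sum((f[c] - means[c]) ** 2 for f in frames) / num_frames)
        for c in range(num_channels)
    ]
    return means + stds


def am_softmax_loss(cosines, target, margin=0.2, scale=30.0):
    """Additive margin softmax loss for a single example.

    `cosines` holds the cosine similarity between the embedding and each
    class weight vector. The margin is subtracted from the target class
    only, then all logits are scaled before a standard cross-entropy.
    margin and scale are illustrative, not the values used for this model.
    """
    logits = [
        scale * (cos - margin if idx == target else cos)
        for idx, cos in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp over the scaled logits.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


# Two frames with two channels each: output is [mean_0, mean_1, std_0, std_1].
print(statistics_pooling([[1.0, 2.0], [3.0, 6.0]]))  # [2.0, 4.0, 1.0, 2.0]
```

Because the margin lowers only the target logit, the same cosine similarities yield a larger loss than plain softmax, which pushes the network to separate speakers by a wider angular gap.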
+
+ We use FBank features (16 kHz audio, 25 ms frame length, 10 ms hop length, 80 filter-bank channels) as input. The model was trained for 30 epochs on 4 A100 GPUs with an initial learning rate of 0.001, a batch size of 512, and a linear learning-rate scheduler. We add noise and reverberation from the [MUSAN](http://www.openslr.org/17/) and [RIR](http://www.openslr.org/28/) datasets to augment the training data. Pre-training the E-TDNN model takes approximately seven days.
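
To make the feature settings concrete, the sketch below converts the stated frame and hop lengths into samples and estimates the FBank matrix shape for a clip. The helper names are hypothetical, and the frame count assumes no padding at the edges; a feature extractor that pads or centers frames will return a slightly different number.

```python
SAMPLE_RATE_HZ = 16_000   # from the model card
FRAME_LENGTH_MS = 25      # from the model card
HOP_LENGTH_MS = 10        # from the model card
NUM_MEL_BINS = 80         # from the model card


def ms_to_samples(duration_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert a duration in milliseconds to a whole number of samples."""
    return sample_rate_hz * duration_ms // 1000


def fbank_shape(num_samples):
    """Return (frames, channels) for a clip, assuming no edge padding."""
    frame_len = ms_to_samples(FRAME_LENGTH_MS)
    hop_len = ms_to_samples(HOP_LENGTH_MS)
    if num_samples < frame_len:
        return (0, NUM_MEL_BINS)
    num_frames = 1 + (num_samples - frame_len) // hop_len
    return (num_frames, NUM_MEL_BINS)


print(ms_to_samples(FRAME_LENGTH_MS))  # 400
print(ms_to_samples(HOP_LENGTH_MS))    # 160
print(fbank_shape(SAMPLE_RATE_HZ))     # (98, 80) for one second of audio
```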
+
+ # Performance
+
+ **VoxCeleb1-O** is the original VoxCeleb1 verification test set, consisting of 40 speakers; all speakers whose names start with "E" are reserved for testing. **VoxCeleb1-E** uses the entire VoxCeleb1 dataset, covering 1251 speakers. **VoxCeleb1-H** is a harder evaluation set in which both speakers of every pair share the same nationality and gender; its original trial list consists of 552536 pairs from 1190 speakers, drawn from 18 nationality-gender combinations with at least 5 individuals each.
+
+ | Split       | Backend | S-norm | EER (%) | minDCF (0.01) |
+ |:-----------:|:-------:|:------:|:-------:|:-------------:|
+ | VoxCeleb1-O | cosine  | no     | 2.27    | 0.21          |
+ | VoxCeleb1-E | cosine  | no     | TBD     | TBD           |
+ | VoxCeleb1-H | cosine  | no     | TBD     | TBD           |
+
+ - VoxCeleb1-O: 37611 test pairs from 40 speakers.
+ - VoxCeleb1-E: 579818 test pairs from 1251 speakers.
+ - VoxCeleb1-H: 550894 test pairs from 1190 speakers.
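
For readers unfamiliar with the two metrics in the table, here is a rough pure-Python sketch of how EER and minDCF can be computed from trial scores. It is not the evaluation script behind the reported numbers. The function names are hypothetical, the EER is approximated at the threshold where the two error rates are closest, and reading "minDCF(0.01)" as a target prior of 0.01 with unit miss and false-alarm costs is an assumption.

```python
def error_rates(scores, labels, threshold):
    """Return (false_accept_rate, false_reject_rate) at a threshold.

    A trial is accepted when its score is >= threshold. Labels are 1 for
    same-speaker (target) trials and 0 for different-speaker trials.
    """
    targets = [s for s, y in zip(scores, labels) if y == 1]
    nontargets = [s for s, y in zip(scores, labels) if y == 0]
    far = sum(s >= threshold for s in nontargets) / len(nontargets)
    frr = sum(s < threshold for s in targets) / len(targets)
    return far, frr


def equal_error_rate(scores, labels):
    """Approximate EER: average FAR and FRR where they are closest."""
    best = min(
        (error_rates(scores, labels, t) for t in sorted(set(scores))),
        key=lambda rates: abs(rates[0] - rates[1]),
    )
    return (best[0] + best[1]) / 2


def min_dcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Normalized minimum detection cost over all candidate thresholds."""
    # An infinite threshold covers the "reject every trial" operating point.
    thresholds = sorted(set(scores)) + [float("inf")]
    costs = []
    for t in thresholds:
        far, frr = error_rates(scores, labels, t)
        costs.append(c_miss * p_target * frr + c_fa * (1 - p_target) * far)
    # Normalize by the cost of the better trivial system (accept or reject all).
    default_cost = min(c_miss * p_target, c_fa * (1 - p_target))
    return min(costs) / default_cost


# Toy trial list: three target trials followed by three non-target trials.
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
print(round(equal_error_rate(toy_scores, toy_labels), 3))  # 0.333
print(round(min_dcf(toy_scores, toy_labels), 3))           # 0.333
```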
+
+ # Compute the speaker embeddings
+
+ The system is trained on single-channel recordings sampled at 16 kHz.
+
+ ```python
+ import torch
+ import torchaudio
+ # In SpeechBrain 1.0 and later, this module is available as speechbrain.inference.interfaces.
+ from speechbrain.pretrained.interfaces import Pretrained
+
+
+ class Encoder(Pretrained):
+     """Wraps the pretrained E-TDNN to turn waveforms into speaker embeddings."""
+
+     MODULES_NEEDED = [
+         "compute_features",
+         "mean_var_norm",
+         "embedding_model",
+     ]
+
+     def encode_batch(self, wavs, wav_lens=None, normalize=False):
+         # Accept a single 1-D waveform by adding a batch dimension
+         if len(wavs.shape) == 1:
+             wavs = wavs.unsqueeze(0)
+
+         # Treat every waveform as full length when wav_lens is not given
+         if wav_lens is None:
+             wav_lens = torch.ones(wavs.shape[0], device=self.device)
+
+         # Move the inputs to the model's device
+         wavs, wav_lens = wavs.to(self.device), wav_lens.to(self.device)
+         wavs = wavs.float()
+
+         # Compute features, normalize them, and extract embeddings
+         feats = self.mods.compute_features(wavs)
+         feats = self.mods.mean_var_norm(feats, wav_lens)
+         embeddings = self.mods.embedding_model(feats, wav_lens)
+         if normalize:
+             embeddings = self.hparams.mean_var_norm_emb(
+                 embeddings,
+                 torch.ones(embeddings.shape[0], device=self.device)
+             )
+         return embeddings
+
+
+ classifier = Encoder.from_hparams(
+     source="yangwang825/etdnn-vox2"
+ )
+ signal, fs = torchaudio.load('spk1_snt1.wav')
+ embeddings = classifier.encode_batch(signal)
+ print(embeddings.shape)  # torch.Size([1, 1, 192])
+ ```
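
Since the results table uses a cosine backend, two embeddings can be compared with a plain cosine similarity. The sketch below is a pure-Python illustration, not part of this repository: `cosine_score` and `same_speaker` are hypothetical helpers, and the `0.25` threshold is a placeholder that should be tuned on held-out trials. A tensor returned by `encode_batch` can be flattened to a list with `embeddings.squeeze().tolist()` before being passed in.

```python
import math


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.25):
    """Accept the pair when the cosine score reaches the threshold.

    The default threshold is a placeholder, not a value from the model card.
    """
    return cosine_score(emb_a, emb_b) >= threshold


# Toy 2-D vectors stand in for real speaker embeddings.
print(cosine_score([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(same_speaker([1.0, 0.0], [0.0, 1.0]))  # False
```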
+
+ We will release our training results (models, logs, etc.) shortly.
+
+ # References
+
+ 1. Ravanelli et al., SpeechBrain: A General-Purpose Speech Toolkit, 2021
+ 2. Snyder et al., The JHU Speaker Recognition System for the VOiCES 2019 Challenge, 2019