Aku Rouhe committed on
Commit ab2ad71 · 1 Parent(s): 04071ac

Add model card

Files changed (1):
  1. README.md +172 -0

README.md ADDED
---
language: "en"
thumbnail:
tags:
- ASR
- CTC
- Attention
- pytorch
license: "apache-2.0"
datasets:
- librispeech
metrics:
- wer
- cer
---

# CRDNN with CTC/Attention and RNNLM trained on LibriSpeech

This repository provides all the necessary tools to perform automatic speech
recognition with an end-to-end system pretrained on LibriSpeech (EN) within
SpeechBrain. For a better experience, we encourage you to learn more about
[SpeechBrain](https://speechbrain.github.io). The performance of the given ASR models is:

| Release | hyperparams file | Test WER | Model link | GPUs |
|:-------:|:----------------:|---------:|-----------:|-----:|
| 20-05-22 | BPE_1000.yaml | 3.08 | Not Available | 1xV100 32GB |
| 20-05-22 | BPE_5000.yaml | 2.89 | Not Available | 1xV100 32GB |

## Pipeline description

This ASR system is composed of three different but linked blocks:
1. A tokenizer (unigram) that transforms words into subword units, trained on
the training transcriptions of LibriSpeech.
2. A neural language model (RNNLM) trained on the full 10M-word dataset.
3. An acoustic model (CRDNN + CTC/Attention). The CRDNN architecture is made of
N blocks of convolutional neural networks with normalisation and pooling along the
frequency axis. A bidirectional LSTM is then connected to a final DNN that produces
the acoustic representation given to the CTC and attention decoders (a minimal
sketch of this layout follows the list).

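The sketch below illustrates that CRDNN layout with plain PyTorch modules. It is
only structural: the number of blocks, channel counts, and layer sizes are
illustrative assumptions, not the pretrained model's actual hyperparameters.

```python
import torch
import torch.nn as nn

class CRDNNSketch(nn.Module):
    """Illustrative CRDNN: conv blocks -> BiLSTM -> DNN (sizes are assumed)."""
    def __init__(self, n_mels=40, rnn_size=256, dnn_size=512):
        super().__init__()
        # N (= 2 here) convolutional blocks with normalisation and pooling
        # along the frequency axis
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),  # halve the frequency axis
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),
        )
        # Bidirectional LSTM over the flattened convolutional features
        self.rnn = nn.LSTM(32 * (n_mels // 4), rnn_size,
                           batch_first=True, bidirectional=True)
        # Final DNN producing the acoustic representation for the decoders
        self.dnn = nn.Sequential(nn.Linear(2 * rnn_size, dnn_size), nn.ReLU())

    def forward(self, feats):                 # feats: (batch, time, n_mels)
        x = self.conv(feats.unsqueeze(1))     # (batch, 32, time, n_mels // 4)
        x = x.permute(0, 2, 1, 3).flatten(2)  # (batch, time, 32 * n_mels // 4)
        x, _ = self.rnn(x)                    # (batch, time, 2 * rnn_size)
        return self.dnn(x)                    # given to CTC / attention decoders

out = CRDNNSketch()(torch.randn(1, 100, 40))  # -> shape (1, 100, 512)
```
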
## Intended uses & limitations

This model has been primarily developed to be run within SpeechBrain as a pretrained ASR model
for the English language. Thanks to the flexibility of SpeechBrain, any of the three blocks
detailed above can be extracted and connected to your custom pipeline as long as SpeechBrain is
installed.

## Install SpeechBrain

First of all, please install SpeechBrain with the following command:

```
pip install  # we hide it! SpeechBrain is still private :p
```

Please note that we encourage you to read our tutorials and learn more about
[SpeechBrain](https://speechbrain.github.io).

### Transcribing your own audio files

```python
import torch
import torchaudio
import speechbrain
from speechbrain.lobes.pretrained.librispeech.asr_crdnn_ctc_att_rnnlm.acoustic import ASR

asr_model = ASR()

# Make sure your audio is sampled at 16 kHz.
audio_file = 'path_to_your_audio_file'
wav, fs = torchaudio.load(audio_file)
# Relative length of each signal in the batch (1.0 = full, unpadded length).
wav_lens = torch.tensor([1]).float()

# Transcribe!
words, tokens = asr_model.transcribe(wav, wav_lens)
print(words)
```
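
Several files can also be batched together. The sketch below is an assumption
built on the relative-length convention used above (it presumes `transcribe`
accepts a zero-padded batch, and the file names are placeholders):

```python
import torch
import torch.nn.functional as F
import torchaudio

files = ['first.wav', 'second.wav']  # hypothetical paths
wavs = [torchaudio.load(f)[0].squeeze(0) for f in files]

# Zero-pad every signal to the longest one and record relative lengths.
max_len = max(w.shape[0] for w in wavs)
batch = torch.stack([F.pad(w, (0, max_len - w.shape[0])) for w in wavs])
wav_lens = torch.tensor([w.shape[0] / max_len for w in wavs]).float()

words, tokens = asr_model.transcribe(batch, wav_lens)
```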

### Obtaining encoded features

The SpeechBrain ASR() class provides an easy way to encode the speech signal
without running the decoding phase, so one can obtain the output of the
CRDNN model directly.

```python
import torch
import torchaudio
import speechbrain
from speechbrain.lobes.pretrained.librispeech.asr_crdnn_ctc_att_rnnlm.acoustic import ASR

asr_model = ASR()

# Make sure your audio is sampled at 16 kHz.
audio_file = 'path_to_your_audio_file'
wav, fs = torchaudio.load(audio_file)
wav_lens = torch.tensor([1]).float()

# Encode without decoding!
encoded_features = asr_model.encode(wav, wav_lens)
print(encoded_features)
```
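
These features can then feed any custom downstream module. A hypothetical
example (it assumes the encoder output is a `(batch, time, features)` tensor
with 512 features, which is not confirmed by this card):

```python
import torch.nn as nn

classifier = nn.Linear(512, 10)        # hypothetical 10-class head
pooled = encoded_features.mean(dim=1)  # average over the time axis
logits = classifier(pooled)
```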

### Playing with the language model only

Thanks to SpeechBrain lobes, it is possible to instantiate the language
model on its own for further processing in your custom pipeline:

```python
import torch
import speechbrain
from speechbrain.lobes.pretrained.librispeech.asr_crdnn_ctc_att_rnnlm.lm import LM

lm = LM()

text = "THE CAT IS ON"

# Next-word prediction
encoded_text = lm.tokenizer.encode_as_ids(text)
encoded_text = torch.tensor(encoded_text).unsqueeze(0)  # integer ids + batch dim
prob_out, _ = lm(encoded_text.to(lm.device))
index = int(torch.argmax(prob_out[0, -1, :]))
print(lm.tokenizer.decode(index))

# Text generation: greedily append the most likely next token 19 times
encoded_text = torch.tensor([0, 2])  # bos token + the
encoded_text = encoded_text.unsqueeze(0).to(lm.device)
for i in range(19):
    prob_out, _ = lm(encoded_text)
    index = torch.argmax(prob_out[0, -1, :]).unsqueeze(0)
    encoded_text = torch.cat([encoded_text, index.unsqueeze(0)], dim=1)
encoded_text = encoded_text[0, 1:].tolist()  # drop the bos token
print(lm.tokenizer.decode(encoded_text))
```
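
Greedy argmax decoding is just one option; a sampling variant is a one-line
change (an assumption here: the model outputs log-probabilities, so they are
exponentiated before sampling):

```python
probs = prob_out[0, -1, :].exp()                 # back to probabilities
index = torch.multinomial(probs, num_samples=1)  # sample instead of argmax
```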

### Playing with the tokenizer only

In the same manner as for the language model, one can instantiate the tokenizer
on its own with the corresponding lobes in SpeechBrain.

```python
import speechbrain
from speechbrain.lobes.pretrained.librispeech.asr_crdnn_ctc_att_rnnlm.tokenizer import tokenizer

# HuggingFace paths to download the pretrained models
token_file = 'tokenizer/1000_unigram.model'
model_name = 'sb/asr-crdnn-librispeech'
save_dir = 'model_checkpoints'

text = "THE CAT IS ON THE TABLE"

tokenizer = tokenizer(token_file, model_name, save_dir)

# Tokenize!
print(tokenizer.spm.encode(text))
print(tokenizer.spm.encode(text, out_type='str'))
```
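
Since `tokenizer.spm` is a SentencePiece processor, the ids can be decoded back
to text as well:

```python
ids = tokenizer.spm.encode(text)
print(tokenizer.spm.decode(ids))  # "THE CAT IS ON THE TABLE"
```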

#### Referencing SpeechBrain

```
@misc{SB2021,
  author = {Ravanelli, Mirco and Parcollet, Titouan and Rouhe, Aku and Plantinga, Peter and Rastorgueva, Elena and Lugosch, Loren and Dawalatabad, Nauman and Ju-Chieh, Chou and Heba, Abdel and Grondin, Francois and Aris, William and Liao, Chien-Feng and Cornell, Samuele and Yeh, Sung-Lin and Na, Hwidong and Gao, Yan and Fu, Szu-Wei and Subakan, Cem and De Mori, Renato and Bengio, Yoshua},
  title = {SpeechBrain},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/speechbrain/speechbrain}},
}
```