smajumdar94 committed on
Commit
1257b2c
·
1 Parent(s): 941d5ee

Update README.md

  1. README.md +240 -0
README.md CHANGED
@@ -1,3 +1,243 @@
1
  ---
2
  license: cc-by-4.0
3
  ---
1
  ---
2
+ language:
3
+ - zh
4
+ library_name: nemo
5
+ datasets:
6
+ - aishell_2
7
+ thumbnail: null
8
+ tags:
9
+ - automatic-speech-recognition
10
+ - speech
11
+ - audio
12
+ - CTC
13
+ - Citrinet
14
+ - Transformer
15
+ - pytorch
16
+ - NeMo
17
+ - hf-asr-leaderboard
18
+ - Riva
19
  license: cc-by-4.0
20
+ model-index:
21
+ - name: stt_zh_citrinet_1024_gamma_0_25
22
+ results:
23
+ - task:
24
+ name: Automatic Speech Recognition
25
+ type: automatic-speech-recognition
26
+ dataset:
27
+ name: LibriSpeech (clean)
28
+ type: librispeech_asr
29
+ config: clean
30
+ split: test
31
+ args:
32
+ language: en
33
+ metrics:
34
+ - name: Test WER
35
+ type: wer
36
+ value: 2.2
37
+ - task:
38
+ name: Automatic Speech Recognition
39
+ type: automatic-speech-recognition
40
+ dataset:
41
+ name: LibriSpeech (other)
42
+ type: librispeech_asr
43
+ config: other
44
+ split: test
45
+ args:
46
+ language: en
47
+ metrics:
48
+ - name: Test WER
49
+ type: wer
50
+ value: 4.3
51
+ - task:
52
+ name: Automatic Speech Recognition
53
+ type: automatic-speech-recognition
54
+ dataset:
55
+ name: Multilingual LibriSpeech
56
+ type: facebook/multilingual_librispeech
57
+ config: english
58
+ split: test
59
+ args:
60
+ language: en
61
+ metrics:
62
+ - name: Test WER
63
+ type: wer
64
+ value: 7.2
65
+ - task:
66
+ name: Automatic Speech Recognition
67
+ type: automatic-speech-recognition
68
+ dataset:
69
+ name: Mozilla Common Voice 7.0
70
+ type: mozilla-foundation/common_voice_7_0
71
+ config: en
72
+ split: test
73
+ args:
74
+ language: en
75
+ metrics:
76
+ - name: Test WER
77
+ type: wer
78
+ value: 8.0
79
+ - task:
80
+ name: Automatic Speech Recognition
81
+ type: automatic-speech-recognition
82
+ dataset:
83
+ name: Mozilla Common Voice 8.0
84
+ type: mozilla-foundation/common_voice_8_0
85
+ config: en
86
+ split: test
87
+ args:
88
+ language: en
89
+ metrics:
90
+ - name: Test WER
91
+ type: wer
92
+ value: 9.48
93
+ - task:
94
+ name: Automatic Speech Recognition
95
+ type: automatic-speech-recognition
96
+ dataset:
97
+ name: Wall Street Journal 92
98
+ type: wsj_0
99
+ args:
100
+ language: en
101
+ metrics:
102
+ - name: Test WER
103
+ type: wer
104
+ value: 2.0
105
+ - task:
106
+ name: Automatic Speech Recognition
107
+ type: automatic-speech-recognition
108
+ dataset:
109
+ name: Wall Street Journal 93
110
+ type: wsj_1
111
+ args:
112
+ language: en
113
+ metrics:
114
+ - name: Test WER
115
+ type: wer
116
+ value: 2.9
117
+ - task:
118
+ name: Automatic Speech Recognition
119
+ type: automatic-speech-recognition
120
+ dataset:
121
+ name: National Singapore Corpus
122
+ type: nsc_part_1
123
+ args:
124
+ language: en
125
+ metrics:
126
+ - name: Test WER
127
+ type: wer
128
+ value: 7.0
129
  ---
130
+
131
+ # NVIDIA Streaming Citrinet 1024 (zh)
132
+
133
+ <style>
134
+ img {
135
+ display: inline;
136
+ }
137
+ </style>
138
+
139
+ | [![Model architecture](https://img.shields.io/badge/Model_Arch-Citrinet--CTC-lightgrey#model-badge)](#model-architecture)
140
+ | [![Model size](https://img.shields.io/badge/Params-140M-lightgrey#model-badge)](#model-architecture)
141
+ | [![Language](https://img.shields.io/badge/Language-zh--US-lightgrey#model-badge)](#datasets)
142
+ | [![Riva Compatible](https://img.shields.io/badge/NVIDIA%20Riva-compatible-brightgreen#model-badge)](#deployment-with-nvidia-riva) |
143
+
144
+
145
+ This model uses a character encoding scheme and transcribes text in the standard character set provided in the AISHELL-2 Mandarin corpus.
146
+ It is a non-autoregressive "large" variant of Citrinet, with around 140 million parameters.
147
+ See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#citrinet) for complete architecture details.
148
+ It is also compatible with NVIDIA Riva for [production-grade server deployments](#deployment-with-nvidia-riva).
149
+
150
+
151
+ ## Usage
152
+
153
+ The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
154
+
155
+ To train, fine-tune, or experiment with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest PyTorch version.
156
+
157
+ ```
158
+ pip install "nemo_toolkit[all]"
159
+ ```
160
+
161
+ ### Automatically instantiate the model
162
+
163
+ ```python
164
+ import nemo.collections.asr as nemo_asr
165
+ asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained("nvidia/stt_zh_citrinet_1024_gamma_0_25")
166
+ ```
167
+
168
+ ### Transcribing using Python
169
+ First, obtain a sample of spoken Mandarin Chinese as a 16 kHz mono WAV file.
170
+
171
+ Then transcribe it:
172
+ ```python
173
+ asr_model.transcribe(['<Path of audio file(s)>'])
174
+ ```
175
+
176
+ ### Transcribing many audio files
177
+
178
+ ```shell
179
+ python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
180
+   pretrained_name="nvidia/stt_zh_citrinet_1024_gamma_0_25" \
181
+   audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
182
+ ```
183
+
184
+ ### Input
185
+
186
+ This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
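
Before transcribing, it can help to confirm that a file matches the input format described above. The following is a minimal sketch using only Python's standard `wave` module. The function name `check_wav_format` and the two constants are hypothetical helpers introduced for illustration and are not part of NeMo.

```python
import wave

# Assumed input format, taken from the Input section: 16 kHz, mono.
EXPECTED_RATE_HZ = 16000
EXPECTED_CHANNELS = 1


def check_wav_format(path):
    """Return a list of format problems found in the WAV header (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        if rate != EXPECTED_RATE_HZ:
            problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ} Hz")
        if channels != EXPECTED_CHANNELS:
            problems.append(f"found {channels} channels, expected {EXPECTED_CHANNELS}")
    return problems
```

An empty list means the header matches. Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before being passed to the model.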
187
+
188
+ ### Output
189
+
190
+ This model provides transcribed speech as a string for a given audio sample.
191
+
192
+ ## Model Architecture
193
+
194
+ Citrinet is a non-autoregressive model [1] for automatic speech recognition that uses CTC loss and decoding instead of a Transducer. More detail on this model is available here: [Citrinet Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#citrinet).
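
To make the CTC decoding step concrete, here is a minimal sketch of greedy CTC decoding in plain Python. It assumes the per-frame argmax token ids are already available. The function name, the `blank_id` value, and the toy vocabulary are hypothetical and chosen for illustration only; NeMo performs this step internally when `transcribe` is called.

```python
def ctc_greedy_decode(frame_ids, blank_id, id_to_char):
    """Collapse repeated frame labels, then drop blanks (greedy CTC decoding)."""
    decoded = []
    previous = None
    for token_id in frame_ids:
        # Emit a character only when the label changes and is not the blank.
        if token_id != previous and token_id != blank_id:
            decoded.append(id_to_char[token_id])
        previous = token_id
    return "".join(decoded)


# Toy example: ids 0 and 1 are characters, id 2 is the assumed blank label.
vocab = {0: "你", 1: "好"}
text = ctc_greedy_decode([2, 0, 0, 2, 1, 1, 1, 2], blank_id=2, id_to_char=vocab)
```

Because the whole output comes from a single pass over the frames with no dependence on previously emitted characters, decoding is non-autoregressive.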
195
+
196
+ ## Training
197
+
198
+ The NeMo toolkit [3] was used to train the models for several hundred epochs. These models are trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_ctc/speech_to_text_ctc.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/citrinet/citrinet_1024.yaml).
199
+
200
+ The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
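
As a rough illustration of what a character-level vocabulary built from training transcripts looks like, the sketch below collects the unique non-whitespace characters from a list of transcript strings. This is not the behavior of the linked script, which is not reproduced here; `build_char_vocab` is a hypothetical helper and the sample transcripts are made up.

```python
def build_char_vocab(transcripts):
    """Return the sorted list of unique non-whitespace characters in the transcripts."""
    chars = set()
    for line in transcripts:
        chars.update(ch for ch in line if not ch.isspace())
    return sorted(chars)


# Made-up transcripts, used only to show the shape of the result.
vocab = build_char_vocab(["你好", "你 们好"])
```

On real Mandarin training text, a vocabulary built this way grows to thousands of entries, one per distinct character.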
201
+
202
+ ### Datasets
203
+
204
+ All the models in this collection are trained on Mandarin Chinese speech from the following dataset:
205
+
206
+ - AIShell 2
207
+
208
+ Note: older versions of the model may have been trained on a smaller set of datasets.
209
+
210
+ ## Performance
211
+
212
+ The available models in this collection are listed in the following table. ASR performance is reported as Word Error Rate (WER%) with greedy decoding.
213
+
214
+ | Version | Tokenizer | Vocabulary Size | Dev iOS | Test iOS | Dev Android | Test Android | Dev Mic | Test Mic | Train Dataset |
215
+ |---------|-----------|-----------------|---------|----------|-------------|--------------|---------|----------|---------------|
216
+ | 1.0.0 | Character | 5000+ | 4.8 | 5.1 | 5.2 | 5.5 | 5.2 | 5.5 | AIShell 2 |
217
+ | | | | | | | | | | |
218
+ | | | | | | | | | | |
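
For readers who want to compute a comparable error rate on their own data, here is a minimal sketch based on edit distance. It assumes each character counts as one token, which matches the character tokenizer listed above but is an assumption about how the reported numbers were scored. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate_percent(references, hypotheses):
    """Total edit errors divided by total reference length, as a percentage."""
    errors = sum(edit_distance(list(r), list(h)) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    if total == 0:
        raise ValueError("references contain no tokens")
    return 100.0 * errors / total
```

Errors are pooled over the whole test set before dividing, so long utterances weigh more than short ones.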
219
+
220
+ When deploying with [NVIDIA Riva](https://developer.nvidia.com/riva), you can combine this model with external language models to further improve WER. The WER (%) of the latest model with different language modeling techniques is reported in the following table.
221
+
222
+ ## Limitations
223
+
224
+ Since this model was trained on publicly available speech datasets, its performance might degrade on speech that includes technical terms or vernacular that the model has not been trained on. The model might also perform worse on accented speech.
225
+
226
+ ## Deployment with NVIDIA Riva
227
+
228
+ For the best real-time accuracy, latency, and throughput, deploy the model with [NVIDIA Riva](https://developer.nvidia.com/riva), an accelerated speech AI SDK deployable on-prem, in all clouds, multi-cloud, hybrid, at the edge, and embedded.
229
+ Additionally, Riva provides:
230
+
231
+ * World-class out-of-the-box accuracy for the most common languages with model checkpoints trained on proprietary data with hundreds of thousands of GPU-compute hours
232
+ * Best in class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization
233
+ * Streaming speech recognition, Kubernetes compatible scaling, and Enterprise-grade support
234
+
235
+ Check out the [Riva live demo](https://developer.nvidia.com/riva#demos).
236
+
237
+ ## References
238
+
239
+ - [1] [Citrinet: Closing the Gap between Non-Autoregressive and Autoregressive End-to-End Models for Automatic Speech Recognition](https://arxiv.org/abs/2104.01721)
240
+
241
+ - [2] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)
242
+
243
+ - [3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)