steveheh committed · Commit 999faf8 · verified · 1 Parent(s): cce3132

Update README.md

Files changed (1)
  1. README.md +19 -17
README.md CHANGED
@@ -11,15 +11,15 @@ tags:
  - audio
  ---

- # Model Overview
- ## Description:
  The NEST framework is designed for speech self-supervised learning, which can be used as a frozen speech feature extractor or as weight initialization for downstream speech processing tasks. The NEST-L model has about 115M parameters and is trained on an English dataset of roughly 100K hours. <br>
  This model is ready for commercial/non-commercial use. <br>

- ### License:
  License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). By downloading the public and release version of the model, you accept the terms and conditions of the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.

- ## Reference:
  [1] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106) <br>
  [2] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo) <br>
  [3] [Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition](https://arxiv.org/abs/2312.17279) <br>
@@ -27,7 +27,7 @@ License to use this model is covered by the [CC-BY-4.0](https://creativecommons.
  [5] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656) <br>
  [6] [Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling](https://arxiv.org/abs/2307.07057)<br>

- ## Model Architecture:

  **Architecture Type:** NEST [1] <br>
 
@@ -38,31 +38,32 @@ License to use this model is covered by the [CC-BY-4.0](https://creativecommons.
  - Augmentor: Speaker/noise augmentation
  - Loss: Cross-entropy on masked positions <br>

- ## Input:
  **Input Type(s):** Audio <br>
  **Input Format(s):** wav files <br>
  **Input Parameters:** One-Dimensional (1D) <br>
  **Other Properties Related to Input:** 16000 Hz Mono-channel Audio <br>

- ## Output:
  **Output Type(s):** Audio features <br>
  **Output Format:** Audio embeddings <br>
  **Output Parameters:** Feature sequence (2D) <br>
  **Other Properties Related to Output:** Audio feature sequence of shape [D,T] <br>


- ## Model Version(s):
  `ssl_en_nest_large_v1.0` <br>


- ## How to Use the Model:
  The model is available for use in the NVIDIA NeMo Framework [2], and can be used as weight initialization for downstream tasks or as a frozen feature extractor.
- ### Loading the whole model:
  ```python
  from nemo.collections.asr.models import EncDecDenoiseMaskedTokenPredModel
  nest_model = EncDecDenoiseMaskedTokenPredModel.from_pretrained(model_name="nvidia/ssl_en_nest_large_v1.0")
  ```
- ### Using NEST encoder as weight initialization for downstream tasks:
  ```bash
  # use ASR as example:
  python <NeMo Root>/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py \
@@ -87,11 +88,12 @@ python <NeMo Root>/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py \
  exp_manager.wandb_logger_kwargs.project="<Name of project>"
  ```
  More details can be found at [maybe_init_from_pretrained_checkpoint()](https://github.com/NVIDIA/NeMo/blob/main/nemo/core/classes/modelPT.py#L1236).
- ### Using NEST as a frozen feature extractor:
  NEST can also be used as a frozen feature extractor for downstream tasks. For example, in the case of speaker verification, embeddings can be extracted from different layers of the NEST model, and a learned weighted combination of those embeddings can be used as input to the speaker verification model.
  Please refer to this example [script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_pretraining/downstream/speech_classification_mfa_train.py) and [config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/ssl/nest/multi_layer_feat/nest_titanet_small.yaml) for details.

- ### Extracting audio features from NEST

  NEST supports extracting audio features from multiple layers of its encoder:
  ```bash
@@ -105,7 +107,7 @@ python <NeMo Root>/scripts/ssl/extract_features.py \
  ```

  ## Training
- The [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo) [2] was used for training the model for two hundred epochs. Model is trained with this example [script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_pretraining/masked_token_pred_pretrain.py) and [config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/ssl/nest/nest_fast-conformer.yaml).
  ## Training Datasets
  - [LibriLight](https://github.com/facebookresearch/libri-light)
  - Data Collection Method: Human
@@ -118,7 +120,7 @@ The [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo) [2] was used for tra
  - Labeling Method: Hybrid: Automated, Human
  <br>

- ## Inference:
  **Engine:** NVIDIA NeMo <br>
  **Test Hardware:** <br>
  * A6000 <br>
@@ -167,7 +169,7 @@ Model | Intent Acc | SLURP F1
  ssl_en_nest_large_v1.0 | 89.79 | 79.61
  ssl_en_nest_xlarge_v1.0 | 89.04 | 80.31

- ## Software Integration:

  **Runtime Engine(s):**
  * [NeMo-2.0] <br>
@@ -187,7 +189,7 @@ ssl_en_nest_xlarge_v1.0 | 89.04 | 80.31
  * [Windows] <br>


- ## Ethical Considerations:
  NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

  For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards [Insert Link to Model Card++ here].
 
@@ -11,15 +11,15 @@
  - audio
  ---

+ # NVIDIA NEST Large En
+
  The NEST framework is designed for speech self-supervised learning, which can be used as a frozen speech feature extractor or as weight initialization for downstream speech processing tasks. The NEST-L model has about 115M parameters and is trained on an English dataset of roughly 100K hours. <br>
  This model is ready for commercial/non-commercial use. <br>

+ ### License
  License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). By downloading the public and release version of the model, you accept the terms and conditions of the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.

+ ## Reference
  [1] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106) <br>
  [2] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo) <br>
  [3] [Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition](https://arxiv.org/abs/2312.17279) <br>
 
@@ -27,7 +27,7 @@
  [5] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656) <br>
  [6] [Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling](https://arxiv.org/abs/2307.07057)<br>

+ ## Model Architecture

  **Architecture Type:** NEST [1] <br>
 
 
@@ -38,31 +38,32 @@
  - Augmentor: Speaker/noise augmentation
  - Loss: Cross-entropy on masked positions <br>
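
As an illustration of the "cross-entropy on masked positions" loss listed above, here is a minimal pure-Python sketch. The function name `masked_cross_entropy`, the list-based inputs, and the choice to average over masked frames are assumptions made for this example; the actual NeMo implementation may reduce the loss differently.

```python
import math


def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy over masked frames only (illustrative sketch).

    logits:  list of T frames, each a list of V unnormalized scores.
    targets: list of T integer token ids in [0, V).
    mask:    list of T booleans; True marks a masked frame that
             contributes to the loss.
    """
    total = 0.0
    count = 0
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue  # unmasked frames are ignored by the loss
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(frame_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in frame_logits))
        # Negative log-probability of the target token.
        total += log_norm - frame_logits[target]
        count += 1
    # Assumption: return 0.0 when nothing is masked, to avoid dividing by zero.
    return total / count if count else 0.0
```

Only frames flagged in `mask` contribute, so predictions on unmasked frames have no effect on the result.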
 
+ ### Input
  **Input Type(s):** Audio <br>
  **Input Format(s):** wav files <br>
  **Input Parameters:** One-Dimensional (1D) <br>
  **Other Properties Related to Input:** 16000 Hz Mono-channel Audio <br>
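
The input requirements above (wav, 16000 Hz, mono) can be checked before audio is passed to the model. The following is a minimal sketch using only Python's standard `wave` module; the helper name `is_nest_compatible_wav` is hypothetical and is not part of NeMo.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000  # Hz, as stated in the model card
EXPECTED_CHANNELS = 1         # mono, as stated in the model card


def is_nest_compatible_wav(path):
    """Return True if the WAV file at `path` is 16 kHz and mono.

    Hypothetical helper for illustration; not a NeMo API.
    """
    with wave.open(path, "rb") as wav_file:
        return (
            wav_file.getframerate() == EXPECTED_SAMPLE_RATE
            and wav_file.getnchannels() == EXPECTED_CHANNELS
        )
```

Files that fail this check would need to be resampled or downmixed with a tool of your choice before use.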
 
+ ### Output
  **Output Type(s):** Audio features <br>
  **Output Format:** Audio embeddings <br>
  **Output Parameters:** Feature sequence (2D) <br>
  **Other Properties Related to Output:** Audio feature sequence of shape [D,T] <br>


+ ## Model Version(s)
  `ssl_en_nest_large_v1.0` <br>


+ ## How to Use the Model
  The model is available for use in the NVIDIA NeMo Framework [2], and can be used as weight initialization for downstream tasks or as a frozen feature extractor.
+
+ ### Loading the whole model
  ```python
  from nemo.collections.asr.models import EncDecDenoiseMaskedTokenPredModel
  nest_model = EncDecDenoiseMaskedTokenPredModel.from_pretrained(model_name="nvidia/ssl_en_nest_large_v1.0")
  ```
+ ### Using NEST as weight initialization for downstream tasks
  ```bash
  # use ASR as example:
  python <NeMo Root>/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py \
 
@@ -87,11 +88,12 @@
  exp_manager.wandb_logger_kwargs.project="<Name of project>"
  ```
  More details can be found at [maybe_init_from_pretrained_checkpoint()](https://github.com/NVIDIA/NeMo/blob/main/nemo/core/classes/modelPT.py#L1236).
+
+ ### Using NEST as Frozen Feature Extractor
  NEST can also be used as a frozen feature extractor for downstream tasks. For example, in the case of speaker verification, embeddings can be extracted from different layers of the NEST model, and a learned weighted combination of those embeddings can be used as input to the speaker verification model.
  Please refer to this example [script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_pretraining/downstream/speech_classification_mfa_train.py) and [config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/ssl/nest/multi_layer_feat/nest_titanet_small.yaml) for details.
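
To make the "learned weighted combination" of layer embeddings more concrete, here is a minimal pure-Python sketch. The function names, the list-based feature layout, and the use of softmax to normalize the per-layer weights are assumptions for illustration; the linked NeMo script may combine layers differently, and in real training the logits would be learnable parameters rather than plain floats.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_layer_features(layer_features, layer_logits):
    """Weighted sum of per-layer features (illustrative sketch).

    layer_features: list of L feature sequences, each a list of T frames,
                    each frame a list of D floats (all layers share T and D).
    layer_logits:   list of L scalars, one per layer; softmax turns them
                    into weights that sum to 1 (an assumed normalization).
    Returns one feature sequence of T frames with D floats each.
    """
    if len(layer_features) != len(layer_logits):
        raise ValueError("need exactly one logit per layer")
    weights = softmax(layer_logits)
    num_frames = len(layer_features[0])
    dim = len(layer_features[0][0])
    combined = [[0.0] * dim for _ in range(num_frames)]
    for weight, features in zip(weights, layer_features):
        for t in range(num_frames):
            for d in range(dim):
                combined[t][d] += weight * features[t][d]
    return combined
```

With equal logits every layer contributes equally; raising one layer's logit shifts the combined features toward that layer.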
 
+ ### Extracting Audio Features from NEST

  NEST supports extracting audio features from multiple layers of its encoder:
  ```bash
 
@@ -105,7 +107,7 @@
  ```

  ## Training
+ The [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo) [2] was used for training the model. Model is trained with this example [script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_pretraining/masked_token_pred_pretrain.py) and [config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/ssl/nest/nest_fast-conformer.yaml).
  ## Training Datasets
  - [LibriLight](https://github.com/facebookresearch/libri-light)
  - Data Collection Method: Human
 
@@ -118,7 +120,7 @@
  - Labeling Method: Hybrid: Automated, Human
  <br>

+ ## Inference
  **Engine:** NVIDIA NeMo <br>
  **Test Hardware:** <br>
  * A6000 <br>
 
@@ -167,7 +169,7 @@
  ssl_en_nest_large_v1.0 | 89.79 | 79.61
  ssl_en_nest_xlarge_v1.0 | 89.04 | 80.31

+ ## Software Integration

  **Runtime Engine(s):**
  * [NeMo-2.0] <br>
 
@@ -187,7 +189,7 @@
  * [Windows] <br>


+ ## Ethical Considerations
  NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

  For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards [Insert Link to Model Card++ here].