Rifat Mamayusupov
committed on
Update README.md
README.md
CHANGED
@@ -8,60 +8,15 @@ tags:
|
|
8 |
|
9 |
# Speaker-verification-v2
|
10 |
|
11 |
-
<style>
|
12 |
-
img {
|
13 |
-
display: inline;
|
14 |
-
}
|
15 |
-
</style>
|
16 |
-
|
17 |
-
[![Model architecture](https://img.shields.io/badge/Model_Arch-PUT-YOUR-ARCHITECTURE-HERE-lightgrey#model-badge)](#model-architecture)
|
18 |
-
| [![Model size](https://img.shields.io/badge/Params-PUT-YOUR-MODEL-SIZE-HERE-lightgrey#model-badge)](#model-architecture)
|
19 |
-
| [![Language](https://img.shields.io/badge/Language-PUT-YOUR-LANGUAGE-HERE-lightgrey#model-badge)](#datasets)
|
20 |
-
|
21 |
-
**Put a short model description here.**
|
22 |
-
|
23 |
-
See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/index.html) for complete architecture details.
|
24 |
-
|
25 |
-
|
26 |
-
## NVIDIA NeMo: Training
|
27 |
-
|
28 |
-
To train, fine-tune, or experiment with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest PyTorch version.
|
29 |
-
```
|
30 |
-
pip install nemo_toolkit['all']
|
31 |
-
```
|
32 |
|
33 |
## How to Use this Model
|
34 |
|
35 |
-
The model is available for use in the
|
36 |
|
37 |
### Automatically instantiate the model
|
38 |
|
39 |
**NOTE**: Please update the model class below to match the class of the model being uploaded.
|
40 |
|
41 |
-
```python
|
42 |
-
from nemo.core import ModelPT
|
43 |
-
model = ModelPT.from_pretrained("ai-nightcoder/speaker-verification-v2")
|
44 |
-
```
|
45 |
-
|
46 |
-
### NOTE
|
47 |
-
|
48 |
-
Add some information about how to use the model here. An example is provided for ASR inference below.
|
49 |
-
|
50 |
-
### Transcribing using Python
|
51 |
-
First, let's get a sample
|
52 |
-
```
|
53 |
-
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
|
54 |
-
```
|
55 |
-
Then simply do:
|
56 |
-
```
|
57 |
-
asr_model.transcribe(['2086-149220-0033.wav'])
|
58 |
-
```
|
59 |
-
|
60 |
-
### Transcribing many audio files
|
61 |
-
|
62 |
-
```shell
|
63 |
-
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py pretrained_name="ai-nightcoder/speaker-verification-v2" audio_dir=""
|
64 |
-
```
|
65 |
|
66 |
### Input
|
67 |
|
@@ -79,14 +34,6 @@ model = ModelPT.from_pretrained("ai-nightcoder/speaker-verification-v2")
|
|
79 |
|
80 |
**Add information here about how the model was trained. It should be as detailed as possible, potentially including the link to the script used to train as well as the base config used to train the model. If extraneous scripts are used to prepare the components of the model, please include them here.**
|
81 |
|
82 |
-
### NOTE
|
83 |
-
|
84 |
-
An example is provided below for ASR
|
85 |
-
|
86 |
-
The NeMo toolkit [3] was used to train the models for several hundred epochs. These models were trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml).
|
87 |
-
|
88 |
-
The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
|
89 |
-
|
90 |
|
91 |
### Datasets
|
92 |
|
@@ -113,8 +60,6 @@ model = ModelPT.from_pretrained("ai-nightcoder/speaker-verification-v2")
|
|
113 |
|
114 |
The corresponding text in this section for those datasets is stated below -
|
115 |
|
116 |
-
The model was trained on 64K hours of English speech collected and prepared by NVIDIA NeMo and Suno teams.
|
117 |
-
|
118 |
The training dataset consists of a private subset with 40K hours of English speech plus 24K hours from the following public datasets:
|
119 |
|
120 |
- Librispeech 960 hours of English speech
|
@@ -171,7 +116,6 @@ model = ModelPT.from_pretrained("ai-nightcoder/speaker-verification-v2")
|
|
171 |
|
172 |
Provide any caveats about the results presented in the top of the discussion so that nuance is not lost.
|
173 |
|
174 |
-
It should ideally be in a tabular format (you can use the following website to make your tables in markdown format - https://www.tablesgenerator.com/markdown_tables)**
|
175 |
|
176 |
## Limitations
|
177 |
|
@@ -185,12 +129,3 @@ It should ideally be in a tabular format (you can use the following website to m
|
|
185 |
Since this model was trained on publicly available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.
|
186 |
|
187 |
|
188 |
-
## License
|
189 |
-
|
190 |
-
License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). By downloading the public and release version of the model, you accept the terms and conditions of the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
|
191 |
-
|
192 |
-
## References
|
193 |
-
|
194 |
-
**Provide appropriate references in the markdown link format below. Please order them numerically.**
|
195 |
-
|
196 |
-
[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
|
|
|
8 |
|
9 |
# Speaker-verification-v2
|
10 |
|
11 |
|
12 |
## How to Use this Model
|
13 |
|
14 |
+
The model is available for use in the and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
|
15 |
|
16 |
### Automatically instantiate the model
|
17 |
|
18 |
**NOTE**: Please update the model class below to match the class of the model being uploaded.
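A minimal sketch of what matching the model class could look like for a speaker-verification checkpoint. It assumes the uploaded checkpoint is compatible with NeMo's `EncDecSpeakerLabelModel`, which this README does not confirm. The helper names `load_speaker_model` and `same_speaker`, and the `.wav` file names in the usage note, are hypothetical and introduced only for illustration.

```python
# Repository name taken from the diff header above.
MODEL_ID = "ai-nightcoder/speaker-verification-v2"


def load_speaker_model(model_id: str = MODEL_ID):
    """Load the checkpoint with a speaker-model class.

    Assumption: the checkpoint is an EncDecSpeakerLabelModel. If it was
    saved with a different class, substitute that class here.
    """
    # Imported lazily so this module can be imported without NeMo installed.
    from nemo.collections.asr.models import EncDecSpeakerLabelModel

    return EncDecSpeakerLabelModel.from_pretrained(model_id)


def same_speaker(model, wav_a: str, wav_b: str) -> bool:
    """Return True if the model judges both recordings to be the same speaker."""
    # verify_speakers compares the two files' embeddings against the
    # model's default decision threshold.
    return bool(model.verify_speakers(wav_a, wav_b))
```

With NeMo installed, this could be used as `model = load_speaker_model()` followed by `same_speaker(model, "enroll.wav", "test.wav")`, where both paths are placeholders for your own recordings.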
|
19 |
|
20 |
|
21 |
### Input
|
22 |
|
|
|
34 |
|
35 |
**Add information here about how the model was trained. It should be as detailed as possible, potentially including the link to the script used to train as well as the base config used to train the model. If extraneous scripts are used to prepare the components of the model, please include them here.**
|
36 |
|
37 |
|
38 |
### Datasets
|
39 |
|
|
|
60 |
|
61 |
The corresponding text in this section for those datasets is stated below -
|
62 |
|
|
|
|
|
63 |
The training dataset consists of a private subset with 40K hours of English speech plus 24K hours from the following public datasets:
|
64 |
|
65 |
- Librispeech 960 hours of English speech
|
|
|
116 |
|
117 |
Provide any caveats about the results presented in the top of the discussion so that nuance is not lost.
|
118 |
|
|
|
119 |
|
120 |
## Limitations
|
121 |
|
|
|
129 |
Since this model was trained on publicly available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.
|
130 |
|
131 |
|