---
license: cc-by-4.0
library_name: nemo
tags:
- pytorch
- NeMo
---

# Uzbek-speaker-verification-v4


## How to Use this Model

The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Automatically instantiate the model

**NOTE**: Please update the model class below to match the class of the model being uploaded.

```python
from nemo.core import ModelPT
model = ModelPT.from_pretrained("ai-nightcoder/uzbek-speaker-verification-v4")
```
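
The card does not yet describe how verification decisions are made. As a rough sketch only, speaker-verification systems commonly compare two fixed-length speaker embeddings with cosine similarity and accept the pair when the score clears a threshold. The example below assumes that approach; the function names, the toy embeddings, and the `0.7` threshold are all hypothetical and are not taken from this model or from NeMo.

```python
import math


def cosine_similarity(emb_a, emb_b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(x * x for x in emb_a))
    norm_b = math.sqrt(sum(y * y for y in emb_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.7):
    """Accept the pair when the similarity reaches the threshold.

    The default threshold is a placeholder (assumption); in practice it
    would be tuned on held-out verification trials.
    """
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy embeddings (hypothetical); real ones would come from the model.
enrolled = [0.2, 0.9, 0.1]
test_utterance = [0.25, 0.85, 0.05]
print(same_speaker(enrolled, test_utterance))
```

The threshold trades false accepts against false rejects, so it would normally be chosen on a development set rather than fixed in advance.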

### NOTE

    Add some information about how to use the model here. An example is provided for ASR inference below.

    ### Transcribing using Python
    First, let's get a sample
    ```
    wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
    ```
    Then simply do:
    ```
    asr_model.transcribe(['2086-149220-0033.wav'])
    ```

    ### Transcribing many audio files

    ```shell
    python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py pretrained_name="ai-nightcoder/uzbek-speaker-verification-v4" audio_dir=""
    ```

### Input

**Add some information about what the inputs to this model are.**

### Output

**Add some information about what the outputs of this model are.**

## Model Architecture

**Add information here discussing architectural details of the model or any comments to users about the model.**

## Training

**Add information here about how the model was trained. It should be as detailed as possible, potentially including a link to the training script as well as the base config used to train the model. If additional scripts were used to prepare components of the model, please include them here.**

### NOTE

    An example is provided below for ASR

    The NeMo toolkit [3] was used to train the models for several hundred epochs. These models were trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml).

    The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).


### Datasets

**Try to provide as detailed a list of datasets as possible. If possible, provide links to the datasets on HF by adding them to the manifest section at the top of the README (marked by ---).**

### NOTE

    An example for the manifest section is provided below for ASR datasets

    datasets:
    - librispeech_asr
    - fisher_corpus
    - Switchboard-1
    - WSJ-0
    - WSJ-1
    - National-Singapore-Corpus-Part-1
    - National-Singapore-Corpus-Part-6
    - vctk
    - voxpopuli
    - europarl
    - multilingual_librispeech
    - mozilla-foundation/common_voice_8_0
    - MLCommons/peoples_speech

    The corresponding text in this section for those datasets is stated below -

    The model was trained on 64K hours of English speech collected and prepared by NVIDIA NeMo and Suno teams.

    The training dataset consists of a private subset with 40K hours of English speech plus 24K hours from the following public datasets:

    - Librispeech 960 hours of English speech
    - Fisher Corpus
    - Switchboard-1 Dataset
    - WSJ-0 and WSJ-1
    - National Speech Corpus (Part 1, Part 6)
    - VCTK
    - VoxPopuli (EN)
    - Europarl-ASR (EN)
    - Multilingual Librispeech (MLS EN) - 2,000 hour subset
    - Mozilla Common Voice (v7.0)
    - People's Speech  - 12,000 hour subset


## Performance

**Add information here about the performance of the model. Discuss which metric is used to evaluate the model, and if there are external links explaining a custom metric, please link to them.

### NOTE

    An example is provided below for ASR metrics list that can be added to the top of the README
    
    model-index:
    - name: PUT_MODEL_NAME
      results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: AMI (Meetings test)
          type: edinburghcstr/ami
          config: ihm
          split: test
          args:
            language: en
        metrics:
        - name: Test WER
          type: wer
          value: 17.10
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Earnings-22
          type: revdotcom/earnings22
          split: test
          args:
            language: en
        metrics:
        - name: Test WER
          type: wer
          value: 14.11

Provide any caveats about the results at the top of the discussion so that nuance is not lost.

It should ideally be in a tabular format (you can use the following website to make your tables in markdown format - https://www.tablesgenerator.com/markdown_tables)**
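
The ASR example above reports `wer` values. As a minimal sketch of how such a number is usually obtained, word error rate can be computed as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the sample sentences below are illustrative assumptions, not the evaluation code behind any reported figure.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref_words)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {100 * wer:.2f}%")
```

Model cards typically report the value as a percentage, which is why the result is multiplied by 100 before printing.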

## Limitations

**Discuss any practical limitations of the model when used in real-world cases. These can also include legal disclaimers or discussion of the model's safety (particularly in the case of LLMs).**


### Note

    An example is provided below 

    Since this model was trained on publicly available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.