Merge branch 'main' of https://huggingface.co/projecte-aina/stt-ca-citrinet-512 into main
README.md
CHANGED
@@ -9,24 +9,97 @@ tags:
|
|
9 |
- automatic-speech-recognition
|
10 |
- speech
|
11 |
- audio
|
|
|
12 |
- citrinet
|
13 |
- pytorch
|
14 |
- NeMo
|
15 |
-
|
16 |
widget:
|
17 |
-
- example_title:
|
18 |
-
src: https://huggingface.co/projecte-aina/stt-ca-citrinet-512/samples/common_voice_ca_34667058.wav
|
19 |
---
|
20 |
|
21 |
-
# Aina Project's Catalan
|
22 |
## Model description
|
23 |
|
24 |
-
This model was fine-tuned from a pre-trained Spanish [stt-es-citrinet-512](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_es_citrinet_512) model using the [NeMo](https://github.com/NVIDIA/NeMo) toolkit
|
25 |
|
26 |
## Intended uses and limitations
|
27 |
|
28 |
-
You can use this model for Automatic Speech Recognition (ASR) in
|
29 |
|
30 |
|
31 |
## Additional information
|
32 |
|
@@ -39,12 +112,11 @@ For further information, send an email to [email protected]
|
|
39 |
### Copyright
|
40 |
Copyright (c) 2022 Text Mining Unit at Barcelona Supercomputing Center
|
41 |
|
42 |
-
|
43 |
### Licensing Information
|
44 |
-
[
|
45 |
|
46 |
### Funding
|
47 |
-
This work was funded by the [
|
48 |
|
49 |
|
50 |
## Disclaimer
|
|
|
9 |
- automatic-speech-recognition
|
10 |
- speech
|
11 |
- audio
|
12 |
+
- CTC
|
13 |
- citrinet
|
14 |
- pytorch
|
15 |
- NeMo
|
16 |
+
license: cc-by-4.0
|
17 |
widget:
|
18 |
+
- example_title: CV sample 1
|
19 |
+
src: https://huggingface.co/projecte-aina/stt-ca-citrinet-512/tree/main/samples/common_voice_ca_34667058.wav
|
20 |
+
model-index:
|
21 |
+
- name: stt-ca-citrinet-512
|
22 |
+
results:
|
23 |
+
- task:
|
24 |
+
name: Automatic Speech Recognition
|
25 |
+
type: automatic-speech-recognition
|
26 |
+
dataset:
|
27 |
+
name: Mozilla Common Voice 11.0
|
28 |
+
type: mozilla-foundation/common_voice_11_0
|
29 |
+
config: ca
|
30 |
+
split: test
|
31 |
+
args:
|
32 |
+
language: en
|
33 |
+
metrics:
|
34 |
+
- name: Test WER
|
35 |
+
type: wer
|
36 |
+
value: 6.684
|
37 |
---
|
38 |
|
39 |
+
# Aina Project's Catalan speech-to-text model
|
40 |
## Model description
|
41 |
|
42 |
+
This model transcribes Catalan audio samples to lowercase text without punctuation. It was fine-tuned from a pre-trained Spanish [stt-es-citrinet-512](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_es_citrinet_512) model using the [NeMo](https://github.com/NVIDIA/NeMo) toolkit. It has around 36.5M parameters and was trained on [Common Voice 11.0](https://commonvoice.mozilla.org/en/datasets).
|
43 |
|
44 |
## Intended uses and limitations
|
45 |
|
46 |
+
You can use this model for Automatic Speech Recognition (ASR) in Catalan, to transcribe audio files in Catalan to plain text without punctuation.
|
47 |
|
48 |
+
## How to use
|
49 |
+
### Usage
|
50 |
+
|
51 |
+
Required libraries:
|
52 |
+
|
53 |
+
```bash
|
54 |
+
pip install "nemo_toolkit[all]"
|
55 |
+
```
|
56 |
+
|
57 |
+
Clone the repository to download the model:
|
58 |
+
|
59 |
+
```bash
|
60 |
+
git clone https://huggingface.co/projecte-aina/stt-ca-citrinet-512
|
61 |
+
```
|
62 |
+
|
63 |
+
Assuming `NEMO_PATH` points to the downloaded `stt-ca-citrinet-512.nemo` file, you can run inference over a set of `.wav` files as follows:
|
64 |
+
|
65 |
+
```python
|
66 |
+
import nemo.collections.asr as nemo_asr

# Load the model
|
67 |
+
model = nemo_asr.models.EncDecCTCModel.restore_from(NEMO_PATH)
|
68 |
+
|
69 |
+
# Create a list pointing to the audio files
|
70 |
+
paths2audio_files = ["audio_1.wav", ..., "audio_n.wav"]
|
71 |
+
|
72 |
+
# Fix the batch size to whatever number suits your purpose
|
73 |
+
batch_size = 8
|
74 |
+
|
75 |
+
# Transcribe the audio files
|
76 |
+
transcriptions = model.transcribe(paths2audio_files=paths2audio_files,
|
77 |
+
batch_size=batch_size)
|
78 |
+
# Visualize the transcriptions
|
79 |
+
print(transcriptions)
|
80 |
+
|
81 |
+
```
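
The `paths2audio_files` list can also be built from a directory instead of being typed by hand. The following is a minimal sketch, assuming all recordings sit in a single folder and that `model.transcribe` returns one string per input in input order (this may differ between NeMo versions). `list_wav_files` and `pair_results` are hypothetical helper names and are not part of NeMo.

```python
from pathlib import Path


def list_wav_files(audio_dir: str) -> list[str]:
    """Return the sorted paths of all .wav files directly inside audio_dir."""
    return sorted(str(path) for path in Path(audio_dir).glob("*.wav"))


def pair_results(paths: list[str], transcriptions: list[str]) -> dict[str, str]:
    """Map each audio path to its transcription, checking that the lengths match."""
    if len(paths) != len(transcriptions):
        raise ValueError("expected one transcription per audio file")
    return dict(zip(paths, transcriptions))
```

For example, `paths2audio_files = list_wav_files("my_audio_dir")` (a placeholder directory name) could be passed to `model.transcribe`, and `pair_results(paths2audio_files, transcriptions)` would then give a lookup from file to text. Sorting the paths keeps the pairing reproducible across runs.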
|
82 |
+
|
83 |
+
## Training data
|
84 |
+
|
85 |
+
This model has been trained on the training split of the Catalan version of [Common Voice 11.0](https://commonvoice.mozilla.org/en/datasets).
|
86 |
+
|
87 |
+
## Training
|
88 |
+
### Data preparation
|
89 |
+
We processed [Common Voice 11.0](https://commonvoice.mozilla.org/en/datasets) using the NeMo toolkit. We used [get_commonvoice_data.py](https://github.com/NVIDIA/NeMo/blob/main/scripts/dataset_processing/get_commonvoice_data.py) to process the manifests and then performed a data cleaning step.
|
90 |
+
|
91 |
+
After cleaning the dataset and normalizing the `ñ` character to `ny`, we used the following charset to create the final NeMo manifests for training:
|
92 |
+
```python
|
93 |
+
['c', ' ', 'ó', 'g', 'a', 'o', 'ü', 'v', 'p', 't', "'", '—', 'f', 'k', 'à', 'ï', 'í', 'ú', 'd', 'l', 'z', 'é', 'w', 'm', 'r', 'n', 'y', '-', 'u', 'i', 'h', 'ç', 'e', '·', 'q', 'è', 'ò', 'j', 'x', 's', 'b']
|
94 |
+
```
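
The exact cleaning script is not included here, so the following is only a sketch of how a transcript could be normalized and checked against the charset above. It assumes lowercasing, Unicode NFC composition, and whitespace collapsing as cleaning steps. Only the `ñ` to `ny` mapping and the charset itself come from this README; `normalize_transcript` and `in_charset` are hypothetical names.

```python
import unicodedata

# Charset copied from the list above.
CHARSET = set([
    'c', ' ', 'ó', 'g', 'a', 'o', 'ü', 'v', 'p', 't', "'", '—', 'f', 'k',
    'à', 'ï', 'í', 'ú', 'd', 'l', 'z', 'é', 'w', 'm', 'r', 'n', 'y', '-',
    'u', 'i', 'h', 'ç', 'e', '·', 'q', 'è', 'ò', 'j', 'x', 's', 'b',
])


def normalize_transcript(text: str) -> str:
    """Lowercase, compose accents (NFC), map 'ñ' to 'ny', collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    text = text.replace("ñ", "ny")
    return " ".join(text.split())


def in_charset(text: str) -> bool:
    """Return True if every character of text belongs to CHARSET."""
    return all(char in CHARSET for char in text)


clean = normalize_transcript("La Núria  viu a Logroño")
print(clean)
print(in_charset(clean))
```

Transcripts for which `in_charset` returns `False` (for instance, those containing digits or punctuation outside the list) would need further cleaning or exclusion before being written to a manifest.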
|
95 |
+
|
96 |
+
### Training procedure
|
97 |
+
This model was trained starting from a pre-trained Spanish [stt-es-citrinet-512](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_es_citrinet_512) model. The initial learning rate was set to 0.005 and the minimum lr for weight decay was set to 1e-7.
|
98 |
+
|
99 |
+
The model was trained for 90 steps and then continued training for another 90 steps starting from a learning rate of 1e-4.
|
100 |
+
|
101 |
+
## Evaluation
|
102 |
+
On the test split of Common Voice 11.0, the model obtains a WER of 6.684.
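
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, not the evaluation code used for this model; `word_error_rate` is a hypothetical name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of five reference words.
print(word_error_rate("el gat dorm al sofà", "el gat dorm sofà"))  # 0.2
```

The function returns a fraction; multiply by 100 for a percentage. A corpus-level score is usually computed by summing the edit distances over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance values.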
|
103 |
|
104 |
## Additional information
|
105 |
|
|
|
112 |
### Copyright
|
113 |
Copyright (c) 2022 Text Mining Unit at Barcelona Supercomputing Center
|
114 |
|
|
|
115 |
### Licensing Information
|
116 |
+
[Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)
|
117 |
|
118 |
### Funding
|
119 |
+
This work was funded by the [Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
|
120 |
|
121 |
|
122 |
## Disclaimer
|