language:
- kr
license: cc-by-4.0
library_name: nemo
datasets:
- RealCallData
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Citrinet1024
- NeMo
- pytorch
model-index:
- name: stt_kr_citrinet1024_PublicCallCenter_1000H_0.26
results: []
Model Overview
NVIDIA NeMo: Training
To train, fine-tune or play with the model you will need to install NVIDIA NeMo. We recommend you install it after you've installed latest Pytorch version.
pip install nemo_toolkit['all']
How to Use this Model
The model is available for use in the NeMo toolkit [1], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
Automatically instantiate the model
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained("ypluit/stt_kr_citrinet1024_PublicCallCenter_1000H_0.26")
Transcribing using Python
First, let's get a sample
get any korean telephone voice wave file
Then simply do:
asr_model.transcribe(['sample-kr.wav'])
Transcribing many audio files
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py pretrained_name="model" audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
Input
This model accepts 16000Hz Mono-channel Audio (wav files) as input.
Output
This model provides transcribed speech as a string for a given audio sample.
Model Architecture
See nemo toolkit and reference papers.
Training
Learned about 20 days on 2 A6000
Datasets
Private call center real data (1200hour)
Performance
0.26 WER
Limitations
This model was trained with 1200 hours of Korean telephone voice data for customer service in a call center. might be Poor performance for general-purpose dialogue and specific accents.