---
license: apache-2.0
language:
- ko
library_name: nemo
pipeline_tag: automatic-speech-recognition
tags:
- conformer-ctc
metrics:
- wer
---
# Conformer-ctc-medium-ko
ํ•ด๋‹น ๋ชจ๋ธ์€ [RIVA Conformer ASR Korean](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_ko_kr_conformer)์„ AI hub dataset์— ๋Œ€ํ•ด ํŒŒ์ธํŠœ๋‹์„ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค. <br>
Conformer ๊ธฐ๋ฐ˜์˜ ๋ชจ๋ธ์€ whisper์™€ ๊ฐ™์€ attention ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๊ณผ ๋‹ฌ๋ฆฌ streaming์„ ์ง„ํ–‰ํ•˜์—ฌ๋„ ์„ฑ๋Šฅ์ด ํฌ๊ฒŒ ๋–จ์–ด์ง€์ง€ ์•Š๊ณ , ์†๋„๊ฐ€ ๋น ๋ฅด๋‹ค๋Š” ์žฅ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.<br>
V100 GPU์—์„œ๋Š” RTF๊ฐ€ 0.05, CPU(7 cores)์—์„œ๋Š” 0.35 ์ •๋„ ๋‚˜์˜ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.<br>
์˜ค๋””์˜ค chunk size 2์ดˆ์˜ streaming ํ…Œ์ŠคํŠธ์—์„œ๋Š” ์ „์ฒด ์˜ค๋””์˜ค๋ฅผ ๋„ฃ๋Š” ๊ฒƒ์— ๋น„ํ•ด์„œ๋Š” 20% ์ •๋„ ์„ฑ๋Šฅ์ €ํ•˜๊ฐ€ ์žˆ์œผ๋‚˜ ์ถฉ๋ถ„ํžˆ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ์„ฑ๋Šฅ์ž…๋‹ˆ๋‹ค.<br>
์ถ”๊ฐ€๋กœ open domain์ด ์•„๋‹Œ ๊ณ ๊ฐ ์‘๋Œ€ ์Œ์„ฑ๊ณผ ๊ฐ™์€ domain์—์„œ๋Š” kenlm์„ ์ถ”๊ฐ€ํ•˜์˜€์„ ๋•Œ WER 13.45์—์„œ WER 5.27๋กœ ํฌ๊ฒŒ ์„ฑ๋Šฅ ํ–ฅ์ƒ์ด ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.<br>
ํ•˜์ง€๋งŒ ๊ทธ ์™ธ์˜ domain์—์„œ๋Š” kenlm์˜ ์ถ”๊ฐ€๊ฐ€ ํฐ ์„ฑ๋Šฅ ํ–ฅ์ƒ์œผ๋กœ ์ด์–ด์ง€์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.


### Dataset

| Dataset | Samples (train/test) |
| --- | --- |
| ๊ณ ๊ฐ์‘๋Œ€์Œ์„ฑ (customer-service speech) | 2,067,668 / 21,092 |
| ํ•œ๊ตญ์–ด ์Œ์„ฑ (Korean speech) | 620,000 / 3,000 |
| ํ•œ๊ตญ์ธ ๋Œ€ํ™” ์Œ์„ฑ (Korean conversational speech) | 2,483,570 / 142,399 |
| ์ž์œ ๋Œ€ํ™”์Œ์„ฑ(์ผ๋ฐ˜๋‚จ๋…€) (free conversation, general adults) | 1,886,882 / 263,371 |
| ๋ณต์ง€ ๋ถ„์•ผ ์ฝœ์„ผํ„ฐ ์ƒ๋‹ด๋ฐ์ดํ„ฐ (welfare call-center counseling) | 1,096,704 / 206,470 |
| ์ฐจ๋Ÿ‰๋‚ด ๋Œ€ํ™” ๋ฐ์ดํ„ฐ (in-vehicle dialogue) | 2,624,132 / 332,787 |
| ๋ช…๋ น์–ด ์Œ์„ฑ(๋…ธ์ธ๋‚จ์—ฌ) (voice commands, elderly speakers) | 137,467 / 237,469 |
| Total | 10,916,423 (13,946 h) / 1,206,588 (1,474 h) |


## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 16
- num_train_epoch: 1
- sample_rate: 16000
- max_duration: 20.0
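
At the 16 kHz sample rate above, the 2-second streaming chunks used in the test mentioned earlier are 32,000 samples each. A minimal chunking sketch (the `chunk_samples` helper is illustrative; the actual streaming pipeline is NeMo's):

```python
def chunk_samples(samples, sample_rate=16000, chunk_seconds=2.0):
    """Split a 1-D sequence of audio samples into fixed-length streaming chunks."""
    size = int(sample_rate * chunk_seconds)
    return [samples[i:i + size] for i in range(0, len(samples), size)]

audio = [0.0] * 80000              # 5 s of silence at 16 kHz
chunks = chunk_samples(audio)
print([len(c) for c in chunks])    # [32000, 32000, 16000]
```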

### Training results

| Training Loss | Epoch | WER     |
|:-------------:|:-----:|:-------:|
| 9.09          |  1.0  | 11.51   |
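
The WER values in this card are percentages: word-level edit distance divided by the number of reference words. A minimal sketch of the metric (for real evaluations a library implementation would normally be used):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # one insertion over 3 words ≈ 33.33
```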