---
pipeline_tag: automatic-speech-recognition
library_name: transformers
tags:
- wav2vec2
- speech-recognition
- english-phoneme-recognition
---

# Wav2Vec2-Large-Robust ETRI Korean-English Pronunciation Model

This repository contains a Wav2Vec2-Large-Robust model fine-tuned for phoneme recognition. The model was trained and evaluated on our in-house dataset of English pronunciations by Korean learners, created with ETRI.

## Data Information
- **Dataset Name**: ETRI English Pronunciation of Korean Learners
- **Train Data**: 14,305 samples
- **Valid Data**: 1,590 samples
- **Test Data**: 3,974 samples

## Training Procedure
The model was fine-tuned for phoneme recognition using the Hugging Face `transformers` library. Training followed three steps:
1. Data preprocessing to align audio with phoneme labels.
2. Wav2Vec2-Large-Robust model fine-tuning with CTC loss.
3. Evaluation on validation and test sets.
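
A model trained with CTC loss emits one label per audio frame, including a special blank label, so its output has to be collapsed into a phoneme sequence before evaluation. The sketch below illustrates greedy CTC decoding in plain Python. It is not the code used in this repository; the toy vocabulary and `blank_id=0` are assumptions made for illustration only.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_phoneme, blank_id=0):
    """Collapse consecutive repeated frame labels, then drop CTC blanks."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return [id_to_phoneme[i] for i in collapsed if i != blank_id]


# Hypothetical toy vocabulary; the model's real vocabulary is not listed here.
vocab = {0: "<blank>", 1: "hh", 2: "eh", 3: "dd"}

# One predicted label id per audio frame (made-up values).
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]

phonemes = ctc_greedy_decode(frames, vocab)
print(" ".join(phonemes))  # hh eh dd
```

A blank between two identical labels keeps them separate, which is how CTC can represent a phoneme that genuinely occurs twice in a row.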

### Training Hyperparameters
- **Epochs**: 50
- **Learning Rate**: 0.0001
- **Warmup Ratio**: 0.1
- **Scheduler**: Linear
- **Batch Size**: 8
- **Loss Reduction**: Mean
- **Feature Extractor Freeze**: Enabled
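
To make the warmup ratio and linear scheduler concrete, the sketch below derives approximate step counts from the values listed above and evaluates a linear warmup followed by linear decay. It assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept; none of these are stated in this card, so the actual step counts may differ.

```python
import math

# Values taken from this model card.
TRAIN_SAMPLES = 14_305
BATCH_SIZE = 8
EPOCHS = 50
PEAK_LR = 1e-4
WARMUP_RATIO = 0.1

# Assumption: one optimizer step per batch, last partial batch included.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = round(total_steps * WARMUP_RATIO)


def linear_lr(step):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / (total_steps - warmup_steps)


print(steps_per_epoch, total_steps, warmup_steps)  # 1789 89450 8945
```

Under these assumptions, the learning rate rises from 0 to 0.0001 over roughly the first 8,945 steps and then decays linearly to 0 by the end of training.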

## Training Results
The following metrics were achieved during training:
- **Final Training Loss**: 0.2527
- **Validation Loss**: 0.4532
- **Word Error Rate (WER) on Validation Set**: 0.1617

## Test Results
The model was evaluated on the test dataset with the following performance:
- **Word Error Rate (WER)**: 0.1223

## Phoneme Data Example
Below is an example of how the dataset is structured for phoneme recognition tasks:

**Sample 1:**
- **Provided Sentence**: The one with the ribbon on its head
- **Correct Korean English Phonemes**: dh ah w ah n w ih dh ax r ih b ah n ao n ih t s hh eh dd
- **Predicted Phonemes**: d ah w ah n w ih dh ah r ih b ah n ao n ih ts hh eh dd
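
As an illustration of how an error rate can be computed for this sample, the sketch below measures the token-level edit distance between the two phoneme sequences. It assumes each space-separated phoneme is treated as one token, which is how a WER metric would behave on phoneme strings; the card does not state the exact scoring script, so this is not necessarily how the reported numbers were produced.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


# Phoneme strings copied from Sample 1.
reference = "dh ah w ah n w ih dh ax r ih b ah n ao n ih t s hh eh dd".split()
predicted = "d ah w ah n w ih dh ah r ih b ah n ao n ih ts hh eh dd".split()

errors = edit_distance(reference, predicted)
error_rate = errors / len(reference)
print(errors, len(reference), round(error_rate, 4))  # 4 22 0.1818
```

For this sample the four errors are `dh` to `d`, `ax` to `ah`, and the pair `t s` merged into the single token `ts` (one substitution plus one deletion).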

## Training Logs
TensorBoard logs are available for detailed training analysis:
- `events.out.tfevents.1732529747.oem-WS-C621E-SAGE-Series.2265579.0`
- `events.out.tfevents.1732573537.oem-WS-C621E-SAGE-Series.2265579.1`

Use the following command to visualize logs:
```bash
tensorboard --logdir=./logs/
```