slplab/wav2vec2-large-robust_ETRI_Korean_english-pronunciation

	---
	pipeline_tag: automatic-speech-recognition
	library_name: transformers
	tags:
	- wav2vec2
	- speech-recognition
	- english-phoneme-recognition
	---

	# Wav2Vec2-Large-Robust ETRI Korean-English Pronunciation Model

	This repository contains a Wav2Vec2-Large-Robust model fine-tuned for English phoneme recognition. The model was trained and evaluated on our in-house dataset of English speech produced by Korean learners, which was created with ETRI.
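A minimal inference sketch follows. It assumes the repository ships tokenizer and feature-extractor files alongside the weights, that the repo ID is `slplab/wav2vec2-large-robust_ETRI_Korean_english-pronunciation` as shown on the Hub page, and that the decoded text is a space-separated phoneme string like the sample in this card. The helper names `to_phoneme_list` and `recognize_phonemes` are hypothetical.

```python
from typing import List

# Repo ID as listed on the Hub page for this card.
MODEL_ID = "slplab/wav2vec2-large-robust_ETRI_Korean_english-pronunciation"


def to_phoneme_list(decoded_text: str) -> List[str]:
    """Split a decoded string into phoneme tokens (assumes space-separated output)."""
    return decoded_text.split()


def recognize_phonemes(audio_path: str) -> List[str]:
    """Transcribe one audio file into a list of phoneme tokens.

    Assumption: the repo includes processor files, so `pipeline` can load it directly.
    """
    # Imported lazily so `to_phoneme_list` works without transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=MODEL_ID)
    result = asr(audio_path)  # a dict with a "text" field
    return to_phoneme_list(result["text"])
```

Calling `recognize_phonemes("sample.wav")` (a hypothetical file) would return a list of phoneme tokens. Wav2Vec2 models expect 16 kHz audio, so recordings at other sampling rates need to be resampled first.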

	## Data Information
	- Dataset Name: ETRI English Pronunciation of Korean Learners
	- Training Data: 14,305 samples
	- Validation Data: 1,590 samples
	- Test Data: 3,974 samples
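As a quick check on the split sizes, the short sketch below computes each split's share of the full dataset from the counts listed above. Nothing beyond those three counts is assumed.

```python
# Sample counts as listed in this card.
splits = {"train": 14305, "valid": 1590, "test": 3974}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

for name, share in shares.items():
    print(f"{name}: {splits[name]} samples ({share:.1%})")
print(f"total: {total} samples")
```

The counts add up to 19,869 samples, which corresponds to roughly a 72% / 8% / 20% train / validation / test split.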

	## Training Procedure
	The model was fine-tuned for phoneme recognition using the Hugging Face `transformers` library. Below are the training steps:
	1. Data preprocessing to align audio with phoneme labels.
	2. Wav2Vec2-Large-Robust model fine-tuning with CTC loss.
	3. Evaluation on validation and test sets.
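Step 2 relies on CTC, which lets the model emit one label per audio frame without frame-level alignments. At inference time, the frame-level predictions are typically turned into a phoneme sequence by collapsing repeated labels and dropping the blank symbol. The sketch below illustrates that greedy decoding rule in plain Python. The toy vocabulary, the frame IDs, and the function name `ctc_greedy_decode` are illustrative assumptions, not values taken from this model.

```python
from itertools import groupby
from typing import Dict, List, Sequence


def ctc_greedy_decode(
    frame_ids: Sequence[int],
    id_to_token: Dict[int, str],
    blank_id: int = 0,
) -> List[str]:
    """Collapse consecutive repeats, then drop blanks (greedy CTC decoding)."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return [id_to_token[i] for i in collapsed if i != blank_id]


# Toy vocabulary (hypothetical): ID 0 plays the role of the CTC blank.
vocab = {0: "<pad>", 1: "dh", 2: "ah", 3: "w", 4: "n"}

# Per-frame argmax IDs for a short utterance (hypothetical).
frames = [1, 1, 0, 2, 2, 2, 0, 3, 0, 2, 0, 0, 4, 4]

print(ctc_greedy_decode(frames, vocab))  # ['dh', 'ah', 'w', 'ah', 'n']
```

In the `transformers` Wav2Vec2 implementation, the padding token doubles as the CTC blank, which is why the toy vocabulary labels ID 0 as `<pad>`.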

	### Training Hyperparameters
	- Epochs: 50
	- Learning Rate: 0.0001
	- Warmup Ratio: 0.1
	- Scheduler: Linear
	- Batch Size: 8
	- Loss Reduction: Mean
	- Feature Extractor Freeze: Enabled
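The hyperparameters above could map onto the `transformers` training API roughly as sketched below. This is not the original training script. It assumes the base checkpoint is `facebook/wav2vec2-large-robust`, that the listed batch size is per device, and that training ran on a single device without gradient accumulation. The names `HYPERPARAMS`, `estimate_steps`, and `build_model_and_args` are hypothetical.

```python
import math

# Values copied from the hyperparameter list in this card.
HYPERPARAMS = {
    "num_train_epochs": 50,
    "learning_rate": 1e-4,
    "warmup_ratio": 0.1,
    "lr_scheduler_type": "linear",
    "per_device_train_batch_size": 8,  # assumption: batch size is per device
}

TRAIN_SAMPLES = 14305  # training set size listed in this card


def estimate_steps(num_samples, batch_size, epochs, warmup_ratio):
    """Estimate (total optimizer steps, warmup steps).

    Assumes one device and no gradient accumulation.
    """
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * epochs
    return total_steps, math.ceil(total_steps * warmup_ratio)


def build_model_and_args(output_dir, vocab_size, pad_token_id,
                         base_checkpoint="facebook/wav2vec2-large-robust"):
    """Create a CTC model and training arguments matching the listed settings."""
    # Imported lazily so the step estimate runs without transformers installed.
    from transformers import TrainingArguments, Wav2Vec2ForCTC

    model = Wav2Vec2ForCTC.from_pretrained(
        base_checkpoint,            # assumption: exact base checkpoint is not stated
        ctc_loss_reduction="mean",  # "Loss Reduction: Mean"
        pad_token_id=pad_token_id,
        vocab_size=vocab_size,
    )
    model.freeze_feature_encoder()  # "Feature Extractor Freeze: Enabled"
    args = TrainingArguments(output_dir=output_dir, **HYPERPARAMS)
    return model, args


total_steps, warmup_steps = estimate_steps(
    TRAIN_SAMPLES,
    HYPERPARAMS["per_device_train_batch_size"],
    HYPERPARAMS["num_train_epochs"],
    HYPERPARAMS["warmup_ratio"],
)
print(total_steps, warmup_steps)  # 89450 8945
```

Under these assumptions, the learning rate would ramp up linearly over about the first 8,945 steps and then decay linearly to zero over the remaining steps.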

	## Training Results
	The following metrics were achieved during training:
	- Final Training Loss: 0.2527
	- Validation Loss: 0.4532
	- Word Error Rate (WER) on Validation Set: 0.1617

	## Test Results
	The model was evaluated on the test dataset with the following performance:
	- Word Error Rate (WER): 0.1223

	## Phoneme Data Example
	Below is an example of a dataset sample for phoneme recognition, shown together with the model's prediction:

	Sample 1:
	- Provided Sentence: The one with the ribbon on its head
	- Correct Korean English Phonemes: dh ah w ah n w ih dh ax r ih b ah n ao n ih t s hh eh dd
	- Predicted Phonemes: d ah w ah n w ih dh ah r ih b ah n ao n ih ts hh eh dd
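To make the comparison in Sample 1 concrete, the sketch below computes a token-level edit distance between the reference and predicted phoneme sequences and divides it by the reference length. Whether the WER figures reported in this card were computed in exactly this way is an assumption. The function name `edit_distance` is hypothetical.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over tokens (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


# Phoneme strings copied from Sample 1.
reference = "dh ah w ah n w ih dh ax r ih b ah n ao n ih t s hh eh dd".split()
predicted = "d ah w ah n w ih dh ah r ih b ah n ao n ih ts hh eh dd".split()

errors = edit_distance(reference, predicted)
error_rate = errors / len(reference)
print(f"{errors} errors over {len(reference)} reference phonemes -> {error_rate:.4f}")
# 4 errors over 22 reference phonemes -> 0.1818
```

The four errors are `dh` predicted as `d`, `ax` predicted as `ah`, and the pair `t s` merged into a single `ts` token, which counts as one substitution plus one deletion.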

	## Training Logs
	TensorBoard logs are available for detailed training analysis:
	- `events.out.tfevents.1732529747.oem-WS-C621E-SAGE-Series.2265579.0`
	- `events.out.tfevents.1732573537.oem-WS-C621E-SAGE-Series.2265579.1`

	Use the following command to visualize logs:
	```bash
	tensorboard --logdir=./logs/