hubert-emotion

Model Details

HuBERT (Hidden-Unit BERT) is a speech representation learning model proposed by Facebook. Unlike conventional speech recognition models, HuBERT uses self-supervised learning to learn directly from the raw waveform of the speech signal.

This model uses https://huggingface.co/team-lucid/hubert-base-korean as its base model.

How to Get Started with the Model

PyTorch

import torch
import librosa
import pytorch_lightning as pl
from torch import nn
from transformers import AutoConfig, AutoFeatureExtractor, HubertForSequenceClassification

class MyLitModel(pl.LightningModule):
    def __init__(self, audio_model_name, num_label2s, n_layers=1, projector=True, classifier=True, dropout=0.07, lr_decay=1):
        super(MyLitModel, self).__init__()
        self.config = AutoConfig.from_pretrained(audio_model_name)
        self.config.output_hidden_states = True
        self.audio_model = HubertForSequenceClassification.from_pretrained(audio_model_name, config=self.config)
        self.label2_classifier = nn.Linear(self.audio_model.config.hidden_size, num_label2s)
        self.intensity_regressor = nn.Linear(self.audio_model.config.hidden_size, 1)

    def forward(self, audio_values, audio_attn_mask=None):
        outputs = self.audio_model(input_values=audio_values, attention_mask=audio_attn_mask)
        label2_logits = self.label2_classifier(outputs.hidden_states[-1][:, 0, :])
        intensity_preds = self.intensity_regressor(outputs.hidden_states[-1][:, 0, :]).squeeze(-1)
        return label2_logits, intensity_preds

# Model settings
audio_model_name = "team-lucid/hubert-base-korean"
NUM_LABELS = 7
SAMPLING_RATE = 16000

# Load the Hubert model
pretrained_model_path = ""  # path to the model checkpoint
hubert_model = MyLitModel.load_from_checkpoint(
    pretrained_model_path,
    audio_model_name=audio_model_name,
    num_label2s=NUM_LABELS,
)
hubert_model.eval()
hubert_model.to("cuda" if torch.cuda.is_available() else "cpu")

# Load the feature extractor
feature_extractor = AutoFeatureExtractor.from_pretrained(audio_model_name)

# Process the audio file
audio_path = ""  # path to the audio file to process
audio_np, _ = librosa.load(audio_path, sr=SAMPLING_RATE, mono=True)
inputs = feature_extractor(raw_speech=audio_np, return_tensors="pt", sampling_rate=SAMPLING_RATE)
audio_values = inputs["input_values"].to(hubert_model.device)
audio_attn_mask = inputs.get("attention_mask", None)
if audio_attn_mask is not None:
    audio_attn_mask = audio_attn_mask.to(hubert_model.device)

# Emotion analysis
with torch.no_grad():
    if audio_attn_mask is None:
        label2_logits, intensity_preds = hubert_model(audio_values)
    else:
        label2_logits, intensity_preds = hubert_model(audio_values, audio_attn_mask)

emotion_label = torch.argmax(label2_logits, dim=-1).item()
emotion_intensity = intensity_preds.item()

print(f"Emotion Label: {emotion_label}, Emotion Intensity: {emotion_intensity}")
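
The raw outputs above are an integer class index and an unbounded regression value. As a minimal sketch of how they could be made easier to read, the snippet below converts the logits to probabilities and clamps the intensity to the 0-2 range used in training. The helper name `summarize_prediction` and the dummy tensors are hypothetical, and the mapping from class index to emotion name is not stated in this card, so the index is left as-is.

```python
import torch

def summarize_prediction(label_logits: torch.Tensor, intensity_pred: torch.Tensor):
    """Turn the raw outputs for ONE clip into readable values.

    label_logits: shape (1, num_labels); intensity_pred: shape (1,).
    Hypothetical helper, not part of the original model code.
    """
    probs = torch.softmax(label_logits, dim=-1)   # logits -> class probabilities
    confidence, label_id = probs.max(dim=-1)      # most likely class and its probability
    # Assumption: intensity targets lie in [0, 2], so clamp the raw regression output.
    intensity = intensity_pred.clamp(min=0.0, max=2.0)
    return label_id.item(), confidence.item(), intensity.item()

# Dummy tensors standing in for (label2_logits, intensity_preds)
example_logits = torch.tensor([[0.1, 2.0, -1.0, 0.3, 0.0, 0.5, -0.2]])
example_intensity = torch.tensor([2.4])

label_id, confidence, intensity = summarize_prediction(example_logits, example_intensity)
print(f"class index: {label_id}, confidence: {confidence:.3f}, intensity: {intensity}")
```

Whether clamping is appropriate depends on how the intensity values are consumed downstream; it is shown here only as one option.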



Training Details

Training Data

This model was trained on AI Hub's conversational speech dataset for emotion classification (https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=263), using 1,000 samples per label for a total of 7,000 samples.
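
How the 1,000 samples per label were selected is not described here. The following is a minimal sketch of one way to draw a balanced subset, assuming the data is available as `(audio_path, label)` pairs; that record layout, the function name `sample_per_label`, and the fixed seed are all hypothetical.

```python
import random
from collections import defaultdict

def sample_per_label(records, per_label, seed=0):
    """Pick `per_label` random records for every label.

    `records` is an iterable of (audio_path, label) pairs (hypothetical layout).
    """
    by_label = defaultdict(list)
    for path, label in records:
        by_label[label].append(path)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for label, paths in sorted(by_label.items()):
        if len(paths) < per_label:
            raise ValueError(f"label {label!r} has only {len(paths)} records")
        subset.extend((p, label) for p in rng.sample(paths, per_label))
    return subset

# Toy data: 7 labels with 5 clips each, keeping 3 clips per label
toy_records = [(f"clip_{label}_{i}.wav", label) for label in range(7) for i in range(5)]
toy_subset = sample_per_label(toy_records, per_label=3)
print(len(toy_subset))  # 21
```

With the real dataset, `per_label=1000` over seven labels would give the 7,000 samples mentioned above.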

Training Procedure

The model is designed as a multi-task model that simultaneously learns seven emotions (happiness, anger, disgust, fear, neutral, sadness, surprise) and the intensity of each emotion (0-2).
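
The loss used for this multi-task setup is not specified in this card. As a sketch under that caveat, a common choice is to add a cross-entropy term for the emotion class to a mean-squared-error term for the intensity; the function name `multitask_loss` and the `intensity_weight` parameter below are assumptions, not the author's confirmed training objective.

```python
import torch
from torch import nn

def multitask_loss(label_logits, intensity_preds, label_targets, intensity_targets,
                   intensity_weight=1.0):
    """Cross-entropy on the emotion class plus weighted MSE on the intensity.

    Assumed formulation; the actual training loss is not documented.
    """
    ce = nn.functional.cross_entropy(label_logits, label_targets)
    mse = nn.functional.mse_loss(intensity_preds, intensity_targets)
    return ce + intensity_weight * mse

# Dummy batch of 4 clips with 7 emotion classes
batch_logits = torch.zeros(4, 7)                        # uniform class scores
batch_intensity = torch.tensor([0.0, 1.0, 2.0, 1.0])    # predicted intensities
label_targets = torch.tensor([0, 3, 6, 2])
intensity_targets = torch.tensor([0.0, 1.0, 2.0, 1.0])  # matches the predictions

loss = multitask_loss(batch_logits, batch_intensity, label_targets, intensity_targets)
print(loss.item())
```

In this toy batch the intensity term is zero and the logits are uniform, so the loss reduces to the cross-entropy of a uniform guess over seven classes, ln(7).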

Training Hyperparameters

Hyperparameter         Base
Learning Rate          1e-5
Learning Rate Decay    0.8
Batch Size             8
Weight Decay           0.01
Epochs                 30
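
The values above could be wired into an optimizer roughly as follows. This is a sketch under two assumptions that the card does not confirm: that the optimizer is AdamW, and that "Learning Rate Decay 0.8" means the learning rate is multiplied by 0.8 after every epoch (it could instead refer to layer-wise decay). The `nn.Linear` module is only a stand-in for the real model.

```python
import torch
from torch import nn

model = nn.Linear(768, 7)  # stand-in module, not the actual Hubert model

# Assumption: AdamW with the listed learning rate and weight decay
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
# Assumption: per-epoch exponential decay with factor 0.8
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.8)

lrs = []
for epoch in range(3):
    lrs.append(optimizer.param_groups[0]["lr"])  # learning rate used in this epoch
    optimizer.step()   # in real training this follows the backward pass
    scheduler.step()   # decay once per epoch

print(lrs)
```

Under these assumptions the learning rate starts at 1e-5 and shrinks by 20% each epoch.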