AST Fine-Tuned Model for Emotion Classification

This is an Audio Spectrogram Transformer (AST) model fine-tuned to classify emotions in speech audio. It was fine-tuned on the CREMA-D dataset and predicts one of six emotion categories. The base model is MIT's pre-trained AST checkpoint, MIT/ast-finetuned-audioset-10-10-0.4593.


Model Details

  • Base Model: MIT/ast-finetuned-audioset-10-10-0.4593
  • Fine-Tuned Dataset: CREMA-D
  • Architecture: Audio Spectrogram Transformer (AST)
  • Model Type: Single-label classification
  • Input Features: Log-Mel Spectrograms (128 mel bins)
  • Output Classes:
    • ANG: Anger
    • DIS: Disgust
    • FEA: Fear
    • HAP: Happiness
    • NEU: Neutral
    • SAD: Sadness

Model Configuration

  • Hidden Size: 768
  • Number of Attention Heads: 12
  • Number of Hidden Layers: 12
  • Patch Size: 16
  • Maximum Length: 1024
  • Dropout Probability: 0.0
  • Activation Function: GELU (Gaussian Error Linear Unit)
  • Optimizer: Adam
  • Learning Rate: 1e-4
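
The patch size, mel-bin count, and maximum length above determine how many tokens the transformer processes per clip. The sketch below works through that arithmetic. It assumes frequency and time strides of 10, which the "10-10" in the base checkpoint name suggests but this card does not state, and two extra tokens (classification and distillation) prepended to the patch sequence. The function name `ast_sequence_length` is hypothetical.

```python
def ast_sequence_length(num_mel_bins=128, max_length=1024, patch_size=16,
                        frequency_stride=10, time_stride=10, extra_tokens=2):
    """Count the tokens the transformer sees for one spectrogram.

    Strides and extra_tokens are assumptions, not values from this card.
    """
    # Overlapping patches along each axis of the (mel bins x frames) input.
    freq_patches = (num_mel_bins - patch_size) // frequency_stride + 1
    time_patches = (max_length - patch_size) // time_stride + 1
    num_patches = freq_patches * time_patches
    return freq_patches, time_patches, num_patches + extra_tokens


freq_patches, time_patches, seq_len = ast_sequence_length()
print(freq_patches, time_patches, seq_len)  # 12 101 1214
```

Under these assumptions each clip becomes 12 x 101 = 1212 patches, so self-attention runs over a sequence of 1214 tokens.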

Training Details

  • Dataset: CREMA-D (Emotion-Labeled Speech Data)
  • Data Augmentation:
    • Noise injection
    • Time shifting
    • Speed perturbation
  • Fine-Tuning Epochs: 5
  • Batch Size: 16
  • Learning Rate Scheduler: Linear decay
  • Best Validation Accuracy: 60.71%
  • Best Checkpoint: ./results/checkpoint-1119
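
The card does not say how the three augmentations were implemented. As an illustration only, the following is a minimal NumPy sketch of noise injection, time shifting, and speed perturbation on a raw waveform. The function names, the SNR of 20 dB, the shift of 1600 samples, and the rate of 1.1 are all assumptions chosen for the example, not the settings used in training.

```python
import numpy as np


def add_noise(wave, snr_db, rng):
    """Add white Gaussian noise at a target signal-to-noise ratio in dB."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise


def time_shift(wave, shift):
    """Shift by `shift` samples (positive = later), padding with zeros."""
    shifted = np.zeros_like(wave)
    if shift > 0:
        shifted[shift:] = wave[:-shift]
    elif shift < 0:
        shifted[:shift] = wave[-shift:]
    else:
        shifted[:] = wave
    return shifted


def speed_perturb(wave, rate):
    """Resample by linear interpolation; rate > 1 shortens the clip.

    This simple resampling changes pitch along with tempo.
    """
    new_len = int(round(len(wave) / rate))
    old_idx = np.arange(len(wave))
    new_idx = np.linspace(0, len(wave) - 1, new_len)
    return np.interp(new_idx, old_idx, wave)


rng = np.random.default_rng(0)
# Synthetic 1-second, 220 Hz tone at 16 kHz stands in for a speech clip.
wave = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)

noisy = add_noise(wave, snr_db=20, rng=rng)
shifted = time_shift(wave, shift=1600)
faster = speed_perturb(wave, rate=1.1)
print(noisy.shape, shifted.shape, faster.shape)
```

Augmentations like these are typically applied to the waveform before the feature extractor computes the log-mel spectrogram, so each epoch sees slightly different inputs.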

How to Use

Load the Model

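import librosa  # assumption: any loader that returns a 16 kHz mono float array works
import torch
from transformers import AutoModelForAudioClassification, AutoProcessor

# Load the model and processor
model = AutoModelForAudioClassification.from_pretrained("forwarder1121/ast-finetuned-model")
processor = AutoProcessor.from_pretrained("forwarder1121/ast-finetuned-model")
model.eval()

# Load the audio as a 16 kHz mono waveform; the processor expects raw samples, not a file path
waveform, _ = librosa.load("path_to_audio.wav", sr=16000, mono=True)

# Convert the waveform to log-mel spectrogram features
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

# Make predictions without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = outputs.logits.argmax(-1).item()

# id2label is keyed by integer class index once the config is loaded
print(f"Predicted emotion: {model.config.id2label[predicted_class]}")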
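
The argmax above returns only the top class. When a confidence estimate for each emotion is useful, the logits can be passed through a softmax. The sketch below does this in plain Python on made-up logits. The label order in `LABELS` is an assumption for illustration; the authoritative mapping is `model.config.id2label`. The helper names `softmax` and `rank_emotions` are hypothetical.

```python
import math

# Assumed label order; check model.config.id2label for the real mapping.
LABELS = ["ANG", "DIS", "FEA", "HAP", "NEU", "SAD"]


def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_emotions(logits, labels=LABELS):
    """Return (label, probability) pairs sorted from most to least likely."""
    probs = softmax(logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Made-up logits standing in for outputs.logits[0].tolist()
example_logits = [0.2, -1.0, 0.5, 2.1, 0.9, -0.3]
ranked = rank_emotions(example_logits)
for label, prob in ranked:
    print(f"{label}: {prob:.3f}")
```

With real model output, `example_logits` would be replaced by `outputs.logits[0].tolist()`.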

Metrics

Validation Results

  • Best Validation Accuracy: 60.71%
  • Validation Loss: 1.1126

Evaluation Details

  • Eval Dataset: CREMA-D test split
  • Batch Size: 16
  • Number of Steps: 94
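
The step count and batch size bound the size of the evaluation set, and the class count gives a rough baseline for the reported accuracy. The short calculation below makes both explicit. It assumes the final partial batch was not dropped and, for the chance baseline, that the six classes are roughly balanced; neither is stated in this card.

```python
batch_size = 16
num_steps = 94
num_classes = 6
best_accuracy = 0.6071

# If ceil(n / batch_size) == num_steps, n lies in this inclusive range.
min_samples = (num_steps - 1) * batch_size + 1
max_samples = num_steps * batch_size

# Uniform-guess baseline, assuming roughly balanced classes.
chance_accuracy = 1 / num_classes

print(min_samples, max_samples)  # 1489 1504
print(f"chance: {chance_accuracy:.2%}, reported: {best_accuracy:.2%}")
```

So the evaluation covered between 1489 and 1504 clips, and the reported accuracy sits well above the uniform-guess baseline of about one in six.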

Limitations

  • The model was trained on CREMA-D, which has a specific set of speech data. It may not generalize well to datasets with different accents, speech styles, or languages.
  • Best validation accuracy is 60.71% across six classes, so the model needs further improvement before real-world deployment.

Acknowledgments

This work is based on the Audio Spectrogram Transformer (AST) model by MIT, fine-tuned for emotion classification. Special thanks to the developers of Hugging Face and the CREMA-D dataset contributors.


License

The model is shared under the MIT License. Refer to the licensing details in the repository.


Citation

If you use this model in your work, please cite:

@misc{ast-finetuned-model,
  author = {forwarder1121},
  title = {Fine-Tuned Audio Spectrogram Transformer for Emotion Classification},
  year = {2024},
  url = {https://huggingface.co/forwarder1121/ast-finetuned-model},
}

Contact

For questions, reach out to [email protected].
