Model Card for Fine-Tuned gemma-2-2b-it on Custom Korean Sentiment Dataset

Model Summary

This model is a fine-tuned version of google/gemma-2-2b-it, trained to classify sentiment in Korean text into four categories: 무감정 (neutral), 슬픔 (sadness), 기쁨 (joy), and 분노 (anger). The model utilizes LoRA (Low-Rank Adaptation) for efficient fine-tuning and 4-bit quantization (NF4) for memory efficiency using BitsAndBytes. A custom weighted loss function was applied to handle class imbalance within the dataset.

The model is suitable for multi-class sentiment classification in Korean. Because its weights are quantized to 4 bits, it can run in environments with limited memory.

Model Details

Developed By:

This model was fine-tuned by [Your Name or Organization] using Hugging Face's peft and transformers libraries with a custom Korean sentiment dataset.

Model Type:

This is a transformer-based model for multi-class sentiment classification in the Korean language.

Language:

  • Language(s): Korean

License:

[Add relevant license here]

Finetuned From:

  • Base Model: google/gemma-2-2b-it

Framework Versions:

  • Transformers: 4.44.2
  • PEFT: 0.12.0
  • Datasets: 3.0.1
  • PyTorch: 2.4.1+cu121

Intended Uses & Limitations

Intended Use:

This model is suitable for applications requiring multi-class sentiment classification in Korean, such as chatbots, social media monitoring, or customer feedback analysis.

Out-of-Scope Use:

The model is not designed for non-Korean or multilingual text, or for sentiment schemes with classes other than the four it was trained on, and it may perform poorly in those settings.

Limitations:

  • Bias: As the model is trained on a custom dataset, it may reflect specific biases inherent in that data.
  • Generalization: Performance may drop on data that differs from the training data, such as datasets labeled under a different sentiment scheme.

Model Architecture

Quantization:

The model uses 4-bit quantization via BitsAndBytes for efficient memory usage, which enables it to run on lower-resource hardware.
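
A minimal sketch of how a 4-bit NF4 setup like this is commonly expressed with the transformers `BitsAndBytesConfig` class. Only `load_in_4bit` and the NF4 quantization type come from this card; double quantization and the bfloat16 compute dtype are assumptions. The imports are deferred so that the settings can be inspected without a GPU.

```python
# Settings stated in this card: 4-bit weights with the NF4 data type.
QUANT_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
}

# Assumption (not stated in the card): a common companion setting.
ASSUMED_QUANT_KWARGS = {
    "bnb_4bit_use_double_quant": True,
}


def build_quant_config():
    """Build a BitsAndBytesConfig; needs torch and transformers when called."""
    import torch
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumption
        **QUANT_KWARGS,
        **ASSUMED_QUANT_KWARGS,
    )
```

The resulting object would be passed as `quantization_config=` to `from_pretrained` when loading the base model.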

LoRA Configuration:

LoRA (Low-Rank Adaptation) was applied to the attention and MLP projection modules, which allows parameter-efficient fine-tuning. The target modules are:

  • down_proj, gate_proj, q_proj, o_proj, up_proj, v_proj, k_proj

LoRA parameters are:

  • r = 16, lora_alpha = 32, lora_dropout = 0.05
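
The settings above could be written as a peft `LoraConfig` roughly as follows. The rank, alpha, dropout, and target modules are taken from this card; `task_type="SEQ_CLS"` is an assumption based on the classification use case. The sketch also computes the standard LoRA scaling factor, `lora_alpha / r`, which multiplies the low-rank update.

```python
# Values taken from this card.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
}

# With standard LoRA scaling, the low-rank update is multiplied by lora_alpha / r.
lora_scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]


def build_lora_config():
    """Build a peft LoraConfig; the import is deferred until called."""
    from peft import LoraConfig

    # task_type="SEQ_CLS" is an assumption, not stated in the card.
    return LoraConfig(task_type="SEQ_CLS", **LORA_KWARGS)


print(lora_scaling)  # 2.0
```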

Custom Weighted Loss:

A custom weighted loss function was implemented to handle class imbalance, using the following weights:

\[ \text{weights} = [0.2032,\ 0.2704,\ 0.2529,\ 0.2735] \]

These weights correspond to the classes: 무감정, 슬픔, 기쁨, 분노, respectively.
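
The card does not say how the loss was implemented. The sketch below assumes a class-weighted cross-entropy with the same normalization that PyTorch's `CrossEntropyLoss(weight=...)` uses under its default mean reduction: each example's negative log-likelihood is multiplied by the weight of its true class, and the sum is divided by the sum of those weights. It is written in plain Python to make the arithmetic visible, and the logits are hypothetical.

```python
import math

# Class weights from this card, in label order: 무감정, 슬픔, 기쁨, 분노.
CLASS_WEIGHTS = [0.2032, 0.2704, 0.2529, 0.2735]


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def weighted_cross_entropy(batch_logits, labels, weights=CLASS_WEIGHTS):
    """Weighted mean of per-example negative log-likelihoods."""
    total_loss = 0.0
    total_weight = 0.0
    for logits, label in zip(batch_logits, labels):
        nll = -log_softmax(logits)[label]
        total_loss += weights[label] * nll
        total_weight += weights[label]
    return total_loss / total_weight


# Hypothetical logits for two examples labelled 슬픔 (1) and 무감정 (0).
logits = [[0.1, 2.0, -0.5, 0.3], [1.5, 0.2, 0.1, -1.0]]
loss = weighted_cross_entropy(logits, [1, 0])
```

With these weights, a mistake on 분노 counts about 1.35 times as much as the same mistake on 무감정 (0.2735 / 0.2032). In a Trainer-based setup, a loss like this is usually applied by overriding the trainer's loss computation, though whether that was done here is not stated.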

Training Details

Dataset:

The model was trained on a custom Korean sentiment analysis dataset. This dataset consists of text samples labeled with one of four sentiment classes: 무감정, 슬픔, 기쁨, and 분노.

  • Train Set Size: Not reported (custom dataset)
  • Test Set Size: Not reported (custom dataset)
  • Classes: 4 (무감정, 슬픔, 기쁨, 분노)

Preprocessing:

Data was tokenized using the google/gemma-2-2b-it tokenizer with a maximum sequence length of 128. The preprocessing steps included padding and truncation to ensure consistent input lengths.
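
To illustrate what padding and truncation to 128 tokens do to each example, here is a small plain-Python sketch. The padding id of 0 and right-side padding are assumptions made for the illustration; in practice the tokenizer's own `pad_token_id` and `padding_side` settings decide both.

```python
MAX_LENGTH = 128  # maximum sequence length stated in this card
PAD_ID = 0        # assumption: placeholder id, use tokenizer.pad_token_id in practice


def pad_or_truncate(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]             # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad       # right padding (assumed side)
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Hypothetical token ids: one short sequence and one longer than 128 tokens.
short_ids, short_mask = pad_or_truncate([2, 101, 102, 103])
long_ids, long_mask = pad_or_truncate(list(range(1, 201)))
```

With the real tokenizer, a call such as `tokenizer(texts, padding="max_length", truncation=True, max_length=128)` would give fixed-length inputs of this kind. Whether training used fixed-length or per-batch padding is not stated in this card.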

Hyperparameters:

  • Learning Rate: 2e-4
  • Batch Size (train): 8
  • Batch Size (eval): 8
  • Epochs: 4
  • Optimizer: AdamW (with 8-bit optimization)
  • Weight Decay: 0.01
  • Gradient Accumulation Steps: 2
  • Evaluation Steps: 500
  • Logging Steps: 500
  • Metric for Best Model: F1 (weighted)
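
As a rough sketch, the hyperparameters above map onto transformers `TrainingArguments` keyword arguments as shown below. Several points are assumptions: that the batch sizes are per device, that evaluation runs on a step schedule, that the 8-bit optimizer is the `adamw_bnb_8bit` variant, that the metrics function returns an `f1` key, and the output path, which is hypothetical.

```python
# Hyperparameters from this card, expressed as TrainingArguments keyword arguments.
TRAINING_KWARGS = {
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 8,   # assumes the stated batch size is per device
    "per_device_eval_batch_size": 8,
    "num_train_epochs": 4,
    "weight_decay": 0.01,
    "gradient_accumulation_steps": 2,
    "eval_strategy": "steps",           # assumed from "Evaluation Steps: 500"
    "eval_steps": 500,
    "logging_steps": 500,
    "metric_for_best_model": "f1",      # assumes compute_metrics returns an "f1" key
}

# Assumptions: the card names neither the optimizer variant nor an output path.
ASSUMED_KWARGS = {
    "optim": "adamw_bnb_8bit",
    "output_dir": "outputs",  # hypothetical path
}

# Examples consumed per optimizer step on a single device.
effective_batch_size = (
    TRAINING_KWARGS["per_device_train_batch_size"]
    * TRAINING_KWARGS["gradient_accumulation_steps"]
)


def build_training_args():
    """Build TrainingArguments; the import is deferred until called."""
    from transformers import TrainingArguments

    return TrainingArguments(**TRAINING_KWARGS, **ASSUMED_KWARGS)


print(effective_batch_size)  # 16
```

With a batch size of 8 and 2 gradient accumulation steps, each optimizer update sees 16 examples per device.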

Evaluation

Metrics:

The model was evaluated using the following metrics:

  • Accuracy
  • F1 Score (weighted)
  • Precision (weighted)
  • Recall (weighted)

The weighted variants compute each metric per class and then average the results by class support (the number of true examples in each class), so the scores reflect the class distribution of the evaluation set.
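
The card does not say which library computed these metrics. The plain-Python sketch below shows one way to compute them, following the same definition as scikit-learn's `average="weighted"` option with undefined ratios treated as 0. The labels and predictions are hypothetical.

```python
def weighted_metrics(y_true, y_pred, num_classes=4):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        support = tp + fn  # number of true examples of class c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / support if support else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        totals["precision"] += support * precision
        totals["recall"] += support * recall
        totals["f1"] += support * f1
    metrics = {name: value / n for name, value in totals.items()}
    metrics["accuracy"] = sum(t == p for t, p in zip(y_true, y_pred)) / n
    return metrics


# Hypothetical labels: 0=무감정, 1=슬픔, 2=기쁨, 3=분노.
y_true = [0, 0, 1, 1, 2, 3]
y_pred = [0, 1, 1, 1, 2, 3]
scores = weighted_metrics(y_true, y_pred)
```

Note that support-weighted recall always equals accuracy, so the two values carry the same information in a single-label setting like this one.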

Code Example:

You can load the fine-tuned model and use it for inference on your own data as follows:

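import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("your-model-directory")
tokenizer = AutoTokenizer.from_pretrained("your-model-directory")
model.eval()  # disable dropout for inference

# Tokenize input text, using the same maximum length as in training
text = "이 영화는 정말 슬퍼요."  # "This movie is really sad."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
inputs = inputs.to(model.device)  # keep inputs on the same device as the model

# Get predictions without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predicted_class = logits.argmax(-1).item()

# Map prediction to label (neutral, sadness, joy, anger)
id2label = {0: "무감정", 1: "슬픔", 2: "기쁨", 3: "분노"}
print(f"Predicted sentiment: {id2label[predicted_class]}")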