---
language: en
tags:
  - clip
  - breakdance
  - video-classification
  - dance
  - pytorch
  - vision-encoder
license: mit
datasets:
  - custom
library_name: transformers
base_model: openai/clip-vit-large-patch14
pipeline_tag: video-classification
model-index:
  - name: CLIP-Based Break Dance Move Classifier
    results:
      - task:
          type: video-classification
        dataset:
          name: custom_breakdance
          type: custom
        metrics:
          - name: Overall Accuracy
            type: accuracy
            value:
              - specify %
          - name: Windmill Precision
            type: precision
            value:
              - specify %
          - name: Halo Precision
            type: precision
            value:
              - specify %
          - name: Swipe Precision
            type: precision
            value:
              - specify %
---

CLIP-Based Break Dance Move Classifier

This model is a fine-tuned version of CLIP (ViT-Large/14) that classifies three break dance power moves (windmills, halos, and swipes) from video frames.

Model Description

  • Model Type: Custom CLIP-based architecture (VariableLengthCLIP)
  • Base Model: CLIP ViT-Large/14 (for feature extraction)
  • Architecture (see the sketch after this list):
    • Uses CLIP's vision encoder for frame-level feature extraction
    • Processes multiple frames from a video
    • Averages frame features
    • Projects to 3 classes via a learned linear layer
  • Task: Video Classification
  • Training Data: Custom break dance video dataset
  • Output: 3 classes of break dance moves (windmill, halo, swipe)
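
To make the data flow concrete, here is a minimal sketch of this kind of architecture. It is an illustrative re-implementation, not the actual VariableLengthCLIP code from src/models/model.py; the class internals, attribute names, and the use of CLIPVisionModelWithProjection are assumptions.

import torch
import torch.nn as nn
from transformers import CLIPVisionModelWithProjection

class VariableLengthCLIP(nn.Module):
    # Sketch: per-frame CLIP features -> mean pool over frames -> linear classifier
    def __init__(self, num_classes=3, pretrained_model_name="openai/clip-vit-large-patch14"):
        super().__init__()
        self.vision_model = CLIPVisionModelWithProjection.from_pretrained(pretrained_model_name)
        self.classifier = nn.Linear(self.vision_model.config.projection_dim, num_classes)

    def forward(self, pixel_values):
        # pixel_values: (batch, num_frames, 3, H, W)
        batch_size, num_frames = pixel_values.shape[:2]
        flat = pixel_values.flatten(0, 1)  # (batch * num_frames, 3, H, W)
        features = self.vision_model(pixel_values=flat).image_embeds
        features = features.view(batch_size, num_frames, -1).mean(dim=1)  # average frame features
        return self.classifier(features)  # (batch, num_classes) logits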

Usage

import torch
from transformers import CLIPProcessor
from PIL import Image
import cv2
from src.models.model import create_model

# Load model and processor
model = create_model(num_classes=3, pretrained_model_name="openai/clip-vit-large-patch14")
state_dict = torch.load("model.pth", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Process video
def process_video(video_path, model, processor):
    video = cv2.VideoCapture(video_path)
    frames = []

    while video.isOpened():
        ret, frame = video.read()
        if not ret:
            break

        # OpenCV decodes frames as BGR; convert to RGB for the CLIP processor
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frame_pil = Image.fromarray(frame_rgb)
        processed = processor(images=frame_pil, return_tensors="pt")
        frames.append(processed.pixel_values)

    video.release()

    # Stack frames into (num_frames, 3, H, W) and add a batch dimension
    frames_tensor = torch.cat(frames, dim=0)
    with torch.no_grad():
        predictions = model(frames_tensor.unsqueeze(0))

    return predictions
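
The function above returns the raw class scores for a single video. A short usage example follows, assuming the model outputs unnormalized class logits; the label order below is an assumption, so check the training configuration in the repository for the actual index-to-class mapping.

# Hypothetical label order; verify against the repository's class mapping
labels = ["windmill", "halo", "swipe"]

logits = process_video("breakdance_clip.mp4", model, processor)
probs = torch.softmax(logits, dim=-1)
predicted = labels[probs.argmax(dim=-1).item()]
print(predicted, probs.max().item())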

Limitations

  • Model performance may vary with video quality and lighting conditions
  • Best results are achieved with clear, centered shots of the dance moves
  • May have difficulty distinguishing between similar power moves
  • Performance may be affected by unusual camera angles or partial views
  • Currently only supports three specific power moves (windmills, halos, and swipes)

Training Procedure

  • Fine-tuned from the openai/clip-vit-large-patch14 base model (see the training-loop sketch after this list)
  • Training dataset: Custom dataset of break dance videos
  • Dataset size: [specify number] frames from [specify number] different videos
  • Training epochs: [specify number]
  • Learning rate: [specify rate]
  • Batch size: [specify size]
  • Hardware used: [specify GPU/CPU details]
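
The exact training script and hyperparameters live in the repository; the sketch below only illustrates the general shape of the fine-tuning loop described above (cross-entropy over the model's class logits), reusing the model object loaded in the Usage section. The dummy dataset, batch size, learning rate, and epoch count are placeholders, not the values actually used.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the custom break dance dataset:
# 16 "videos" of 8 preprocessed 224x224 frames each, with labels in {0, 1, 2}
dummy_frames = torch.randn(16, 8, 3, 224, 224)
dummy_labels = torch.randint(0, 3, (16,))
train_loader = DataLoader(TensorDataset(dummy_frames, dummy_labels), batch_size=4, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # placeholder learning rate
criterion = torch.nn.CrossEntropyLoss()

model.train()
for epoch in range(3):  # placeholder epoch count
    for frames, labels in train_loader:
        optimizer.zero_grad()
        logits = model(frames)            # (batch, 3) class logits
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()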

Evaluation Results

  • Overall accuracy: [specify %]
  • Per-class performance:
    • Windmills: [specify precision/recall]
    • Halos: [specify precision/recall]
    • Swipes: [specify precision/recall]

Citation

If you use this model in your research or project, please cite:

@misc{clip-breakdance-classifier,
  author = {Bryant Wolf},
  title = {CLIP-Based Break Dance Move Classifier},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/bawolf/clip-breakdance-classifier}}
}