---
language: en
tags:
- clip
- breakdance
- video-classification
- dance
- pytorch
- vision-encoder
license: mit
datasets:
- custom
library_name: transformers
base_model: openai/clip-vit-large-patch14
pipeline_tag: video-classification
model-index:
- name: CLIP-Based Break Dance Move Classifier
results:
- task:
type: video-classification
dataset:
name: custom_breakdance
type: custom
metrics:
- name: Overall Accuracy
type: accuracy
value: [specify %]
- name: Windmill Precision
type: precision
value: [specify %]
- name: Halo Precision
type: precision
value: [specify %]
- name: Swipe Precision
type: precision
value: [specify %]
---
# CLIP-Based Break Dance Move Classifier
This model is a fine-tuned version of CLIP (ViT-Large/14) that classifies break dance power moves (windmills, halos, and swipes) from video frames.
## Model Description
- **Model Type:** Custom CLIP-based architecture (VariableLengthCLIP)
- **Base Model:** CLIP ViT-Large/14 (for feature extraction)
- **Architecture** (see the sketch after this list):
- Uses CLIP's vision encoder for frame-level feature extraction
- Processes multiple frames from a video
- Averages frame features
- Projects to 3 classes via a learned linear layer
- **Task:** Video Classification
- **Training Data:** Custom break dance video dataset
- **Output:** 3 classes of break dance moves (windmill, halo, swipe)
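
The `VariableLengthCLIP` code created by `create_model` ships with the repository and is not reproduced in this card. The sketch below is a hedged reconstruction of the architecture described above; the class name `VariableLengthCLIPSketch` and the use of `CLIPVisionModelWithProjection` are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModelWithProjection

class VariableLengthCLIPSketch(nn.Module):
    """Illustrative sketch: encode each frame with CLIP's vision tower,
    average the frame embeddings, and classify with a linear head."""

    def __init__(self, num_classes=3, pretrained_model_name="openai/clip-vit-large-patch14"):
        super().__init__()
        self.vision_model = CLIPVisionModelWithProjection.from_pretrained(pretrained_model_name)
        self.classifier = nn.Linear(self.vision_model.config.projection_dim, num_classes)

    def forward(self, pixel_values):
        # pixel_values: (batch, num_frames, 3, H, W)
        batch, num_frames = pixel_values.shape[:2]
        frame_embeds = self.vision_model(pixel_values.flatten(0, 1)).image_embeds
        frame_embeds = frame_embeds.view(batch, num_frames, -1)
        video_embed = frame_embeds.mean(dim=1)   # average frame features
        return self.classifier(video_embed)      # (batch, num_classes) logits
```

Averaging frame features keeps the classifier agnostic to clip length, which is what allows videos with different numbers of frames to share the same head.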
## Usage
```python
import torch
import cv2
from PIL import Image
from transformers import CLIPProcessor

from src.models.model import create_model

# Load the fine-tuned classifier and the CLIP processor
model = create_model(num_classes=3, pretrained_model_name="openai/clip-vit-large-patch14")
state_dict = torch.load("model.pth", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def process_video(video_path, model, processor):
    """Read a video, preprocess every frame with the CLIP processor,
    and return the model's predictions for the whole clip."""
    video = cv2.VideoCapture(video_path)
    frames = []
    while video.isOpened():
        ret, frame = video.read()
        if not ret:
            break
        # OpenCV decodes frames as BGR; CLIP expects RGB
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frame_pil = Image.fromarray(frame_rgb)
        processed = processor(images=frame_pil, return_tensors="pt")
        frames.append(processed.pixel_values)
    video.release()

    # Stack frames into (num_frames, 3, H, W), add a batch dimension, and classify
    frames_tensor = torch.cat(frames, dim=0)
    with torch.no_grad():
        predictions = model(frames_tensor.unsqueeze(0))
    return predictions
```
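A short, hypothetical usage example; the video path and the class order in `CLASS_NAMES` are assumptions and should be checked against the label mapping used during training:

```python
CLASS_NAMES = ["windmill", "halo", "swipe"]  # assumed label order

logits = process_video("breakdance_clip.mp4", model, processor)
probs = torch.softmax(logits, dim=-1)
predicted = CLASS_NAMES[probs.argmax(dim=-1).item()]
print(f"Predicted move: {predicted} ({probs.max().item():.1%})")
```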
## Limitations
- Model performance may vary with video quality and lighting conditions
- Best results are achieved with clear, centered shots of the dance moves
- May have difficulty distinguishing between similar power moves
- Performance may be affected by unusual camera angles or partial views
- Currently only supports three specific power moves (windmills, halos, and swipes)
## Training Procedure
- Fine-tuned from CLIP ViT-Large/14 (a minimal training-loop sketch follows this list)
- Training dataset: Custom dataset of break dance videos
- Dataset size: [specify number] frames from [specify number] different videos
- Training epochs: [specify number]
- Learning rate: [specify rate]
- Batch size: [specify size]
- Hardware used: [specify GPU/CPU details]
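
The training script itself is not included in this card. The following is a minimal sketch of what the fine-tuning loop could look like, assuming a user-provided `train_dataset` that yields fixed-length `(frames, label)` pairs and the `model` loaded as in the Usage section; all hyperparameters shown are placeholders, not the values actually used.

```python
import torch
from torch.utils.data import DataLoader

# Placeholder hyperparameters; the real values belong in the list above.
num_epochs = 10
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)  # fixed-length clips assumed
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for epoch in range(num_epochs):
    for frames, labels in train_loader:      # frames: (batch, num_frames, 3, H, W)
        optimizer.zero_grad()
        logits = model(frames)               # (batch, 3) class logits
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
```

Batching clips of different lengths would additionally require a custom `collate_fn`, which is omitted here.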
## Evaluation Results
- Overall accuracy: [specify %]
- Per-class performance:
  - Windmills: [specify precision/recall]
  - Halos: [specify precision/recall]
  - Swipes: [specify precision/recall]
## Citation
If you use this model in your research or project, please cite:
```bibtex
@misc{clip-breakdance-classifier,
  author = {Bryant Wolf},
  title = {CLIP-Based Break Dance Move Classifier},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/bawolf/clip-breakdance-classifier}}
}
```