---
language: en
tags:
- clip
- breakdance
- video-classification
- dance
- pytorch
- vision-encoder
license: mit
datasets:
- custom
library_name: transformers
base_model: openai/clip-vit-large-patch14
pipeline_tag: video-classification
model-index:
- name: CLIP-Based Break Dance Move Classifier
  results:
  - task:
      type: video-classification
    dataset:
      name: custom_breakdance
      type: custom
    metrics:
    - name: Overall Accuracy
      type: accuracy
      value: [specify %]
    - name: Windmill Precision
      type: precision
      value: [specify %]
    - name: Halo Precision
      type: precision
      value: [specify %]
    - name: Swipe Precision
      type: precision
      value: [specify %]
---

# CLIP-Based Break Dance Move Classifier

This model is a fine-tuned version of CLIP (ViT-Large/14) that classifies break dance power moves (windmills, halos, and swipes) from video frames.

## Model Description

- **Model Type:** Custom CLIP-based architecture (VariableLengthCLIP)
- **Base Model:** CLIP ViT-Large/14 (for feature extraction)
- **Architecture:**
  - Uses CLIP's vision encoder for frame-level feature extraction
  - Processes a variable number of frames from a video
  - Averages the frame features
  - Projects the pooled features to 3 classes via a learned linear layer
- **Task:** Video Classification
- **Training Data:** Custom break dance video dataset
- **Output:** 3 classes of break dance moves (windmill, halo, swipe)

## Usage

```python
import torch
from transformers import CLIPProcessor
from PIL import Image
import cv2
from src.models.model import create_model

# Load the classifier and the matching CLIP processor
model = create_model(num_classes=3, pretrained_model_name="openai/clip-vit-large-patch14")
state_dict = torch.load("model.pth", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def process_video(video_path, model, processor):
    """Read every frame of a video and classify the clip."""
    video = cv2.VideoCapture(video_path)
    frames = []

    while video.isOpened():
        ret, frame = video.read()
        if not ret:
            break
        # OpenCV decodes to BGR; CLIP's processor expects RGB
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frame_pil = Image.fromarray(frame_rgb)
        processed = processor(images=frame_pil, return_tensors="pt")
        frames.append(processed.pixel_values)

    video.release()

    # Stack frames into (num_frames, 3, H, W), add a batch dimension, and classify
    frames_tensor = torch.cat(frames, dim=0)
    with torch.no_grad():
        predictions = model(frames_tensor.unsqueeze(0))

    return predictions
```
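Note that `create_model` comes from this repository's `src/models/model.py`, not from `transformers`. For readers without the repository checked out, here is a minimal sketch of what the `VariableLengthCLIP` architecture described above could look like; the class internals are assumptions reconstructed from the Model Description, not the actual source:

```python
import torch.nn as nn
from transformers import CLIPVisionModel

class VariableLengthCLIP(nn.Module):
    """Illustrative sketch: CLIP vision encoder + mean pooling + linear head."""

    def __init__(self, num_classes=3, pretrained_model_name="openai/clip-vit-large-patch14"):
        super().__init__()
        # CLIP's vision tower produces one embedding per frame
        self.vision_encoder = CLIPVisionModel.from_pretrained(pretrained_model_name)
        hidden_size = self.vision_encoder.config.hidden_size  # 1024 for ViT-L/14
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, pixel_values):
        # pixel_values: (batch, num_frames, 3, H, W)
        b, t, c, h, w = pixel_values.shape
        features = self.vision_encoder(
            pixel_values=pixel_values.view(b * t, c, h, w)
        ).pooler_output                                       # (b * t, hidden_size)
        video_features = features.view(b, t, -1).mean(dim=1)  # average over frames
        return self.classifier(video_features)                # (b, num_classes)
```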
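`process_video` returns raw logits over the three classes. A short example of turning them into a readable prediction, assuming the label order matches the output list above (windmill, halo, swipe); verify this against the label map used during training:

```python
import torch.nn.functional as F

CLASS_NAMES = ["windmill", "halo", "swipe"]  # assumed order; confirm with the training label map

logits = process_video("my_clip.mp4", model, processor)  # replace with your video path
probs = F.softmax(logits.squeeze(0), dim=-1)

for name, p in zip(CLASS_NAMES, probs.tolist()):
    print(f"{name}: {p:.3f}")
print("Predicted move:", CLASS_NAMES[int(probs.argmax())])
```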
## Limitations

- Model performance may vary with video quality and lighting conditions
- Best results are achieved with clear, centered shots of the dance moves
- May have difficulty distinguishing between similar power moves
- Performance may be affected by unusual camera angles or partial views
- Currently supports only three specific power moves (windmills, halos, and swipes)

## Training Procedure

- Fine-tuned from the CLIP ViT-Large/14 checkpoint
- Training dataset: custom dataset of break dance videos
- Dataset size: [specify number] frames from [specify number] different videos
- Training epochs: [specify number]
- Learning rate: [specify rate]
- Batch size: [specify size]
- Hardware used: [specify GPU/CPU details]

## Evaluation Results

- Overall accuracy: [specify %]

Per-class performance:

- Windmills: [specify precision/recall]
- Halos: [specify precision/recall]
- Swipes: [specify precision/recall]

## Citation

If you use this model in your research or project, please cite:

```bibtex
@misc{clip-breakdance-classifier,
  author = {Bryant Wolf},
  title = {CLIP-Based Break Dance Move Classifier},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/bawolf/clip-breakdance-classifier}}
}
```