---
language: en
tags:
- clip
- breakdance
- video-classification
- dance
- pytorch
- vision-encoder
license: mit
datasets:
- custom
library_name: transformers
base_model: openai/clip-vit-large-patch14
pipeline_tag: video-classification
model-index:
- name: CLIP-Based Break Dance Move Classifier
results:
- task:
type: video-classification
dataset:
name: custom_breakdance
type: custom
metrics:
- name: Overall Accuracy
type: accuracy
value: [specify %]
- name: Windmill Precision
type: precision
value: [specify %]
- name: Halo Precision
type: precision
value: [specify %]
- name: Swipe Precision
type: precision
value: [specify %]
---
# CLIP-Based Break Dance Move Classifier
This model is a fine-tuned version of CLIP (ViT-Large/14) that classifies break dance power moves (windmills, halos, and swipes) from video frames.
## Model Description
- **Model Type:** Custom CLIP-based architecture (VariableLengthCLIP)
- **Base Model:** CLIP ViT-Large/14 (for feature extraction)
- **Architecture** (see the sketch after this list):
- Uses CLIP's vision encoder for frame-level feature extraction
- Processes multiple frames from a video
- Averages frame features
- Projects to 3 classes via a learned linear layer
- **Task:** Video Classification
- **Training Data:** Custom break dance video dataset
- **Output:** 3 classes of break dance moves (windmill, halo, swipe)
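
The `VariableLengthCLIP` code created by `create_model` ships with the repository and is not reproduced in this card. The sketch below is a hedged reconstruction of the architecture described above; the class name `VariableLengthCLIPSketch` and the use of `CLIPVisionModelWithProjection` are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModelWithProjection

class VariableLengthCLIPSketch(nn.Module):
    """Illustrative sketch: encode each frame with CLIP's vision tower,
    average the frame embeddings, and classify with a linear head."""

    def __init__(self, num_classes=3, pretrained_model_name="openai/clip-vit-large-patch14"):
        super().__init__()
        self.vision_model = CLIPVisionModelWithProjection.from_pretrained(pretrained_model_name)
        self.classifier = nn.Linear(self.vision_model.config.projection_dim, num_classes)

    def forward(self, pixel_values):
        # pixel_values: (batch, num_frames, 3, H, W)
        batch, num_frames = pixel_values.shape[:2]
        frame_embeds = self.vision_model(pixel_values.flatten(0, 1)).image_embeds
        frame_embeds = frame_embeds.view(batch, num_frames, -1)
        video_embed = frame_embeds.mean(dim=1)   # average frame features
        return self.classifier(video_embed)      # (batch, num_classes) logits
```

Averaging frame features keeps the classifier agnostic to clip length, which is what allows videos with different numbers of frames to share the same head.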
## Usage
```python
import torch
import cv2
from PIL import Image
from transformers import CLIPProcessor

from src.models.model import create_model

# Load the fine-tuned classifier and the CLIP processor
model = create_model(num_classes=3, pretrained_model_name="openai/clip-vit-large-patch14")
state_dict = torch.load("model.pth", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def process_video(video_path, model, processor):
    """Read a video, preprocess every frame with the CLIP processor,
    and return the model's predictions for the whole clip."""
    video = cv2.VideoCapture(video_path)
    frames = []
    while video.isOpened():
        ret, frame = video.read()
        if not ret:
            break
        # OpenCV decodes frames as BGR; CLIP expects RGB
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frame_pil = Image.fromarray(frame_rgb)
        processed = processor(images=frame_pil, return_tensors="pt")
        frames.append(processed.pixel_values)
    video.release()

    # Stack frames into (num_frames, 3, H, W), add a batch dimension, and classify
    frames_tensor = torch.cat(frames, dim=0)
    with torch.no_grad():
        predictions = model(frames_tensor.unsqueeze(0))
    return predictions
```
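A short, hypothetical usage example; the video path and the class order in `CLASS_NAMES` are assumptions and should be checked against the label mapping used during training:

```python
CLASS_NAMES = ["windmill", "halo", "swipe"]  # assumed label order

logits = process_video("breakdance_clip.mp4", model, processor)
probs = torch.softmax(logits, dim=-1)
predicted = CLASS_NAMES[probs.argmax(dim=-1).item()]
print(f"Predicted move: {predicted} ({probs.max().item():.1%})")
```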
## Limitations
- Model performance may vary with video quality and lighting conditions
- Best results are achieved with clear, centered shots of the dance moves
- May have difficulty distinguishing between similar power moves
- Performance may be affected by unusual camera angles or partial views
- Currently only supports three specific power moves (windmills, halos, and swipes)
## Training Procedure
- Fine-tuned from CLIP ViT-Large/14 (a minimal training-loop sketch follows this list)
- Training dataset: Custom dataset of break dance videos
- Dataset size: [specify number] frames from [specify number] different videos
- Training epochs: [specify number]
- Learning rate: [specify rate]
- Batch size: [specify size]
- Hardware used: [specify GPU/CPU details]
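
The training script itself is not included in this card. The following is a minimal sketch of what the fine-tuning loop could look like, assuming a user-provided `train_dataset` that yields fixed-length `(frames, label)` pairs and the `model` loaded as in the Usage section; all hyperparameters shown are placeholders, not the values actually used.

```python
import torch
from torch.utils.data import DataLoader

# Placeholder hyperparameters; the real values belong in the list above.
num_epochs = 10
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)  # fixed-length clips assumed
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for epoch in range(num_epochs):
    for frames, labels in train_loader:      # frames: (batch, num_frames, 3, H, W)
        optimizer.zero_grad()
        logits = model(frames)               # (batch, 3) class logits
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
```

Batching clips of different lengths would additionally require a custom `collate_fn`, which is omitted here.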
## Evaluation Results
- Overall accuracy: [specify %]
- Per-class performance:
  - Windmills: [specify precision/recall]
  - Halos: [specify precision/recall]
  - Swipes: [specify precision/recall]
## Citation
If you use this model in your research or project, please cite:
```bibtex
@misc{clip-breakdance-classifier,
  author = {Bryant Wolf},
  title = {CLIP-Based Break Dance Move Classifier},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/bawolf/clip-breakdance-classifier}}
}
```