---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
license: mit
base_model:
- microsoft/swin-base-patch4-window7-224
pipeline_tag: image-feature-extraction
---

This is a fashion image feature extractor. It uses [microsoft/swin-base-patch4-window7-224](https://huggingface.co/microsoft/swin-base-patch4-window7-224) as the base image encoder, with a single 128-dimensional fully connected layer added on top to reduce the embedding size.

The training data consists of anchor (product regions detected in posts) - positive (product thumbnail) image pairs. Within each batch, every sample other than an anchor's own positive is treated as a negative, and the model is trained to minimize the distance between anchor-positive pairs while maximizing the distance between anchor-negative pairs. This is the contrastive learning setup also used to train OpenAI's CLIP model. Initially, anchor-positive-negative triplets were explicitly constructed in a 1:1:1 ratio and trained with a triplet loss, but in-batch negative sampling with a contrastive loss performed much better because each anchor learns from many more negatives. A minimal sketch of this loss is included at the end of this card.

The matching object-detection model is available here -> [https://huggingface.co/yainage90/fashion-object-detection](https://huggingface.co/yainage90/fashion-object-detection)

More details can be found in this GitHub repo -> [fashion-visual-search](https://github.com/yainage90/fashion-visual-search)

```python
from PIL import Image
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import transforms
from transformers import AutoImageProcessor, SwinModel, SwinConfig
from huggingface_hub import PyTorchModelHubMixin

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

ckpt = "yainage90/fashion-image-feature-extractor"
encoder_config = SwinConfig.from_pretrained(ckpt)
encoder_image_processor = AutoImageProcessor.from_pretrained(ckpt)


class ImageEncoder(nn.Module, PyTorchModelHubMixin):
    def __init__(self):
        super(ImageEncoder, self).__init__()
        self.swin = SwinModel(config=encoder_config)
        self.embedding_layer = nn.Linear(encoder_config.hidden_size, 128)

    def forward(self, image_tensor):
        # Pool the Swin features, project to 128 dimensions, and L2-normalize.
        features = self.swin(image_tensor).pooler_output
        embeddings = self.embedding_layer(features)
        embeddings = F.normalize(embeddings, p=2, dim=1)
        return embeddings


encoder = ImageEncoder.from_pretrained(ckpt).to(device)
encoder.eval()

transform = transforms.Compose([
    transforms.Resize((encoder_config.image_size, encoder_config.image_size)),
    transforms.ToTensor(),
    transforms.Normalize(mean=encoder_image_processor.image_mean, std=encoder_image_processor.image_std),
])

image = Image.open('').convert('RGB')  # put the path to your image here
image = transform(image)

with torch.no_grad():
    embedding = encoder(image.unsqueeze(0).to(device)).cpu().numpy()
```

![sample_image](sample_image.png)
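
For reference, the in-batch contrastive objective described above can be sketched as follows. This is a minimal illustration rather than the actual training code: the temperature value, the symmetric (two-way) form of the loss, and the assumption that a single encoder embeds both anchors and positives are illustrative choices, not details taken from the training setup.

```python
import torch
import torch.nn.functional as F


def in_batch_contrastive_loss(anchor_emb, positive_emb, temperature=0.07):
    """CLIP-style in-batch contrastive (InfoNCE) loss sketch.

    anchor_emb, positive_emb: (batch_size, 128) L2-normalized embeddings where
    row i of positive_emb is the positive for row i of anchor_emb. Every other
    row in the batch acts as a negative. The temperature and the symmetric
    formulation are assumptions for illustration.
    """
    logits = anchor_emb @ positive_emb.T / temperature              # (B, B) similarity matrix
    targets = torch.arange(anchor_emb.size(0), device=anchor_emb.device)
    loss_anchor = F.cross_entropy(logits, targets)                  # anchors -> positives
    loss_positive = F.cross_entropy(logits.T, targets)              # positives -> anchors
    return (loss_anchor + loss_positive) / 2
```

During training, `anchor_emb` and `positive_emb` would be produced by passing the detected product crops and the product thumbnails, respectively, through the encoder.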
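
Because the embeddings are L2-normalized, cosine similarity reduces to a dot product, so visual search over a catalog is a single matrix multiplication. The `catalog_embeddings.npy` file and variable names below are placeholders for illustration; the catalog embeddings are assumed to have been extracted with the same encoder as in the usage snippet above.

```python
import numpy as np

# Embedding of the query image from the usage snippet above: shape (1, 128).
query_embedding = embedding[0]

# Precomputed embeddings for the catalog images, shape (N, 128).
# The file name is a placeholder.
catalog_embeddings = np.load("catalog_embeddings.npy")

# Embeddings are L2-normalized, so the dot product equals cosine similarity.
similarities = catalog_embeddings @ query_embedding

top_k = 5
top_indices = np.argsort(-similarities)[:top_k]
print(top_indices, similarities[top_indices])
```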