how to infer text-img pair demo?

#2
by WinstonDeng - opened

With the official OpenAI CLIP text model, the text embedding dimension is 768, which does not match the 1280-dimensional LLM2CLIP image embedding.

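from transformers import CLIPModel, CLIPTokenizer
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14-336")
text_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336").to(device)
inputs = tokenizer(text=texts, padding=True, return_tensors="pt").to(device)
text_features = text_model.get_text_features(**inputs)  # [1, 768]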
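The mismatch matters because cosine similarity is only defined for vectors of equal length, so a 768-dimensional text feature cannot be scored against a 1280-dimensional image feature. A minimal pure-Python sketch of that failure follows; the helper name `cosine_similarity` and the constant toy vectors are illustrative assumptions, not part of either library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Stand-ins for the real features; only the lengths matter here.
text_embedding = [1.0] * 768    # OpenAI CLIP text feature size
image_embedding = [1.0] * 1280  # LLM2CLIP image feature size

try:
    cosine_similarity(text_embedding, image_embedding)
except ValueError as err:
    print(err)
```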
Microsoft org

Hello, we will upload our text model to Hugging Face within the next couple of days and aim to release all the parameters of the text models, adapters, and related components. Previously, we experienced some delays due to precision issues during the Hugging Face conversion process, but we have resolved them and will soon upload all the parameters you might need. We welcome your suggestions and requests and will do our best to update versions to meet your requirements, making it more convenient for everyone to conduct research.

Microsoft org

@WinstonDeng We have updated the caption contrastive fine-tuned version of Llama3-8B-CC (https://huggingface.co/microsoft/LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned) to assist with your retrieval experiments and training of your own CLIP models. Additionally, the parameters for our adapter and projector have been made available in our OpenAI ViT-L repository (https://huggingface.co/microsoft/LLM2CLIP-Openai-L-14-336). The retrieval testing methods are documented in the model card for reference.

Our tests show retrieval performance exceeding the results reported in the paper, and we encourage you to try it out.

Regarding the EVA series of models, there have been precision mismatches during the conversion to Hugging Face, which are currently being fixed. Updates will be released progressively.

Furthermore, we will provide detailed instructions on how to use LLM2CLIP to fine-tune your own CLIP models in about a week—please stay tuned!

Hi, where is llm2vec? Could you please add instructions on how to install the required packages?

Sorry for the inconvenience. You can simply install llm2vec using pip by running the following command:
pip install llm2vec

Thank you, I figured it out. However, I am not able to run pip install configuration_clip.

ImportError: This modeling file requires the following packages that were not found in your environment: configuration_clip. Run pip install configuration_clip

The issue you're encountering might be related to the version of the transformers library. I've tested the installation on version 4.40.2, and it works without any errors. To ensure compatibility, I've updated the configuration file to include a version restriction for the transformers library.

Update your transformers library to version 4.40.2:
pip install transformers==4.40.2
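To confirm which version is actually active in the environment before retrying, a small check like the following may help. It is only a sketch using the standard library; the helper name `installed_version_matches` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version_matches(package, expected):
    """Return True/False for a version match, or None if the package is missing."""
    try:
        return version(package) == expected
    except PackageNotFoundError:
        return None


status = installed_version_matches("transformers", "4.40.2")
if status is None:
    print("transformers is not installed")
elif status:
    print("transformers 4.40.2 is installed")
else:
    print("transformers is installed, but not at 4.40.2")
```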

Thanks @weiquan. Could you please share a very simple example of getting text features? In the guide, I can only see examples for image features, not text features.

Never mind, this worked. :)

from PIL import Image
from transformers import AutoModel, AutoConfig, AutoTokenizer
from transformers import CLIPImageProcessor
import torch
from llm2vec import LLM2Vec
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
model_name_or_path = "microsoft/LLM2CLIP-Openai-L-14-336" # or /path/to/local/LLM2CLIP-Openai-L-14-336
model = AutoModel.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to('cuda').eval()

llm_model_name = 'microsoft/LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned'
config = AutoConfig.from_pretrained(llm_model_name, trust_remote_code=True)
llm_model = AutoModel.from_pretrained(
    llm_model_name,
    torch_dtype=torch.bfloat16,
    config=config,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(llm_model_name)
llm_model.config._name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct' # Workaround for LLM2VEC
l2v = LLM2Vec(llm_model, tokenizer, pooling_mode="mean", max_length=512, doc_max_length=512)

captions = ["2 cars on street", "1 car on street", "this photo has a bag on woman shoulder"]
image_path = "test.jpg"

image = Image.open(image_path)
input_pixels = processor(images=image, return_tensors="pt").pixel_values.to('cuda')
text_features = l2v.encode(captions, convert_to_tensor=True).to('cuda')

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.get_image_features(input_pixels)
    text_features = model.get_text_features(text_features)

image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
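For readers who want to see what the last few lines compute without loading the models, here is a pure-Python sketch of the same scoring step: L2-normalize both sides, take dot products, scale by 100, and apply a softmax over captions. The two-dimensional toy vectors and the helper names are assumptions for illustration only; real features come from `get_image_features` and `get_text_features`.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def caption_probabilities(image_vec, text_vecs, scale=100.0):
    """Scaled cosine similarity of one image against each caption, then softmax."""
    image_unit = l2_normalize(image_vec)
    logits = []
    for text_vec in text_vecs:
        text_unit = l2_normalize(text_vec)
        logits.append(scale * sum(a * b for a, b in zip(image_unit, text_unit)))
    return softmax(logits)


# Toy stand-ins: one image vector and three caption vectors.
image_vec = [3.0, 4.0]
text_vecs = [[3.0, 4.0], [4.0, 3.0], [-3.0, -4.0]]
probs = caption_probabilities(image_vec, text_vecs)
print("Label probs:", probs)
```

Because of the factor of 100, small differences in cosine similarity turn into large differences in probability: here the first caption (cosine 1.0) takes almost all of the mass over the second (cosine 0.96).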
