jina-clip-v2 / README.md
metadata
library_name: transformers
license: cc-by-nc-4.0
tags:
  - xlm-roberta
  - eva02
  - clip
  - feature-extraction
  - sentence-similarity
  - retrieval
  - multimodal
  - multi-modal
  - crossmodal
  - cross-modal
  - mteb
  - clip-benchmark
  - vidore
  - transformers
  - sentence-transformers
  - onnx
  - safetensors
  - transformers.js
language:
  - multilingual
  - ar
  - bn
  - da
  - de
  - el
  - en
  - es
  - fi
  - fr
  - hi
  - id
  - it
  - ja
  - ka
  - ko
  - lv
  - nl
  - 'no'
  - pl
  - pt
  - ro
  - ru
  - sk
  - sv
  - th
  - tr
  - uk
  - ur
  - vi
  - zh
inference: false




The embedding set trained by Jina AI.

Jina CLIP: your CLIP model is also your text retriever!

Intended Usage & Model Info

jina-clip-v2 is a state-of-the-art multilingual and multimodal (text-image) embedding model.

jina-clip-v2 is a successor to the jina-clip-v1 model and brings new features and capabilities, such as:

  • support for multiple languages - the text tower now supports 30 languages, including en, zh, de, ar, hi, and es
  • embedding truncation on both image and text vectors - both towers are trained using Matryoshka Representation Learning, which enables truncating the output vectors and, as a result, reducing computation and storage costs
  • visual document retrieval performance boost - with an image resolution of 384 (compared to 224 on jina-clip-v1), the image tower can now capture finer visual details. This, along with a more diverse training set, enables the model to perform much better on visual document retrieval tasks, as evidenced by the performance gains over jina-clip-v1 on the ViDoRe Benchmark
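
The embedding truncation described above can be pictured as keeping the leading components of each vector and re-normalizing. The following is a minimal NumPy sketch, not the model's internal implementation: `truncate_embeddings` is a hypothetical helper, and the random vectors merely stand in for real full-size (1024-dimensional) model outputs.

```python
import numpy as np


def truncate_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row and L2-normalize the result.

    Hypothetical helper for illustration; it is not part of jina-clip-v2.
    """
    if not 0 < dim <= embeddings.shape[1]:
        raise ValueError(f"dim must be in 1..{embeddings.shape[1]}, got {dim}")
    sliced = embeddings[:, :dim]
    norms = np.linalg.norm(sliced, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return sliced / np.clip(norms, 1e-12, None)


# Random stand-ins for full-size embeddings (assumed to be 1024-dimensional).
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 1024)).astype(np.float32)

small = truncate_embeddings(full, 512)
print(small.shape)                    # (4, 512)
print(np.linalg.norm(small, axis=1))  # each row has unit length, up to float error
```

Truncating from 1024 to 512 dimensions halves the storage needed per vector. Re-normalizing after slicing keeps dot products interpretable as cosine similarities.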

Like its predecessor, jina-clip-v2 bridges the gap between text-to-text and cross-modal retrieval. With a single vector space, jina-clip-v2 offers state-of-the-art performance on both tasks. This dual capability makes it an excellent tool for multimodal retrieval-augmented generation (MuRAG) applications, enabling seamless text-to-text and text-to-image search within a single model.
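
One way to picture the single vector space: text documents and images can sit in one candidate pool and be ranked against a text query with the same similarity function. The sketch below uses small hand-made vectors in place of real embeddings; `rank_candidates` and the candidate labels are hypothetical and only illustrate the idea.

```python
import numpy as np


def rank_candidates(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Return candidate indices sorted by cosine similarity to the query, best first."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(-scores)


# Toy 3-dim vectors standing in for embeddings of a mixed text/image corpus.
labels = ["text: 'A blue cat'", "image: red cat photo", "text: 'Stock prices'"]
candidates = np.array([
    [0.9, 0.1, 0.0],
    [0.6, 0.4, 0.0],
    [0.0, 0.1, 0.9],
])
query = np.array([1.0, 0.0, 0.0])  # stand-in for an embedded text query

for idx in rank_candidates(query, candidates):
    print(labels[idx])
```

With real embeddings, text-text scores tend to be higher than text-image scores (see the FAQ), so a mixed pool may need the scores merged or normalized before ranking.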

Data & Parameters

Check out our paper. Updated technical report for v2 coming soon!

Usage

  1. The easiest way to start using jina-clip-v2 is via Jina AI's Embeddings API.
  2. Alternatively, you can use the model directly via the transformers or sentence-transformers packages.
# !pip install transformers einops timm pillow
from transformers import AutoModel

# Initialize the model
model = AutoModel.from_pretrained('jinaai/jina-clip-v2', trust_remote_code=True)

# Sentences
sentences = ['A blue cat', 'A red cat']

# Public image URLs
image_urls = [
    'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
    'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
]

# Choose a matryoshka dimension, set to None to get the full 1024-dim vectors
truncate_dim = 512

# Encode text and images
text_embeddings = model.encode_text(sentences, truncate_dim=truncate_dim)
image_embeddings = model.encode_image(image_urls, truncate_dim=truncate_dim)  # also accepts PIL.Image objects, local filenames, and data URIs

# Compute similarities
print(text_embeddings[0] @ text_embeddings[1].T) # text embedding similarity
print(text_embeddings[0] @ image_embeddings[0].T) # text-image cross-modal similarity
print(text_embeddings[0] @ image_embeddings[1].T) # text-image cross-modal similarity
print(text_embeddings[1] @ image_embeddings[0].T) # text-image cross-modal similarity
print(text_embeddings[1] @ image_embeddings[1].T) # text-image cross-modal similarity

or via sentence-transformers:

# !pip install sentence-transformers 
from sentence_transformers import SentenceTransformer

# Initialize the model
model = SentenceTransformer('jinaai/jina-clip-v2', trust_remote_code=True)

# Sentences
sentences = ['A blue cat', 'A red cat']

# Public image URLs
image_urls = [
    'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
    'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
]

text_embeddings = model.encode(sentences)
image_embeddings = model.encode(image_urls)

JavaScript developers can use Jina CLIP via the transformers.js library. Note that to use this model, you need to install transformers.js v3 from source using npm install xenova/transformers.js#v3.

import { AutoTokenizer, CLIPTextModelWithProjection, AutoProcessor, CLIPVisionModelWithProjection, RawImage, cos_sim } from '@xenova/transformers';

// Load tokenizer and text model
const tokenizer = await AutoTokenizer.from_pretrained('jinaai/jina-clip-v2');
const text_model = await CLIPTextModelWithProjection.from_pretrained('jinaai/jina-clip-v2');

// Load processor and vision model
const processor = await AutoProcessor.from_pretrained('Xenova/clip-vit-base-patch32');
const vision_model = await CLIPVisionModelWithProjection.from_pretrained('jinaai/jina-clip-v2');

// Run tokenization
const texts = ['A blue cat', 'A red cat'];
const text_inputs = tokenizer(texts, { padding: true, truncation: true });

// Compute text embeddings
const { text_embeds } = await text_model(text_inputs);

// Read images and run processor
const urls = [
    'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
    'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
];
const image = await Promise.all(urls.map(url => RawImage.read(url)));
const image_inputs = await processor(image);

// Compute vision embeddings
const { image_embeds } = await vision_model(image_inputs);

// Compute similarities
console.log(cos_sim(text_embeds[0].data, text_embeds[1].data)) // text embedding similarity
console.log(cos_sim(text_embeds[0].data, image_embeds[0].data)) // text-image cross-modal similarity
console.log(cos_sim(text_embeds[0].data, image_embeds[1].data)) // text-image cross-modal similarity
console.log(cos_sim(text_embeds[1].data, image_embeds[0].data)) // text-image cross-modal similarity
console.log(cos_sim(text_embeds[1].data, image_embeds[1].data)) // text-image cross-modal similarity

Performance

Text-Image Retrieval

Coming soon!

Text-Text Retrieval

Coming soon!

Contact

Join our Discord community and chat with other community members about ideas.

Citation

If you find jina-clip-v2 useful in your research, please cite the following paper:

@misc{2405.20204,
    Author = {Andreas Koukounas and Georgios Mastrapas and Michael Günther and Bo Wang and Scott Martens and Isabelle Mohr and Saba Sturua and Mohammad Kalim Akram and Joan Fontanals Martínez and Saahil Ognawala and Susana Guzman and Maximilian Werk and Nan Wang and Han Xiao},
    Title = {Jina CLIP: Your CLIP Model Is Also Your Text Retriever},
    Year = {2024},
    Eprint = {arXiv:2405.20204},
}

FAQ

I encountered the following error. What should I do?

ValueError: The model class you are passing has a `config_class` attribute that is not consistent with the config class you passed (model has <class 'transformers_modules.jinaai.jina-clip-implementation.7f069e2d54d609ef1ad2eb578c7bf07b5a51de41.configuration_clip.JinaCLIPConfig'> and you passed <class 'transformers_modules.jinaai.jina-clip-implementation.7f069e2d54d609ef1ad2eb578c7bf07b5a51de41.configuration_cli.JinaCLIPConfig'>. Fix one of those so they match!

There was a bug in the Transformers library between versions 4.40.x and 4.41.1. To avoid it, update transformers to >4.41.2 or downgrade to <=4.40.0.
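
To check an installed version against the recommendation above, a small helper along these lines may be handy. It is a sketch only: `parse_version` and `is_recommended_transformers` are hypothetical names, the bounds are copied from the sentence above, and pre-release suffixes such as `.dev0` are ignored.

```python
import re


def parse_version(version: str) -> tuple:
    """Extract (major, minor, patch) from a string such as '4.41.2' or '4.45.0.dev0'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def is_recommended_transformers(version: str) -> bool:
    """True if the version is <=4.40.0 or >4.41.2, following the recommendation above."""
    parsed = parse_version(version)
    return parsed <= (4, 40, 0) or parsed > (4, 41, 2)


for candidate in ["4.40.0", "4.41.1", "4.45.0"]:
    print(candidate, is_recommended_transformers(candidate))
```

With transformers installed, the same check can be run on `transformers.__version__`.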

Given one query, how can I merge its text-text and text-image cosine similarity?

Our empirical study shows that text-text cosine similarity is normally larger than text-image cosine similarity. If you want to merge the two scores, we recommend two approaches:

  1. weighted sum of text-text and text-image similarity:
combined_scores = sim(text, text) + lambda * sim(text, image)  # the optimal lambda depends on your dataset, but lambda=2 is generally a good choice
  2. apply z-score normalization before merging scores:
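# sketch: cos_sim_query_documents and cos_sim_text_images are assumed to be
# NumPy arrays of cosine similarities for one query
import numpy as np

query_document_mean = np.mean(cos_sim_query_documents)
query_document_std = np.std(cos_sim_query_documents)
text_image_mean = np.mean(cos_sim_text_images)
text_image_std = np.std(cos_sim_text_images)

# standardize each score set to zero mean and unit variance
query_document_sim_normalized = (cos_sim_query_documents - query_document_mean) / query_document_std
text_image_sim_normalized = (cos_sim_text_images - text_image_mean) / text_image_std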
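
Both merging strategies can be sketched end to end on toy scores. The similarity values below are made up, and `weighted_sum`, `zscore`, and `zscore_merge` are hypothetical helper names. Adding the two normalized score vectors in the last step is an assumption, since the text above does not say how the normalized scores are combined.

```python
import numpy as np


def weighted_sum(text_text: np.ndarray, text_image: np.ndarray, lam: float = 2.0) -> np.ndarray:
    """Strategy 1: sim(text, text) + lambda * sim(text, image)."""
    return text_text + lam * text_image


def zscore(scores: np.ndarray) -> np.ndarray:
    """Standardize scores to zero mean and unit standard deviation."""
    std = np.std(scores)
    if std == 0:
        raise ValueError("cannot z-score constant scores")
    return (scores - np.mean(scores)) / std


def zscore_merge(text_text: np.ndarray, text_image: np.ndarray) -> np.ndarray:
    """Strategy 2: z-score each score set, then add them (the addition is an assumption)."""
    return zscore(text_text) + zscore(text_image)


# Made-up cosine similarities of one query against three documents,
# each document having a text part and an image part.
cos_sim_query_documents = np.array([0.80, 0.60, 0.70])
cos_sim_text_images = np.array([0.30, 0.10, 0.35])

print(weighted_sum(cos_sim_query_documents, cos_sim_text_images))
print(zscore_merge(cos_sim_query_documents, cos_sim_text_images))
```

Z-scoring puts both score sets on a comparable scale, so neither modality dominates the merged ranking simply because its raw similarities are larger.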