---
license: apache-2.0
pipeline_tag: feature-extraction
tags:
- clip
- vision
datasets:
- Ziyang/yfcc15m
- conceptual_captions
---

<h1 align="center">UForm</h1>
<h3 align="center">
Pocket-Sized Multimodal AI<br/>
For Content Understanding and Generation<br/>
In Python, JavaScript, and Swift<br/>
</h3>

---

The `uform3-image-text-english-base` UForm model is a tiny vision and English-language encoder that maps images and text into a shared vector space.
This model produces up to __256-dimensional embeddings__ and consists of:

* Text encoder: 4-layer BERT for up to 64 input tokens.
* Visual encoder: ViT-B/16 for images of 224 x 224 resolution.

Unlike most CLIP-like multimodal models, this model shares 2 layers between the text and visual encoders, which allows for more data- and parameter-efficient training.
Also unlike most models, UForm provides checkpoints compatible with PyTorch, ONNX, and CoreML, covering the vast majority of AI-capable devices, with pre-quantized weights and inference code.
If you need a larger, more accurate, or multilingual model, check our [HuggingFace Hub](https://huggingface.co/unum-cloud/).
For more details on running the model, check out the [UForm GitHub repository](https://github.com/unum-cloud/uform/).

## Evaluation

On text-to-image retrieval, the model reaches 94.9% Recall@10 on zero-shot Flickr:

| Dataset | Recall@1 | Recall@5 | Recall@10 |
| :-------- | -------: | -------: | --------: |
| Zero-Shot Flickr | 0.727 | 0.915 | 0.949 |
| MS-COCO ¹ | 0.510 | 0.761 | 0.838 |

> ¹ Note that the MS-COCO train split was present in the training data, so the MS-COCO results are not zero-shot.

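For readers unfamiliar with the metric, the sketch below shows one common way to compute Recall@K for text-to-image retrieval: the fraction of text queries whose matching image appears among the K highest-scoring images. It is an illustration only, not the evaluation script behind the table. The function name `recall_at_k`, the variable names, and the toy similarity scores are all assumptions made up for this example.

```python
def recall_at_k(similarities, correct_indices, k):
    """Fraction of queries whose correct item appears among the top-k results.

    similarities[i][j] is the score between text query i and image j.
    correct_indices[i] is the index of the image that matches query i.
    """
    hits = 0
    for scores, correct in zip(similarities, correct_indices):
        # Rank image indices by descending similarity and keep the best k.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        hits += correct in ranked[:k]
    return hits / len(correct_indices)


# Toy scores for 3 text queries against 4 images (illustrative values only).
toy_similarities = [
    [0.9, 0.1, 0.3, 0.2],  # query 0: correct image 0 is ranked 1st
    [0.2, 0.4, 0.8, 0.1],  # query 1: correct image 1 is ranked 2nd
    [0.5, 0.6, 0.1, 0.7],  # query 2: correct image 2 is ranked 4th
]
toy_correct = [0, 1, 2]

recall_1 = recall_at_k(toy_similarities, toy_correct, k=1)
recall_2 = recall_at_k(toy_similarities, toy_correct, k=2)
recall_4 = recall_at_k(toy_similarities, toy_correct, k=4)
```

With real data, the similarity matrix would typically hold cosine similarities between text and image embeddings.
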

## Installation

```bash
pip install "uform[torch,onnx]"
```

## Usage

To load the model:

```python
from uform import get_model, Modality

import requests
from io import BytesIO
from PIL import Image

# Load the text and image encoders together with their preprocessors.
model_name = 'unum-cloud/uform3-image-text-english-base'
modalities = [Modality.TEXT_ENCODER, Modality.IMAGE_ENCODER]
processors, models = get_model(model_name, modalities=modalities)

# Both returned mappings are keyed by modality.
model_text = models[Modality.TEXT_ENCODER]
model_image = models[Modality.IMAGE_ENCODER]
processor_text = processors[Modality.TEXT_ENCODER]
processor_image = processors[Modality.IMAGE_ENCODER]
```

To encode the content:

```python
text = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background'
image_url = 'https://media-cdn.tripadvisor.com/media/photo-s/1b/28/6b/53/lovely-armenia.jpg'
# Download the image and decode it into a PIL image.
image = Image.open(BytesIO(requests.get(image_url).content))

# Preprocess the raw inputs, then encode them into features and embeddings.
image_data = processor_image(image)
text_data = processor_text(text)
image_features, image_embedding = model_image.encode(image_data, return_features=True)
text_features, text_embedding = model_text.encode(text_data, return_features=True)
```

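Because both encoders map into a shared vector space, a natural next step is to compare an image embedding with a text embedding. The sketch below uses cosine similarity, a common choice for CLIP-like models, though this card does not prescribe a specific metric. The helper `cosine_similarity` and the short stand-in vectors are hypothetical. With the real model you would pass `image_embedding` and `text_embedding` instead, assuming they can be flattened into plain lists of floats.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for embeddings (real ones have up to 256 dimensions).
toy_image_embedding = [0.2, 0.1, 0.7, 0.4]
toy_text_embedding = [0.25, 0.05, 0.6, 0.5]       # caption describing the image
toy_unrelated_embedding = [-0.7, 0.4, -0.1, 0.2]  # caption about something else

match_score = cosine_similarity(toy_image_embedding, toy_text_embedding)
mismatch_score = cosine_similarity(toy_image_embedding, toy_unrelated_embedding)
```

A higher score means the two inputs lie closer together in the shared space, so `match_score` should exceed `mismatch_score` for a well-matched pair.
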