FaceCLIP is a CLIP model post-trained on 80M human face images.
It was trained with the TencentPretrain framework on 8 × A100 GPUs:
```bash
python3 pretrain.py --dataset_path faceclip.pt \
                    --pretrained_model_path models/clip-b32.bin \
                    --output_model_path models/faceclip-b32.bin \
                    --config_path models/clip/base-32_config.json \
                    --vocab_path vocab.json --merges_path merges.txt --tokenizer clip \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --data_processor clip --accumulation_steps 8 --learning_rate 2e-5 \
                    --total_steps 200000 --save_checkpoint_steps 20000 --batch_size 160 --report_steps 500
```
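Assuming `--batch_size` is per GPU (as in TencentPretrain), these settings give an effective global batch size of 160 × 8 GPUs × 8 accumulation steps = 10,240 image-text pairs per optimizer update, over 200,000 updates in total.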
How to use:
```python
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("P01son/FaceCLIP-base-32")
processor = CLIPProcessor.from_pretrained("P01son/FaceCLIP-base-32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)      # take the softmax to get the label probabilities
```
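Since the model is tuned on face data, a common use is extracting face embeddings for retrieval or clustering. Below is a minimal sketch using the standard `transformers` CLIP API; `face.jpg` is a placeholder for your own face crop:

```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("P01son/FaceCLIP-base-32")
processor = CLIPProcessor.from_pretrained("P01son/FaceCLIP-base-32")

# "face.jpg" is a placeholder path to a local face crop
image = Image.open("face.jpg")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_features = model.get_image_features(**inputs)  # shape (1, 512) for the base-32 architecture

# L2-normalize before comparing embeddings with cosine similarity
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
```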