A fine-tune of Long-CLIP - original model: BeichenZhang/LongCLIP-L

  • ❤️ this CLIP? Help feed it if you can. Besides data, CLIP eats time & expensive electricity of DE. TY! 🤗
  • Want to feed it yourself? All code for fine-tuning and much more is on my GitHub.

  • Note for using Long-CLIP as the Text Encoder with Flux.1, SDXL, Stable Diffusion:

  • Get the ComfyUI Long-CLIP nodes here: https://github.com/SeaArtLab/ComfyUI-Long-CLIP
  • If you don't use Comfy, it's at least a starting point for reverse engineering & applying it to your code! 🤗

🚨 IMPORTANT NOTE for loading with HuggingFace Transformers: 👀

from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"

model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

❌ Error due to a mismatch between the model's 248 tokens and the 77 tokens defined in the Transformers library

👇

Option 1 (simple & worse):

Truncate to 77 tokens: CLIPModel.from_pretrained(model_id, ignore_mismatched_sizes=True)

# Cosine similarities for 77 tokens are WORSE:
# tensor[photo of a cat, picture of a dog, cat, dog] -- image ground truth: cat photo
tensor([[0.16484, 0.0749, 0.1618, 0.0774]], device='cuda:0') 📉

👇

Option 2, proper integration: 💖 RECOMMENDED 💖

  • Solution for implementing 248 tokens / thanks @kk3dmax 🤗

  • A full example script using this solution for Flux.1 inference is on my GitHub
import torch
from transformers import CLIPConfig, CLIPModel, CLIPProcessor

dtype = torch.bfloat16  # assumed here; matches the text encoder dtype set below
model_id = "zer0int/LongCLIP-GmP-ViT-L-14"
config = CLIPConfig.from_pretrained(model_id)
config.text_config.max_position_embeddings = 248  # Long-CLIP context length instead of 77
clip_model = CLIPModel.from_pretrained(model_id, torch_dtype=dtype, config=config)
clip_processor = CLIPProcessor.from_pretrained(model_id, padding="max_length", max_length=248)

# `pipe` is your already loaded pipeline (e.g. Flux.1)
pipe.tokenizer = clip_processor.tokenizer  # Replace with the CLIP tokenizer
pipe.text_encoder = clip_model.text_model  # Replace with the CLIP text encoder
pipe.tokenizer_max_length = 248
pipe.text_encoder.dtype = torch.bfloat16

# Resulting cosine similarities for 248 tokens, padded:
# tensor[photo of a cat, picture of a dog, cat, dog] -- image ground truth: cat photo
tensor([[0.2128, 0.0978, 0.1957, 0.1133]], device='cuda:0') ✅
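
For readers who want to see what the cosine-similarity numbers above measure, here is a minimal sketch in plain Python. It is not the evaluation code behind those tensors: the function names `cosine_similarity` and `rank_labels` and the toy 3-dimensional embeddings are made up for illustration, and real CLIP embeddings would come from the model's image and text encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_labels(image_embedding, text_embeddings, labels):
    """Return (label, similarity) pairs sorted from best to worst match."""
    scores = [cosine_similarity(image_embedding, t) for t in text_embeddings]
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)

# Toy embeddings (hypothetical values, NOT outputs of this model).
image = [0.9, 0.1, 0.2]
texts = [[0.8, 0.2, 0.1], [0.1, 0.9, 0.3]]
ranking = rank_labels(image, texts, ["photo of a cat", "picture of a dog"])
print(ranking[0][0])  # label whose embedding points most nearly the same way as the image
```

A higher value means the text embedding points in nearly the same direction as the image embedding, which is why the correct "cat" labels should score above the "dog" labels.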

Update 12/AUG/2024:

New BEST model, trained with a custom loss with label smoothing. The gain is small for a large, diverse, good-quality dataset, but big relative gains are possible for an overfit-prone fine-tune (small batch size, 1 GPU, narrow dataset of e.g. 'sneakers', etc.)! Fine-tune your model with the provided code for GmP-Smooth: https://github.com/zer0int/Long-CLIP
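
To illustrate the general idea of label smoothing in a CLIP-style contrastive loss, here is a rough sketch in plain Python. It is an assumption-laden illustration, not the GmP-Smooth loss from the repository: the function names, the smoothing value of 0.1, and the toy logits are all hypothetical, and the smoothed target follows the common convention of putting `1 - smoothing` on the matching pair and spreading `smoothing` uniformly over all pairs.

```python
import math

def smoothed_cross_entropy(logits, target_index, smoothing=0.1):
    """Cross-entropy of one row of logits against a label-smoothed target."""
    n = len(logits)
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    loss = 0.0
    for i, lp in enumerate(log_probs):
        # Every class gets smoothing / n; the true class gets (1 - smoothing) on top.
        target = smoothing / n + ((1.0 - smoothing) if i == target_index else 0.0)
        loss -= target * lp
    return loss

def smoothed_contrastive_loss(similarity, smoothing=0.1):
    """Average of image-to-text and text-to-image losses over a square logit matrix."""
    n = len(similarity)
    columns = [[similarity[r][c] for r in range(n)] for c in range(n)]
    image_to_text = sum(smoothed_cross_entropy(similarity[i], i, smoothing) for i in range(n)) / n
    text_to_image = sum(smoothed_cross_entropy(columns[i], i, smoothing) for i in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)

# Toy, already-scaled logits (hypothetical): matching pairs sit on the diagonal.
logits = [[5.0, 1.0, 0.0],
          [0.5, 4.0, 1.0],
          [0.0, 1.5, 6.0]]
plain = smoothed_contrastive_loss(logits, smoothing=0.0)
smooth = smoothed_contrastive_loss(logits, smoothing=0.1)
print(plain, smooth)
```

With smoothing enabled, very confident predictions are penalized slightly, which is one plausible reason the technique helps most on small, narrow datasets where the model would otherwise overfit.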

The fine-tune has an improved ImageNet/ObjectNet accuracy of 0.89 (original Long-CLIP by the authors: ~0.81)**.

Made possible with Geometric Parametrization (GmP):


"Normal" CLIP MLP (multi-layer perceptron):

(mlp): Sequential(
  |-(c_fc): Linear(in_features=1024, out_features=4096, bias=True)
  | (gelu): QuickGELU()
|-}-(c_proj): Linear(in_features=4096, out_features=1024, bias=True)
| | 
| |-- visual.transformer.resblocks.0.mlp.c_fc.weight
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.weight
|---- visual.transformer.resblocks.0.mlp.c_proj.bias


GmP CLIP MLP:

Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude

(mlp): Sequential(
  |-(c_fc): GeometricLinear()
  | (gelu): QuickGELU()
|-}-(c_proj): GeometricLinear()
| | 
| |-- visual.transformer.resblocks.0.mlp.c_fc.r
| |-- visual.transformer.resblocks.0.mlp.c_fc.theta
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.r
|---- visual.transformer.resblocks.0.mlp.c_proj.theta
|---- visual.transformer.resblocks.0.mlp.c_proj.bias

(Same thing for [text] transformer.resblocks)
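
The decomposition into 'r' and 'theta' can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the GeometricLinear implementation from the repository: it assumes the norm is taken per row of the weight matrix (one row per output feature), the helper names `decompose` and `recompose` are hypothetical, and it assumes no row is all zeros.

```python
import math

def decompose(weight):
    """Split each row of a weight matrix into a norm r and a unit direction theta."""
    r = [math.sqrt(sum(w * w for w in row)) for row in weight]
    theta = [[w / norm for w in row] for row, norm in zip(weight, r)]
    return r, theta

def recompose(r, theta):
    """Rebuild a plain weight matrix: each row is r times its re-normalized direction."""
    weight = []
    for norm, direction in zip(r, theta):
        length = math.sqrt(sum(d * d for d in direction))
        weight.append([norm * d / length for d in direction])
    return weight

# Toy 2x2 "pre-trained" weight matrix (made up for illustration).
w = [[3.0, 4.0], [0.0, 2.0]]
r, theta = decompose(w)       # r holds the row norms, theta the unit directions
w_back = recompose(r, theta)  # same values as w, up to floating-point error
print(r, theta, w_back)
```

During fine-tuning, 'r' and 'theta' would be the trainable parameters in place of the weight; recomposing them afterwards yields an ordinary weight matrix again.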

✅ The model / state_dict I am sharing was converted back to .weight after fine-tuning, so it can be used in the same manner as any state_dict, e.g. for use with ComfyUI as the SDXL / SD3 Text Encoder using SeaArtLab/ComfyUI-Long-CLIP custom nodes! 🤗

** For details on training and those numbers / the eval, or for just fine-tuning the model yourself, see: https://github.com/zer0int/Long-CLIP

@article{zhang2024longclip,
        title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
        author={Beichen Zhang and Pan Zhang and Xiaoyi Dong and Yuhang Zang and Jiaqi Wang},
        journal={arXiv preprint arXiv:2403.15378},
        year={2024}
}

Pre-trained CLIP model by OpenAI, License: MIT License
