trojblue committed on
Commit
ed9f60b
·
1 Parent(s): 33bc1aa

Update README.md

  1. README.md +16 -18
README.md CHANGED
@@ -9,34 +9,32 @@ tags:
  pipeline_tag: image-to-text
  ---

- # BLIP-2, OPT-6.7b, fine-tuned on COCO

- This is a fp16 version of the BLIP-2 model, leveraging [OPT-6.7b](https://huggingface.co/facebook/opt-6.7b) (a large language model with 6.7 billion parameters).
- It was introduced in the paper [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597) by Li et al. and first released in [this repository](https://github.com/salesforce/LAVIS/tree/main/projects/blip2).

- - Refer to the [original model card](https://huggingface.co/Salesforce/blip2-opt-6.7b-coco) for more details about the model description, intended uses, and limitations, as well as instructions for how to use the model on CPU and GPU in different precisions.


- ## Model description

- BLIP-2 consists of 3 models: a CLIP-like image encoder, a Querying Transformer (Q-Former) and a large language model.

- The authors initialize the weights of the image encoder and large language model from pre-trained checkpoints and keep them frozen
- while training the Querying Transformer, which is a BERT-like Transformer encoder that maps a set of "query tokens" to query embeddings,
- which bridge the gap between the embedding space of the image encoder and the large language model.

- The goal for the model is simply to predict the next text token, giving the query embeddings and the previous text.

- <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/blip2_architecture.jpg"
- alt="drawing" width="600"/>

- This allows the model to be used for tasks like:

- - image captioning
- - visual question answering (VQA)
- - chat-like conversations by feeding the image and the previous conversation as prompt to the model


- ### How to use

- For code examples, we refer to the [documentation](https://huggingface.co/docs/transformers/main/en/model_doc/blip-2#transformers.Blip2ForConditionalGeneration.forward.example).
+ # BLIP-2, OPT-6.7b, Fine-tuned on COCO - Unofficial FP16 Version

+ This repository contains an unofficial version of the BLIP-2 model, leveraging [OPT-6.7b](https://huggingface.co/facebook/opt-6.7b), which has been fine-tuned on COCO and converted to FP16 for reduced model size and memory footprint.

+ The original model, BLIP-2, was introduced in the paper [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597) by Li et al. and first released in [this repository](https://github.com/salesforce/LAVIS/tree/main/projects/blip2).

+ For the model description, intended uses, limitations, and instructions on usage with different hardware and precision settings, please refer to the [official model card](https://huggingface.co/Salesforce/blip2-opt-6.7b-coco).

+ ## Unofficial FP16 Version

+ This version of the BLIP-2 model has been converted to use FP16 precision, which reduces the model size and memory requirements. The conversion to FP16 can potentially speed up inference on hardware with FP16 support, although it might slightly affect output quality due to reduced numerical precision.
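As a rough illustration of the size reduction described above, the sketch below estimates weight storage at 4 bytes per parameter (FP32) and 2 bytes per parameter (FP16). It assumes the 6.7 billion parameter count of the OPT-6.7b language model only and ignores the image encoder and Q-Former, so the figures are a lower bound rather than measurements of this checkpoint. The helper name `weight_size_gib` is hypothetical.

```python
# Back-of-the-envelope weight size for the language model alone.
# Assumption: 6.7e9 parameters (OPT-6.7b); image encoder and Q-Former are ignored.
NUM_PARAMS = 6.7e9

# Storage cost of one parameter in each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_size_gib(num_params, dtype):
    """Return the approximate weight size in GiB for the given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in ("fp32", "fp16"):
        print(f"{dtype}: about {weight_size_gib(NUM_PARAMS, dtype):.1f} GiB")
```

Under these assumptions, halving the bytes per parameter halves the weight storage; activation memory and framework overhead are not included.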

+ This unofficial FP16 version is ideal for situations where storage, memory, or computational resources are limited.

+ Please note, this is an **unofficial** repository and is not maintained or endorsed by the original authors of the model. The FP16 conversion was conducted independently, and any potential issues, limitations, or discrepancies with the original model are not the responsibility of the original authors.

+ ### How to use

+ The usage of this FP16 version of the model is similar to the original model. For specific code examples, see the [documentation](https://huggingface.co/docs/transformers/main/en/model_doc/blip-2#transformers.Blip2ForConditionalGeneration.forward.example).
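A minimal sketch of loading a checkpoint in half precision with the `transformers` BLIP-2 classes, assuming `torch`, `transformers`, and `Pillow` are installed and a CUDA GPU is available. `MODEL_ID` is a placeholder, not this repository's actual id, and the helper names `load_fp16` and `caption` are hypothetical.

```python
# Placeholder repository id (assumption): replace with the id of the FP16 checkpoint.
MODEL_ID = "your-namespace/blip2-opt-6.7b-coco-fp16"


def load_fp16(model_id=MODEL_ID, device="cuda"):
    """Load the processor and the model with FP16 weights on the given device."""
    # Heavy imports stay inside the function so importing this module is cheap.
    import torch
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    processor = Blip2Processor.from_pretrained(model_id)
    model = Blip2ForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16
    )
    return processor, model.to(device)


def caption(image, processor, model, max_new_tokens=30):
    """Generate a caption for a PIL image with an already loaded model."""
    import torch

    # Cast the pixel values to FP16 so they match the model weights.
    inputs = processor(images=image, return_tensors="pt").to(
        model.device, torch.float16
    )
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
```

Typical use would be `processor, model = load_fp16()` followed by `caption(Image.open("photo.jpg"), processor, model)`, where the image path is illustrative and `Image` comes from `PIL`.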

+ Please test the performance and accuracy of this FP16 model thoroughly on your specific use case to confirm it meets your needs.

+ This version can be used for tasks like:

+ - image captioning
+ - visual question answering (VQA)
+ - chat-like conversations by feeding the image and the previous conversation as a prompt to the model
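The chat-like usage in the list above amounts to concatenating earlier question/answer turns with the new question into one text prompt, which is then passed to the processor together with the image. The sketch below shows one way to build such a prompt. The `Question: ... Answer:` template is an assumption based on a convention commonly used with OPT-based BLIP-2 checkpoints, and `build_prompt` is a hypothetical helper.

```python
def build_prompt(history, question):
    """Join previous (question, answer) turns and the new question into one prompt.

    history: list of (question, answer) string pairs from earlier turns.
    question: the new question to ask about the image.
    """
    # Each completed turn is rendered as "Question: ... Answer: ...".
    turns = [f"Question: {q} Answer: {a}." for q, a in history]
    # The final turn ends with "Answer:" so the model continues from there.
    turns.append(f"Question: {question} Answer:")
    return " ".join(turns)


if __name__ == "__main__":
    history = [("which city is this?", "singapore")]
    print(build_prompt(history, "why?"))
```

The resulting string would be supplied as the `text` argument alongside `images` when calling the processor, so the model sees both the image and the conversation so far.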

+ *Disclaimer: This is an unofficial version of the model and any potential issues or discrepancies from the official model are not the responsibility of the original authors.*