Dua-Vision-Base
A Vision Encoder-Decoder model that doesn’t just caption images but generates questions and possible answers based on what it “sees.” Using ViT as the encoder and BART as the decoder, it’s built for image-based QA without the fluff.
Translation: feed it an image, and get back a useful question-answer pair. Perfect for creating and synthesizing data in image QA tasks. It’s one model, two tasks, and a lot of potential!
#LLMs #VisionTransformer #ImageQA #AI
Dua-Vision-Base is a Vision Encoder-Decoder model. This model integrates Vision Transformer (ViT) as the encoder and BART as the decoder, enabling effective processing and contextual interpretation of visual inputs alongside natural language generation.
Model Architecture
- Encoder: ViT (Vision Transformer), initialized from Google's vit-base-patch16-224-in21k checkpoint.
- Decoder: BART (Bidirectional and Auto-Regressive Transformers), initialized from the facebook/bart-base checkpoint.
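To make the encoder-decoder wiring concrete, here is a minimal sketch of how a ViT encoder and a BART decoder can be combined in transformers. It is not the author's training setup: the layer sizes below are tiny, made-up values chosen so the model builds from random weights without downloading anything, and the input image and token ids are arbitrary.

```python
import torch
from transformers import (
    BartConfig,
    ViTConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Tiny, illustrative sizes (NOT the real ViT-base / BART-base dimensions),
# chosen so the model builds quickly from random weights.
encoder_config = ViTConfig(
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    image_size=32,
    patch_size=16,
)
decoder_config = BartConfig(
    vocab_size=100,
    d_model=32,
    encoder_layers=1,
    decoder_layers=1,
    encoder_attention_heads=2,
    decoder_attention_heads=2,
    encoder_ffn_dim=64,
    decoder_ffn_dim=64,
    max_position_embeddings=64,
)

# Combines both configs and marks the BART side as a decoder that
# cross-attends to the ViT encoder output.
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config, decoder_config
)
model = VisionEncoderDecoderModel(config=config)
model.eval()

pixel_values = torch.randn(1, 3, 32, 32)       # one random RGB "image"
decoder_input_ids = torch.tensor([[0, 5, 6]])  # three arbitrary token ids

with torch.no_grad():
    outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)

logits = outputs.logits
print(logits.shape)  # (batch, decoder sequence length, vocabulary size)
```

With real checkpoints, the same pairing is usually created with VisionEncoderDecoderModel.from_encoder_decoder_pretrained, passing the encoder and decoder checkpoint names; whether this model was built exactly that way is not stated here.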
Usage
To use this model with images, you need two components: the ViTImageProcessor for handling visual inputs and the BartTokenizer for processing text prompts. The model is optimized for generating a question and an answer for a given image, with the following inputs and outputs:
Input:
- Images in RGB format (processed via ViTImageProcessor).
- Textual prompts using BartTokenizer for contextual initialization.
Output:
- Textual question & answer generated based on the visual content in the image.
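Because the output is a single string containing both a question and an answer, downstream code usually needs to separate the two. The exact output format is not documented here, so the sketch below rests on an assumption: the question comes first and ends at the first question mark, and everything after it is the answer. The helper name split_question_answer and the sample strings are hypothetical.

```python
from typing import Optional, Tuple


def split_question_answer(text: str) -> Tuple[str, Optional[str]]:
    """Split generated text into (question, answer) at the first '?'.

    Assumption: the question precedes the answer and ends with '?'.
    If no '?' is found, the whole text is returned as the question
    and the answer is None.
    """
    text = text.strip()
    idx = text.find("?")
    if idx == -1:
        return text, None
    question = text[: idx + 1].strip()
    # An empty remainder means no answer followed the question.
    answer = text[idx + 1 :].strip() or None
    return question, answer


# Hypothetical sample string, not real model output.
print(split_question_answer("What color is the car? Red."))
```

If the model's real outputs use a different separator, adjust the split rule after inspecting a few generations.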
Installation
!pip install transformers datasets torch torchvision
How to Load the Model
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, BartTokenizer
# Load model, processor, and tokenizer
model = VisionEncoderDecoderModel.from_pretrained("HV-Khurdula/Dua-Vision-Base")
image_processor = ViTImageProcessor.from_pretrained("HV-Khurdula/Dua-Vision-Base")
tokenizer = BartTokenizer.from_pretrained("HV-Khurdula/Dua-Vision-Base")
Inference Example
Here's a sample usage for generating a question-answer pair for an image:
import requests
from PIL import Image
# Load the image and convert it to RGB, the format the image processor expects
image_url = "https://example.com/image.jpg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
# Generate a question-answer pair using beam search
generated_ids = model.generate(pixel_values, max_length=128, num_beams=5, early_stopping=True)
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print("Generated:", generated_text)
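When the model is used to synthesize image QA data, the generated text for each image has to be stored somewhere. One simple option is one JSON object per line (JSONL). The sketch below is an illustration only: the function name to_jsonl, the field names "image" and "generated", and the sample record are all hypothetical and not part of the model or of any library.

```python
import json
from typing import Iterable, Tuple


def to_jsonl(records: Iterable[Tuple[str, str]]) -> str:
    """Serialize (image reference, generated text) pairs as JSON Lines.

    Field names are illustrative; pick whatever your dataset format needs.
    """
    lines = [
        json.dumps({"image": image_ref, "generated": text}, ensure_ascii=False)
        for image_ref, text in records
    ]
    return "\n".join(lines)


# Hypothetical record, not real model output.
print(to_jsonl([("image.jpg", "What color is the car? Red.")]))
```

Storing the raw generated string keeps the record faithful to what the model produced; any splitting into question and answer fields can happen in a later step.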
Training
The model was trained on a dataset of images paired with conversational prompts. During training, the generated text was conditioned on both the image content and the specific prompts, which improves the contextual relevance of the output. Fine-tuning the model on your target task is highly recommended.
Hyperparameters
- Batch Size: 16
- Learning Rate: 5e-5
- Epochs: 5
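To put these values in perspective when planning a fine-tuning run, the number of optimizer steps follows directly from the batch size and epoch count. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept; the training set size is not given here, so the 10,000 examples used in the call are a made-up figure.

```python
import math

# Values taken from the hyperparameter list above.
BATCH_SIZE = 16
LEARNING_RATE = 5e-5
EPOCHS = 5


def total_optimizer_steps(
    num_examples: int, batch_size: int = BATCH_SIZE, epochs: int = EPOCHS
) -> int:
    """Optimizer steps for a full run, keeping the last partial batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# Hypothetical dataset size: 10,000 / 16 = 625 steps per epoch, times 5 epochs.
print(total_optimizer_steps(10_000))  # 3125
```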
License
This model and its code are released under the terms of the Apache 2.0 license.
Model tree for HV-Khurdula/Dua-Vision-Base
Base model
facebook/bart-large-mnli