DongfuJiang committed "Update README.md"

README.md CHANGED
@@ -1,57 +1,72 @@
Removed: the previous auto-generated model card stub ("...should probably proofread and complete it, then remove this comment. -->"), including a "## Training procedure" section that listed these hyperparameters:

- seed: 42
- distributed_type: multi-GPU
- num_devices: 32
- gradient_accumulation_steps: 4
- total_train_batch_size: 128
- total_eval_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.03
- num_epochs: 1.0

Added: the new model card below.
---
tags:
- Mantis
- VLM
- LMM
- Multimodal LLM
- bakllava
base_model: llava-hf/bakLlava-v1-hf
model-index:
- name: Mantis-bakllava-7b
  results: []
license: apache-2.0
language:
- en
---

# Mantis: Interleaved Multi-Image Instruction Tuning

**Mantis** is a multimodal conversational AI model that can chat with users about images and text. It is optimized for multi-image reasoning, where interleaved text and images can be used to generate responses.

Mantis is trained on the newly curated **Mantis-Instruct**, a large-scale multi-image QA dataset that covers a variety of multi-image reasoning tasks.

Mantis is an active work in progress. Check our [Blog](https://tiger-ai-lab.github.io/Blog/mantis) for more details!

| [Demo](https://huggingface.co/spaces/TIGER-Lab/Mantis) | [Blog](https://tiger-ai-lab.github.io/Blog/mantis) | [Github](https://github.com/TIGER-AI-Lab/Mantis) | [Models](https://huggingface.co/collections/TIGER-Lab/mantis-6619b0834594c878cdb1d6e4) |

![Mantis](https://raw.githubusercontent.com/TIGER-AI-Lab/Mantis/main/docs/assets/images/overall_barchart.jpeg)

## Inference

You can install the Mantis GitHub code as a Python package:

```bash
pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
```
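The model-loading code below passes `attn_implementation="flash_attention_2"`, which also requires the `flash-attn` package. Assuming a CUDA build environment is available, one way to install it is:

```bash
pip install flash-attn --no-build-isolation
```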
Then run inference with the code here: [examples/run_mantis.py](https://github.com/TIGER-AI-Lab/Mantis/blob/main/examples/run_mantis_hf.py)

```python
from mantis.models.mllava import chat_mllava
from PIL import Image
import torch

# load the two images to compare
image1 = "image1.jpg"
image2 = "image2.jpg"
images = [Image.open(image1), Image.open(image2)]

# load processor and model
from mantis.models.mllava import MLlavaProcessor, LlavaForConditionalGeneration
processor = MLlavaProcessor.from_pretrained("TIGER-Lab/Mantis-bakllava-7b")
model = LlavaForConditionalGeneration.from_pretrained("TIGER-Lab/Mantis-bakllava-7b", device_map="auto", torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2")

# chat: each <image> placeholder is bound, in order, to one image in `images`
text = "<image> <image> What's the difference between these two images? Please describe as much as you can."
response, history = chat_mllava(text, images, model, processor)

print("USER: ", text)
print("ASSISTANT: ", response)
# The image on the right has a larger number of wallets displayed compared to the image on the left. The wallets in the right image are arranged in a grid pattern, while the wallets in the left image are displayed in a more scattered manner. The wallets in the right image have various colors, including red, purple, and brown, while the wallets in the left image are primarily brown.

# follow-up turn: pass `history` back in to continue the conversation
text = "How many items are there in image 1 and image 2 respectively?"
response, history = chat_mllava(text, images, model, processor, history=history)

print("USER: ", text)
print("ASSISTANT: ", response)
# There are two items in image 1 and four items in image 2.
```
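If `flash-attn` is not available in your environment, a minimal fallback (using only standard `from_pretrained` arguments) is to load the model without it:

```python
# load without flash-attention; bf16 support is assumed, use float16 otherwise
model = LlavaForConditionalGeneration.from_pretrained(
    "TIGER-Lab/Mantis-bakllava-7b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```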
Or, you can run the model without relying on the Mantis code, using pure Hugging Face transformers. See [examples/run_mantis_hf.py](https://github.com/TIGER-AI-Lab/Mantis/blob/main/examples/run_mantis_hf.py) for details.
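For that route, here is a minimal sketch using only `transformers`. The `USER:`/`ASSISTANT:` prompt layout and the generation settings are our assumptions based on the usual llava-hf convention, so treat the linked script as authoritative:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

processor = AutoProcessor.from_pretrained("TIGER-Lab/Mantis-bakllava-7b")
model = LlavaForConditionalGeneration.from_pretrained(
    "TIGER-Lab/Mantis-bakllava-7b", torch_dtype=torch.bfloat16, device_map="auto"
)

images = [Image.open("image1.jpg"), Image.open("image2.jpg")]
# one <image> token per image, interleaved into the user turn
prompt = "USER: <image> <image>\nWhat's the difference between these two images? ASSISTANT:"
inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# strip the prompt tokens before decoding the reply
print(processor.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```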
## Training

Training code will be released soon.