5CD-AI/Vintern-1B-v3_5

@@ -13,34 +13,46 @@ pipeline_tag: image-text-to-text
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6336b5c831efcb5647f00170/-G297bBqMzYvTbD6_Bkd9.png)
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/6336b5c831efcb5647f00170/DrUCZuXuMz47uVU4zqnJ4.png)
-## Vintern-1B-v2 ❄️ (Viet-InternVL2-1B-v2) - The LLaVA 🌋 Challenger
-We are excited to introduce  **Vintern-1B-v2** the Vietnamese 🇻🇳 multimodal model that combines the advanced Vietnamese language model [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct)[1] with the latest visual model, [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px)[2], CVPR 2024. This model excels in tasks such as OCR-VQA, Doc-VQA, and Chart-VQA,... With only 1 billion parameters, it is **4096 context length** finetuned from the [Viet-InternVL2-1B](https://huggingface.co/5CD-AI/Viet-InternVL2-1B) model on over 3 million specialized image-question-answer pairs for optical character recognition 🔍, text recognition 🔤, document extraction 📑, and general VQA. The model can be integrated into various on-device applications 📱, demonstrating its versatility and robust capabilities.
-[**\[🤗 HF Demo\]**](https://huggingface.co/spaces/khang119966/Vintern-v2-Demo)
-The special thing is that our model can be easily finetuned with a T4 GPU on Google Colab by following the instructions provided at the end of this section.
-## Model Details
-|      Model Name      |                                     Vision Part                                     |                                        Language Part                                         |
-| :------------------: | :---------------------------------------------------------------------------------: | :------------------------------------------------------------------------------------------: |
-|      Vintern-1B-v2      |    [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px)    |            [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct)            |
-Vintern-1B-v2 is a multimodal large language model series, featuring models of various sizes. For each size, we release instruction-tuned models optimized for multimodal tasks. Vintern-1B-v2 consists of [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px), an MLP projector, and [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct).
-## Training details 📚
-The fine-tuning dataset was meticulously sampled in part from the following datasets:
-[Viet-OCR-VQA 📚](https://huggingface.co/datasets/5CD-AI/Viet-OCR-VQA), [Viet-Doc-VQA 📄](https://huggingface.co/datasets/5CD-AI/Viet-Doc-VQA), [Viet-Doc-VQA-II 📑](https://huggingface.co/datasets/5CD-AI/Viet-Doc-VQA-II), [Vista 🖼️](https://huggingface.co/datasets/Vi-VLM/Vista), [Viet-Receipt-VQA 🧾](https://huggingface.co/datasets/5CD-AI/Viet-Receipt-VQA), [Viet-Sketches-VQA ✏️](https://huggingface.co/datasets/5CD-AI/Viet-Sketches-VQA), [Viet-Geometry-VQA 📐](https://huggingface.co/datasets/5CD-AI/Viet-Geometry-VQA), [Viet-Wiki-Handwriting ✍️](https://huggingface.co/datasets/5CD-AI/Viet-Wiki-Handwriting), [Viet-ComputerScience-VQA 💻](https://huggingface.co/datasets/5CD-AI/Viet-ComputerScience-VQA), [Viet-Handwriting-gemini-VQA 🖋️](https://huggingface.co/datasets/5CD-AI/Viet-Handwriting-gemini-VQA), [Viet-Menu-gemini-VQA 🍽️](https://huggingface.co/datasets/5CD-AI/Viet-Menu-gemini-VQA), [Viet-Vintext-gemini-VQA 📜](https://huggingface.co/datasets/5CD-AI/Viet-Vintext-gemini-VQA), [Viet-OpenViVQA-gemini-VQA 🧠](https://huggingface.co/datasets/5CD-AI/Viet-OpenViVQA-gemini-VQA), [Viet-Resume-VQA 📃](https://huggingface.co/datasets/5CD-AI/Viet-Resume-VQA), [Viet-ViTextVQA-gemini-VQA 📑](https://huggingface.co/datasets/5CD-AI/Viet-ViTextVQA-gemini-VQA)
 ## Benchmarks 📈
 ## Examples

 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6336b5c831efcb5647f00170/-G297bBqMzYvTbD6_Bkd9.png)
+# Vintern-1B-v3.5 ❄️ (Viet-InternVL2-1B-v3.5) - The Ultimate Multimodal Solution 🌏
+We are thrilled to announce **Vintern-1B-v3.5**, the latest version in the Vintern series, which improves on v2 on every benchmark reported below. The model is fine-tuned from **InternVL2_5-1B**, which already performs well on Vietnamese tasks because the InternVL 2.5 [1] team included the [Viet-ShareGPT-4o-Text-VQA](https://huggingface.co/datasets/5CD-AI/Viet-ShareGPT-4o-Text-VQA) data in its fine-tuning process.
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/6336b5c831efcb5647f00170/a1V1DA1o4Gf_MJblWTz-L.png)
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/6336b5c831efcb5647f00170/36jb5bgyYCoVKx3NE8Iuv.png)
+To further improve Vietnamese performance while retaining the base model's capabilities on existing English datasets, **Vintern-1B-v3.5** has been fine-tuned on a large amount of Vietnamese-specific data. The resulting model is particularly strong at text recognition, OCR, and understanding Vietnam-specific documents.
+## Key Features 🌟
+- **Top Quality for Vietnamese Texts**
+  Vintern-1B-v3.5 is one of the best models in its class (1B parameters) for understanding and processing Vietnamese documents.
+- **Better Extraction and Understanding**
+  The model handles invoices, legal texts, handwriting, and tables well.
+- **Runs on Affordable Hardware**
+  You can run the model on Google Colab with a T4 GPU, so no expensive hardware is required.
+- **Easy to Fine-tune**
+  The model can be adapted to specific tasks with minimal effort.
 ## Benchmarks 📈
+| Benchmark      | InternVL2_5-1B | Vintern-1B-v2 | Vintern-1B-v3.5 |
+|:--------------:|:--------------:|:-------------:|:---------------:|
+| vi-MTVQA       |      24.8      |     37.4      |      41.9       |
+| DocVQA (test)  |      84.8      |     72.5      |      78.8       |
+| MMMU (val)     |      40.9      |     31.3      |      32.4       |
+| InfoVQA (test) |      56.0      |     38.9      |      46.4       |
+| TextVQA (val)  |      72.0      |     64.0      |      68.2       |
+| ChartQA (test) |      75.9      |     34.1      |      60.0       |
+| OCRBench       |      785       |      628      |       706       |
+| MME (sum)      |      1950      |     1185      |      1346       |
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/6336b5c831efcb5647f00170/DrUCZuXuMz47uVU4zqnJ4.png)
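The per-benchmark gains of v3.5 over v2 can be read off the table above. The following is a minimal sketch that computes them, assuming the scores are copied verbatim from the table. The names `SCORES` and `gains_over_v2` are hypothetical helpers introduced for illustration and are not part of the model's code.

```python
# Scores copied from the benchmark table above, in column order:
# (InternVL2_5 1B, Vintern-1B-v2, Vintern-1B-v3.5)
SCORES = {
    "vi-MTVQA": (24.8, 37.4, 41.9),
    "DocVQA test": (84.8, 72.5, 78.8),
    "MMMU val": (40.9, 31.3, 32.4),
    "InfoVQA test": (56.0, 38.9, 46.4),
    "TextVQA val": (72.0, 64.0, 68.2),
    "ChartQA test": (75.9, 34.1, 60.0),
    "OCRBench": (785, 628, 706),
    "MME sum": (1950, 1185, 1346),
}


def gains_over_v2(scores):
    """Return {benchmark: (absolute gain, relative gain in %)} of v3.5 over v2."""
    result = {}
    for name, (_base, v2, v35) in scores.items():
        absolute = round(v35 - v2, 1)
        relative = round(100 * (v35 - v2) / v2, 1)
        result[name] = (absolute, relative)
    return result


if __name__ == "__main__":
    # Print one line per benchmark: absolute gain and relative gain over v2.
    for name, (absolute, relative) in gains_over_v2(SCORES).items():
        print(f"{name:<14} +{absolute:<6} ({relative:+.1f}%)")
```

Every absolute gain is positive, which matches the claim that v3.5 improves on v2 across all listed benchmarks. Against the base model column, v3.5 scores higher only on vi-MTVQA, the Vietnamese benchmark in the table.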
 ## Examples