khang119966 committed on
Commit b626a37 · verified · 1 Parent(s): 0c5ef86

Update README.md

Files changed (1)
  1. README.md +25 -13
README.md CHANGED
@@ -13,34 +13,46 @@ pipeline_tag: image-text-to-text

  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6336b5c831efcb5647f00170/-G297bBqMzYvTbD6_Bkd9.png)

- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6336b5c831efcb5647f00170/DrUCZuXuMz47uVU4zqnJ4.png)

- ## Vintern-1B-v2 ❄️ (Viet-InternVL2-1B-v2) - The LLaVA 🌋 Challenger

- We are excited to introduce **Vintern-1B-v2** the Vietnamese 🇻🇳 multimodal model that combines the advanced Vietnamese language model [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct)[1] with the latest visual model, [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px)[2], CVPR 2024. This model excels in tasks such as OCR-VQA, Doc-VQA, and Chart-VQA,... With only 1 billion parameters, it is **4096 context length** finetuned from the [Viet-InternVL2-1B](https://huggingface.co/5CD-AI/Viet-InternVL2-1B) model on over 3 million specialized image-question-answer pairs for optical character recognition 🔍, text recognition 🔤, document extraction 📑, and general VQA. The model can be integrated into various on-device applications 📱, demonstrating its versatility and robust capabilities.

- [**\[🤗 HF Demo\]**](https://huggingface.co/spaces/khang119966/Vintern-v2-Demo)

- The special thing is that our model can be easily finetuned with a T4 GPU on Google Colab by following the instructions provided at the end of this section.

- ## Model Details

- | Model Name | Vision Part | Language Part |
- | :------------------: | :---------------------------------------------------------------------------------: | :------------------------------------------------------------------------------------------: |
- | Vintern-1B-v2 | [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) |

- Vintern-1B-v2 is a multimodal large language model series, featuring models of various sizes. For each size, we release instruction-tuned models optimized for multimodal tasks. Vintern-1B-v2 consists of [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px), an MLP projector, and [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct).

- ## Training details 📚

- The fine-tuning dataset was meticulously sampled in part from the following datasets:
- [Viet-OCR-VQA 📚](https://huggingface.co/datasets/5CD-AI/Viet-OCR-VQA), [Viet-Doc-VQA 📄](https://huggingface.co/datasets/5CD-AI/Viet-Doc-VQA), [Viet-Doc-VQA-II 📑](https://huggingface.co/datasets/5CD-AI/Viet-Doc-VQA-II), [Vista 🖼️](https://huggingface.co/datasets/Vi-VLM/Vista), [Viet-Receipt-VQA 🧾](https://huggingface.co/datasets/5CD-AI/Viet-Receipt-VQA), [Viet-Sketches-VQA ✏️](https://huggingface.co/datasets/5CD-AI/Viet-Sketches-VQA), [Viet-Geometry-VQA 📐](https://huggingface.co/datasets/5CD-AI/Viet-Geometry-VQA), [Viet-Wiki-Handwriting ✍️](https://huggingface.co/datasets/5CD-AI/Viet-Wiki-Handwriting), [Viet-ComputerScience-VQA 💻](https://huggingface.co/datasets/5CD-AI/Viet-ComputerScience-VQA), [Viet-Handwriting-gemini-VQA 🖋️](https://huggingface.co/datasets/5CD-AI/Viet-Handwriting-gemini-VQA), [Viet-Menu-gemini-VQA 🍽️](https://huggingface.co/datasets/5CD-AI/Viet-Menu-gemini-VQA), [Viet-Vintext-gemini-VQA 📜](https://huggingface.co/datasets/5CD-AI/Viet-Vintext-gemini-VQA), [Viet-OpenViVQA-gemini-VQA 🧠](https://huggingface.co/datasets/5CD-AI/Viet-OpenViVQA-gemini-VQA), [Viet-Resume-VQA 📃](https://huggingface.co/datasets/5CD-AI/Viet-Resume-VQA), [Viet-ViTextVQA-gemini-VQA 📑](https://huggingface.co/datasets/5CD-AI/Viet-ViTextVQA-gemini-VQA)

  ## Benchmarks 📈

  ## Examples


  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6336b5c831efcb5647f00170/-G297bBqMzYvTbD6_Bkd9.png)

+ # Vintern-1B-v3.5 ❄️ (Viet-InternVL2-1B-v3.5) - The Ultimate Multimodal Solution 🌏
+ We are thrilled to announce **Vintern-1B-v3.5**, the latest version in the Vintern series, offering significant improvements over v2 across all evaluation benchmarks. This model has been fine-tuned from **InternVL-1B-2.5**, which is already strong on Vietnamese tasks because the InternVL 2.5 [1] team used [Viet-ShareGPT-4o-Text-VQA](https://huggingface.co/datasets/5CD-AI/Viet-ShareGPT-4o-Text-VQA) data in its fine-tuning process.

+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6336b5c831efcb5647f00170/a1V1DA1o4Gf_MJblWTz-L.png)

+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6336b5c831efcb5647f00170/36jb5bgyYCoVKx3NE8Iuv.png)

+ To further enhance its performance in Vietnamese while maintaining robust capabilities on existing English datasets, **Vintern-1B-v3.5** has been fine-tuned on a large volume of Vietnamese-specific data. The result is a model that is particularly strong at text recognition, OCR, and understanding Vietnam-specific documents.

+ Key Features 🌟

+ - Top Quality for Vietnamese Texts
+ Vintern-1B-v3.5 is one of the best models in its class (1B parameters) for understanding and processing Vietnamese documents.

+ - Better Extraction and Understanding
+ The model is great at handling invoices, legal texts, handwriting, and tables.

+ - Runs on Affordable Hardware
+ You can run the model on Google Colab with a T4 GPU, making it easy to use without expensive devices.

+ - Easy to Fine-tune
+ The model can be customized for specific tasks with minimal effort.
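
As a rough sanity check on the T4 claim above, the sketch below estimates the memory taken by the model weights alone. The 1B parameter count comes from the text. The nominal 16 GB of T4 memory and the bytes-per-parameter table are assumptions of this sketch, and activations, the KV cache, and framework overhead are not counted, so real usage will be higher.

```python
# Back-of-the-envelope estimate of weight memory; not a measurement.
PARAMS = 1_000_000_000   # "1B parameters", as stated above (rounded)
T4_MEMORY_GB = 16        # assumption: nominal NVIDIA T4 memory

# Assumption: common storage sizes per parameter for each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Return the memory used by the weights alone, in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    gb = weight_memory_gb(PARAMS, dtype)
    share = gb / T4_MEMORY_GB
    print(f"{dtype}: {gb:.2f} GB of weights ({share:.1%} of a T4)")
```

Even at full fp32 precision the weights take a quarter of the card under these assumptions, which is consistent with the model being usable on a free Colab T4.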

  ## Benchmarks 📈

+ | Benchmark | InternVL2_5 1B | Vintern-1B-v2 | Vintern-1B-v3.5 |
+ |:-------------:|:--------------:|:-------------:|:---------------:|
+ | vi-MTVQA | 24.8 | 37.4 | 41.9 |
+ | DocVQAtest | 84.8 | 72.5 | 78.8 |
+ | MMMUval | 40.9 | 31.3 | 32.4 |
+ | InfoVQAtest | 56.0 | 38.9 | 46.4 |
+ | TextVQAval | 72.0 | 64.0 | 68.2 |
+ | ChartQAtest | 75.9 | 34.1 | 60.0 |
+ | OCRBench | 785 | 628 | 706 |
+ | MMEsum | 1950 | 1185 | 1346 |

+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6336b5c831efcb5647f00170/DrUCZuXuMz47uVU4zqnJ4.png)
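
To make the table easier to read, the following sketch computes the per-benchmark change from v2 to v3.5 and lists the rows where v3.5 is ahead of the InternVL2_5 1B column. The scores are copied from the table; the helper names `gains_over_v2` and `beats_base` are made up for this illustration.

```python
# Scores copied from the benchmark table above.
# Tuple order: (InternVL2_5 1B, Vintern-1B-v2, Vintern-1B-v3.5)
SCORES = {
    "vi-MTVQA": (24.8, 37.4, 41.9),
    "DocVQAtest": (84.8, 72.5, 78.8),
    "MMMUval": (40.9, 31.3, 32.4),
    "InfoVQAtest": (56.0, 38.9, 46.4),
    "TextVQAval": (72.0, 64.0, 68.2),
    "ChartQAtest": (75.9, 34.1, 60.0),
    "OCRBench": (785, 628, 706),
    "MMEsum": (1950, 1185, 1346),
}


def gains_over_v2(scores):
    """Map each benchmark to (v3.5 score - v2 score), rounded to one decimal."""
    return {name: round(v35 - v2, 1) for name, (_, v2, v35) in scores.items()}


def beats_base(scores):
    """List benchmarks where v3.5 scores above the InternVL2_5 1B column."""
    return [name for name, (base, _, v35) in scores.items() if v35 > base]


gains = gains_over_v2(SCORES)
for name, delta in gains.items():
    print(f"{name:12s} {delta:+g}")
print("Ahead of InternVL2_5 1B on:", beats_base(SCORES))
```

On these numbers v3.5 improves on v2 in every row, with the largest gain on ChartQAtest, while it is ahead of the InternVL2_5 1B column only on the Vietnamese benchmark vi-MTVQA.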

  ## Examples