czczup committed · Commit e706259 · verified · 1 Parent(s): edaa746

fix compatibility issue for transformers 4.46+

Files changed (2)
  1. README.md +18 -278
  2. configuration_internvl_chat.py +2 -2
README.md CHANGED
@@ -5,6 +5,7 @@ library_name: transformers
5
  base_model:
6
  - OpenGVLab/InternViT-300M-448px
7
  - internlm/internlm2_5-7b-chat
 
8
  base_model_relation: merge
9
  language:
10
  - multilingual
@@ -19,13 +20,13 @@ tags:
19
 
20
  # InternVL2-8B
21
 
22
- [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821)
23
 
24
  [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 中文解读\]](https://zhuanlan.zhihu.com/p/706547971) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
25
 
26
- [Switch to the Chinese version](#简介)
27
-
28
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/_mLpMwsav5eMeNcZdrIQl.png)
29
 
30
  ## Introduction
31
 
@@ -65,7 +66,7 @@ InternVL 2.0 is a multimodal large language model series, featuring models of va
65
  | MME<sub>sum</sub> | 2024.6 | 2187.8 | 2210.3 |
66
  | RealWorldQA | 63.5 | 66.0 | 64.4 |
67
  | AI2D<sub>test</sub> | 78.4 | 80.7 | 83.8 |
68
- | MMMU<sub>val</sub> | 45.8 | 45.2 / 46.8 | 49.3 / 51.8 |
69
  | MMBench-EN<sub>test</sub> | 77.2 | 82.2 | 81.7 |
70
  | MMBench-CN<sub>test</sub> | 74.2 | 82.0 | 81.2 |
71
  | CCBench<sub>dev</sub> | 45.9 | 69.8 | 75.9 |
@@ -78,9 +79,7 @@ InternVL 2.0 is a multimodal large language model series, featuring models of va
78
 
79
  - For more details and evaluation reproduction, please refer to our [Evaluation Guide](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html).
80
 
81
- - We simultaneously use [InternVL](https://github.com/OpenGVLab/InternVL) and [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were tested using the InternVL repository. OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using the VLMEvalKit.
82
-
83
- - For MMMU, we report both the original scores (left side: evaluated using the InternVL codebase for InternVL series models, and sourced from technical reports or webpages for other models) and the VLMEvalKit scores (right side: collected from the OpenCompass leaderboard).
84
 
85
  - Please note that evaluating the same model using different testing toolkits like [InternVL](https://github.com/OpenGVLab/InternVL) and [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) can result in slight differences, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies in results.
86
 
@@ -130,7 +129,7 @@ We provide an example code to run InternVL2-8B using `transformers`.
130
 
131
  We also welcome you to experience the InternVL2 series models in our [online demo](https://internvl.opengvlab.com/).
132
 
133
- > Please use transformers==4.37.2 to ensure the model works normally.
134
 
135
  ### Model Loading
136
 
@@ -442,7 +441,7 @@ response, history = model.chat(tokenizer, pixel_values, question, generation_con
442
  print(f'User: {question}\nAssistant: {response}')
443
  ```
444
 
445
- #### Streaming output
446
 
447
  Besides this method, you can also use the following code to get streamed output.
448
 
@@ -482,12 +481,12 @@ Many repositories now support fine-tuning of the InternVL series models, includi
482
  LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by the MMRazor and MMDeploy teams.
483
 
484
  ```sh
485
- pip install lmdeploy==0.5.3
486
  ```
487
 
488
  LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.
489
 
490
- #### A 'Hello, world' example
491
 
492
  ```python
493
  from lmdeploy import pipeline, TurbomindEngineConfig
@@ -502,7 +501,7 @@ print(response.text)
502
 
503
  If `ImportError` occurs while executing this case, please install the required dependency packages as prompted.
504
 
505
- #### Multi-images inference
506
 
507
  When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased.
508
 
@@ -527,7 +526,7 @@ response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe thes
527
  print(response.text)
528
  ```
529
 
530
- #### Batch prompts inference
531
 
532
  Conducting inference with batch prompts is quite straightforward; just place them within a list structure:
533
 
@@ -547,7 +546,7 @@ response = pipe(prompts)
547
  print(response)
548
  ```
549
 
550
- #### Multi-turn conversation
551
 
552
  There are two ways to do the multi-turn conversations with the pipeline. One is to construct messages according to the format of OpenAI and use above introduced method, the other is to use the `pipeline.chat` interface.
553
 
@@ -617,271 +616,12 @@ This project is released under the MIT license, while InternLM2 is licensed unde
617
  If you find this project useful in your research, please consider citing:
618
 
619
  ```BibTeX
620
- @article{chen2023internvl,
621
- title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
622
- author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
623
- journal={arXiv preprint arXiv:2312.14238},
624
- year={2023}
625
- }
626
- @article{chen2024far,
627
- title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
628
- author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
629
- journal={arXiv preprint arXiv:2404.16821},
630
  year={2024}
631
  }
632
- ```
633
-
634
- ## Introduction
-
- We are excited to announce the release of InternVL 2.0, the latest version of the InternVL series of multimodal large language models. InternVL 2.0 offers a variety of **instruction-tuned** models, ranging from 1 billion to 108 billion parameters. This repository contains the instruction-tuned InternVL2-8B model.
-
- Compared with state-of-the-art open-source multimodal large language models, InternVL 2.0 surpasses most open-source models. It is competitive with closed-source commercial models across a range of capabilities, including document and chart understanding, infographics QA, scene text understanding and OCR tasks, scientific and mathematical problem solving, as well as cultural understanding and integrated multimodal capabilities.
-
- InternVL 2.0 is trained with an 8k context window on data that includes long texts, multiple images, and videos, which significantly improves its ability to handle these input types compared with InternVL 1.5. For more details, please refer to our blog and GitHub.
641
-
642
- | Model Name | Vision Part | Language Part | HF Link | MS Link |
- | :---: | :---: | :---: | :---: | :---: |
- | InternVL2-1B | [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2-1B) | [🤖 link](https://modelscope.cn/models/OpenGVLab/InternVL2-1B) |
- | InternVL2-2B | [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [internlm2-chat-1_8b](https://huggingface.co/internlm/internlm2-chat-1_8b) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2-2B) | [🤖 link](https://modelscope.cn/models/OpenGVLab/InternVL2-2B) |
- | InternVL2-4B | [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2-4B) | [🤖 link](https://modelscope.cn/models/OpenGVLab/InternVL2-4B) |
- | InternVL2-8B | [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [internlm2_5-7b-chat](https://huggingface.co/internlm/internlm2_5-7b-chat) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2-8B) | [🤖 link](https://modelscope.cn/models/OpenGVLab/InternVL2-8B) |
- | InternVL2-26B | [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) | [internlm2-chat-20b](https://huggingface.co/internlm/internlm2-chat-20b) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2-26B) | [🤖 link](https://modelscope.cn/models/OpenGVLab/InternVL2-26B) |
- | InternVL2-40B | [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) | [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2-40B) | [🤖 link](https://modelscope.cn/models/OpenGVLab/InternVL2-40B) |
- | InternVL2-Llama3-76B | [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) | [Hermes-2-Theta-Llama-3-70B](https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-70B) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B) | [🤖 link](https://modelscope.cn/models/OpenGVLab/InternVL2-Llama3-76B) |
651
-
652
- ## Model Details
-
- InternVL 2.0 is a multimodal large language model series that includes models of various sizes. For each size, we release an instruction-tuned model optimized for multimodal tasks. InternVL2-8B consists of [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px), an MLP projector, and [internlm2_5-7b-chat](https://huggingface.co/internlm/internlm2_5-7b-chat).
-
- ## Performance
-
- ### Image Benchmarks
-
- | Benchmark | MiniCPM-Llama3-V-2_5 | InternVL-Chat-V1-5 | InternVL2-8B |
- | :--------------------------: | :------------------: | :----------------: | :----------: |
- | Model Size | 8.5B | 25.5B | 8.1B |
663
- | | | | |
664
- | DocVQA<sub>test</sub> | 84.8 | 90.9 | 91.6 |
665
- | ChartQA<sub>test</sub> | - | 83.8 | 83.3 |
666
- | InfoVQA<sub>test</sub> | - | 72.5 | 74.8 |
667
- | TextVQA<sub>val</sub> | 76.6 | 80.6 | 77.4 |
668
- | OCRBench | 725 | 724 | 794 |
669
- | MME<sub>sum</sub> | 2024.6 | 2187.8 | 2210.3 |
670
- | RealWorldQA | 63.5 | 66.0 | 64.4 |
671
- | AI2D<sub>test</sub> | 78.4 | 80.7 | 83.8 |
672
- | MMMU<sub>val</sub> | 45.8 | 45.2 / 46.8 | 49.3 / 51.8 |
673
- | MMBench-EN<sub>test</sub> | 77.2 | 82.2 | 81.7 |
674
- | MMBench-CN<sub>test</sub> | 74.2 | 82.0 | 81.2 |
675
- | CCBench<sub>dev</sub> | 45.9 | 69.8 | 75.9 |
676
- | MMVet<sub>GPT-4-0613</sub> | - | 62.8 | 60.0 |
677
- | MMVet<sub>GPT-4-Turbo</sub> | 52.8 | 55.4 | 54.2 |
678
- | SEED-Image | 72.3 | 76.0 | 76.2 |
679
- | HallBench<sub>avg</sub> | 42.4 | 49.3 | 45.2 |
680
- | MathVista<sub>testmini</sub> | 54.3 | 53.5 | 58.3 |
681
- | OpenCompass<sub>avg</sub> | 58.8 | 61.7 | 64.1 |
682
-
683
- - For more details and evaluation reproduction, please see our [Evaluation Guide](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html).
-
- - We use both the InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were tested using the InternVL repository. OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using VLMEvalKit.
-
- - For MMMU, we report both the original scores (left: InternVL series models evaluated with the InternVL codebase; scores for other models taken from their technical reports or webpages) and the VLMEvalKit scores (right: collected from the OpenCompass leaderboard).
-
- - Please note that evaluating the same model with different testing toolkits (such as InternVL and VLMEvalKit) can lead to slight differences, which is normal. Updates to code versions and changes in environment and hardware can also cause minor discrepancies in results.
690
-
691
- ### Video Benchmarks
-
- | Benchmark | VideoChat2-HD-Mistral | Video-CCAM-9B | InternVL2-4B | InternVL2-8B |
- | :-------------------------: | :-------------------: | :-----------: | :----------: | :----------: |
- | Model Size | 7B | 9B | 4.2B | 8.1B |
696
- | | | | | |
697
- | MVBench | 60.4 | 60.7 | 63.7 | 66.4 |
698
- | MMBench-Video<sub>8f</sub> | - | - | 1.10 | 1.19 |
699
- | MMBench-Video<sub>16f</sub> | - | - | 1.18 | 1.28 |
700
- | Video-MME<br>w/o subs | 42.3 | 50.6 | 51.4 | 54.0 |
701
- | Video-MME<br>w subs | 54.6 | 54.9 | 53.4 | 56.9 |
702
-
703
- - We evaluate our models on MVBench and Video-MME by extracting 16 frames from each video; each frame is resized to a 448x448 image.
-
- ### Grounding Benchmarks
706
-
707
- | Model | avg. | RefCOCO<br>(val) | RefCOCO<br>(testA) | RefCOCO<br>(testB) | RefCOCO+<br>(val) | RefCOCO+<br>(testA) | RefCOCO+<br>(testB) | RefCOCO-g<br>(val) | RefCOCO-g<br>(test) |
- | :----------------------------: | :--: | :--------------: | :----------------: | :----------------: | :---------------: | :-----------------: | :-----------------: | :----------------: | :-----------------: |
- | UNINEXT-H<br>(Specialist SOTA) | 88.9 | 92.6 | 94.3 | 91.5 | 85.2 | 89.6 | 79.8 | 88.7 | 89.4 |
- | | | | | | | | | | |
- | Mini-InternVL-<br>Chat-2B-V1-5 | 75.8 | 80.7 | 86.7 | 72.9 | 72.5 | 82.3 | 60.8 | 75.6 | 74.9 |
- | Mini-InternVL-<br>Chat-4B-V1-5 | 84.4 | 88.0 | 91.4 | 83.5 | 81.5 | 87.4 | 73.8 | 84.7 | 84.6 |
- | InternVL-Chat-V1-5 | 88.8 | 91.4 | 93.7 | 87.1 | 87.0 | 92.3 | 80.9 | 88.5 | 89.3 |
- | | | | | | | | | | |
- | InternVL2-1B | 79.9 | 83.6 | 88.7 | 79.8 | 76.0 | 83.6 | 67.7 | 80.2 | 79.9 |
- | InternVL2-2B | 77.7 | 82.3 | 88.2 | 75.9 | 73.5 | 82.8 | 63.3 | 77.6 | 78.3 |
- | InternVL2-4B | 84.4 | 88.5 | 91.2 | 83.9 | 81.2 | 87.2 | 73.8 | 84.6 | 84.6 |
- | InternVL2-8B | 82.9 | 87.1 | 91.1 | 80.7 | 79.8 | 87.9 | 71.4 | 82.7 | 82.7 |
- | InternVL2-26B | 88.5 | 91.2 | 93.3 | 87.4 | 86.8 | 91.0 | 81.2 | 88.5 | 88.6 |
- | InternVL2-40B | 90.3 | 93.0 | 94.7 | 89.2 | 88.5 | 92.8 | 83.6 | 90.3 | 90.6 |
- | InternVL2-<br>Llama3-76B | 90.0 | 92.2 | 94.8 | 88.4 | 88.8 | 93.1 | 82.8 | 89.5 | 90.3 |
722
-
723
- - We use the following prompt to evaluate InternVL's grounding ability: `Please provide the bounding box coordinates of the region this sentence describes: <ref>{}</ref>`
-
- Limitations: Although we paid great attention to model safety during training and tried hard to make the model produce text that complies with ethical and legal requirements, the model may still produce unexpected outputs because of its size and the probabilistic generation paradigm. For example, responses may contain bias, discrimination, or other harmful content. Please do not propagate such content. This project bears no responsibility for any consequences caused by the dissemination of harmful information.
-
- ### Invitation to Evaluate InternVL
-
- We welcome MLLM benchmark developers to evaluate our InternVL1.5 and InternVL2 series models. If you need to add evaluation results here, please contact me ([[email protected]](mailto:[email protected])).
730
-
731
- ## Quick Start
-
- We provide example code for running InternVL2-8B with `transformers`.
-
- You are also welcome to try the InternVL2 series models in our [online demo](https://internvl.opengvlab.com/).
-
- > Please use transformers==4.37.2 to ensure the model works normally.
-
- For example code, please [click here](#quick-start).
-
- ## Finetune
-
- Many repositories now support fine-tuning of the InternVL series models, including [InternVL](https://github.com/OpenGVLab/InternVL), [SWIFT](https://github.com/modelscope/ms-swift), [XTurner](https://github.com/InternLM/xtuner), and others. Please refer to their documentation for more fine-tuning details.
-
- ## Deployment
-
- ### LMDeploy
-
- LMDeploy is a toolkit for compressing, deploying, and serving large language models (LLMs), developed by the MMRazor and MMDeploy teams.
750
-
751
- ```sh
752
- pip install lmdeploy==0.5.3
753
- ```
754
-
755
- LMDeploy abstracts the complex inference process of multimodal vision-language models (VLMs) into an easy-to-use pipeline, similar to the large language model (LLM) inference pipeline.
-
- #### A 'Hello, world' example
758
-
759
- ```python
760
- from lmdeploy import pipeline, TurbomindEngineConfig
761
- from lmdeploy.vl import load_image
762
-
763
- model = 'OpenGVLab/InternVL2-8B'
764
- image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
765
- pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
766
- response = pipe(('describe this image', image))
767
- print(response.text)
768
- ```
769
-
770
- If an `ImportError` occurs while running this example, please install the required dependency packages as prompted.
-
- #### Multi-image inference
-
- When handling multiple images, you can put them all in one list. Note that multiple images increase the number of input tokens, so the context window size usually needs to be increased.
775
-
776
- ```python
777
- from lmdeploy import pipeline, TurbomindEngineConfig
778
- from lmdeploy.vl import load_image
779
- from lmdeploy.vl.constants import IMAGE_TOKEN
780
-
781
- model = 'OpenGVLab/InternVL2-8B'
782
- pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
783
-
784
- image_urls=[
785
- 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
786
- 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
787
- ]
788
-
789
- images = [load_image(img_url) for img_url in image_urls]
790
- # Numbering images improves multi-image conversations
791
- response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
792
- print(response.text)
793
- ```
794
-
795
- #### Batch prompt inference
-
- Inference with batch prompts is straightforward; just put them in a list:
798
-
799
- ```python
800
- from lmdeploy import pipeline, TurbomindEngineConfig
801
- from lmdeploy.vl import load_image
802
-
803
- model = 'OpenGVLab/InternVL2-8B'
804
- pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
805
-
806
- image_urls=[
807
- "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
808
- "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
809
- ]
810
- prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
811
- response = pipe(prompts)
812
- print(response)
813
- ```
814
-
815
- #### Multi-turn conversation
-
- There are two ways to hold a multi-turn conversation with the pipeline. One is to construct messages in the OpenAI format and use the method described above; the other is to use the `pipeline.chat` interface.
818
-
819
- ```python
820
- from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
821
- from lmdeploy.vl import load_image
822
-
823
- model = 'OpenGVLab/InternVL2-8B'
824
- pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
825
-
826
- image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
827
- gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
828
- sess = pipe.chat(('describe this image', image), gen_config=gen_config)
829
- print(sess.response.text)
830
- sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
831
- print(sess.response.text)
832
- ```
833
-
834
- #### API deployment
-
- LMDeploy's `api_server` lets a model be packaged as a service with a single command. The provided RESTful API is compatible with OpenAI's interface. Below is an example of starting the service:
837
-
838
- ```shell
839
- lmdeploy serve api_server OpenGVLab/InternVL2-8B --backend turbomind --server-port 23333
840
- ```
841
-
842
- To use the OpenAI-style API, you need to install OpenAI:
843
-
844
- ```shell
845
- pip install openai
846
- ```
847
-
848
- Then, use the code below to make the API call:
849
-
850
- ```python
851
- from openai import OpenAI
852
-
853
- client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
854
- model_name = client.models.list().data[0].id
855
- response = client.chat.completions.create(
856
- model=model_name,
857
- messages=[{
858
- 'role':
859
- 'user',
860
- 'content': [{
861
- 'type': 'text',
862
- 'text': 'describe this image',
863
- }, {
864
- 'type': 'image_url',
865
- 'image_url': {
866
- 'url':
867
- 'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
868
- },
869
- }],
870
- }],
871
- temperature=0.8,
872
- top_p=0.8)
873
- print(response)
874
- ```
875
-
876
- ## License
-
- This project is released under the MIT license, while InternLM2 is licensed under the Apache-2.0 license.
-
- ## Citation
-
- If you find this project useful in your research, please consider citing our paper:
883
-
884
- ```BibTeX
885
  @article{chen2023internvl,
886
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
887
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
 
5
  base_model:
6
  - OpenGVLab/InternViT-300M-448px
7
  - internlm/internlm2_5-7b-chat
8
+ new_version: OpenGVLab/InternVL2_5-8B
9
  base_model_relation: merge
10
  language:
11
  - multilingual
 
20
 
21
  # InternVL2-8B
22
 
23
+ [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5\]](https://arxiv.org/abs/2404.16821) [\[📜 Mini-InternVL\]](https://arxiv.org/abs/2410.16261)
24
 
25
  [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 中文解读\]](https://zhuanlan.zhihu.com/p/706547971) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
26
 
27
+ <div align="center">
28
+ <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
29
+ </div>
30
 
31
  ## Introduction
32
 
 
66
  | MME<sub>sum</sub> | 2024.6 | 2187.8 | 2210.3 |
67
  | RealWorldQA | 63.5 | 66.0 | 64.4 |
68
  | AI2D<sub>test</sub> | 78.4 | 80.7 | 83.8 |
69
+ | MMMU<sub>val</sub> | 45.8 | 46.8 | 51.8 |
70
  | MMBench-EN<sub>test</sub> | 77.2 | 82.2 | 81.7 |
71
  | MMBench-CN<sub>test</sub> | 74.2 | 82.0 | 81.2 |
72
  | CCBench<sub>dev</sub> | 45.9 | 69.8 | 75.9 |
 
79
 
80
  - For more details and evaluation reproduction, please refer to our [Evaluation Guide](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html).
81
 
82
+ - We simultaneously use [InternVL](https://github.com/OpenGVLab/InternVL) and [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet (GPT-4-0613), and SEED-Image were tested using the InternVL repository. MMMU, OCRBench, RealWorldQA, HallBench, MMVet (GPT-4-Turbo), and MathVista were evaluated using the VLMEvalKit.
 
 
83
 
84
  - Please note that evaluating the same model using different testing toolkits like [InternVL](https://github.com/OpenGVLab/InternVL) and [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) can result in slight differences, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies in results.
85
 
 
129
 
130
  We also welcome you to experience the InternVL2 series models in our [online demo](https://internvl.opengvlab.com/).
131
 
132
+ > Please use transformers>=4.37.2 to ensure the model works normally.
133
 
134
  ### Model Loading
135
 
 
441
  print(f'User: {question}\nAssistant: {response}')
442
  ```
443
 
444
+ #### Streaming Output
445
 
446
  Besides this method, you can also use the following code to get streamed output.
447
 
 
481
  LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by the MMRazor and MMDeploy teams.
482
 
483
  ```sh
484
+ pip install 'lmdeploy>=0.5.3'
485
  ```
486
 
487
  LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.
488
 
489
+ #### A 'Hello, world' Example
490
 
491
  ```python
492
  from lmdeploy import pipeline, TurbomindEngineConfig
 
501
 
502
  If `ImportError` occurs while executing this case, please install the required dependency packages as prompted.
503
 
504
+ #### Multi-images Inference
505
 
506
  When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased.
507
 
 
526
  print(response.text)
527
  ```
528
 
529
+ #### Batch Prompts Inference
530
 
531
  Conducting inference with batch prompts is quite straightforward; just place them within a list structure:
532
 
 
546
  print(response)
547
  ```
548
 
549
+ #### Multi-turn Conversation
550
 
551
  There are two ways to do the multi-turn conversations with the pipeline. One is to construct messages according to the format of OpenAI and use above introduced method, the other is to use the `pipeline.chat` interface.
552
 
 
616
  If you find this project useful in your research, please consider citing:
617
 
618
  ```BibTeX
619
+ @article{gao2024mini,
620
+ title={Mini-internvl: A flexible-transfer pocket multimodal model with 5\% parameters and 90\% performance},
621
+ author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others},
622
+ journal={arXiv preprint arXiv:2410.16261},
623
  year={2024}
624
  }
625
  @article{chen2023internvl,
626
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
627
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
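The README hunks above relax the pin from `transformers==4.37.2` to `transformers>=4.37.2`. As a rough illustration that is not part of the commit, a script could check the installed version before loading the model. The helper names `parse_version`, `meets_minimum`, and `installed_transformers_ok` are hypothetical; `importlib.metadata.version` comes from the Python standard library.

```python
from importlib import metadata


def parse_version(text):
    """Turn '4.46.1' into (4, 46, 1); stop at the first non-numeric part."""
    parts = []
    for piece in text.split('.'):
        if not piece.isdigit():
            break  # e.g. 'dev0' or 'rc1' suffixes are ignored in this sketch
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum='4.37.2'):
    """Compare numerically, so '4.5.0' is correctly older than '4.37.2'."""
    return parse_version(installed) >= parse_version(minimum)


def installed_transformers_ok(minimum='4.37.2'):
    """Return False when transformers is missing or older than `minimum`."""
    try:
        return meets_minimum(metadata.version('transformers'), minimum)
    except metadata.PackageNotFoundError:
        return False
```

Comparing tuples of integers rather than raw strings avoids the usual pitfall where `'4.5.0' > '4.37.2'` as text. A production check would more likely rely on a full version parser, since this sketch ignores pre-release suffixes.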
configuration_internvl_chat.py CHANGED
@@ -39,11 +39,11 @@ class InternVLChatConfig(PretrainedConfig):
         super().__init__(**kwargs)
 
         if vision_config is None:
-            vision_config = {}
+            vision_config = {'architectures': ['InternVisionModel']}
             logger.info('vision_config is None. Initializing the InternVisionConfig with default values.')
 
         if llm_config is None:
-            llm_config = {}
+            llm_config = {'architectures': ['InternLM2ForCausalLM']}
             logger.info('llm_config is None. Initializing the LlamaConfig config with default values (`LlamaConfig`).')
 
         self.vision_config = InternVisionConfig(**vision_config)
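The diff does not show why an empty default dictionary breaks under transformers 4.46+. One plausible reading, stated here as an assumption, is that the rest of the constructor selects the language-model config class from `llm_config['architectures'][0]`, and that newer library versions end up constructing the config with no arguments. The sketch below imitates that dispatch with a hypothetical function, `pick_llm_config_name`, that returns class names as strings instead of importing the real classes.

```python
def pick_llm_config_name(llm_config=None):
    """Return the name of the config class a dispatch like this would choose.

    Hypothetical stand-in for the constructor logic; not the repository's code.
    """
    if llm_config is None:
        # Default introduced by this commit: name the architecture explicitly.
        llm_config = {'architectures': ['InternLM2ForCausalLM']}

    # An empty dict (the old default) fails on this lookup with KeyError.
    arch = llm_config['architectures'][0]

    if arch == 'LlamaForCausalLM':
        return 'LlamaConfig'
    if arch == 'InternLM2ForCausalLM':
        return 'InternLM2Config'
    raise ValueError(f'Unsupported architecture: {arch}')
```

Calling `pick_llm_config_name()` with no arguments follows the new default, while `pick_llm_config_name({})` reproduces the old default and raises `KeyError`, which is the failure mode assumed above.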