czczup committed on
Commit
be6640a
·
verified ·
1 Parent(s): 828dc2a

Update README.md

README.md CHANGED
@@ -1,298 +1,89 @@
1
  ---
2
  license: mit
3
- datasets:
4
- - laion/laion2B-en
5
- - laion/laion-coco
6
- - laion/laion2B-multi
7
- - kakaobrain/coyo-700m
8
- - conceptual_captions
9
- - wanng/wukong100m
10
  pipeline_tag: image-feature-extraction
 
 
11
  ---
12
 
13
  # InternViT-6B-448px-V2_5
14
 
15
- [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821) [\[📜 Mini-InternVL Report\]](https://arxiv.org/abs/2410.16261)
16
 
17
  [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
18
 
19
- InternViT-6B has been updated.
20
- We develop InternViT-6B-448px-V2_5 based on the pre-training of the strong foundation of InternViT-6B-448px-V1-5. Through ViT Incremental Learning with NTP loss, the vision encoder has gained a stronger ability to extract visual features, allowing it to capture more comprehensive information—especially for domains that are relatively underrepresented in web-scale datasets (e.g., LAION-5B), such as multilingual OCR data and mathematical charts, among others.
21
-
22
  <div align="center">
 
 
23
 
24
- | Model Name | HF Link |
25
- | :---------------------: | :----------------------------------------: |
26
- |InternViT-300M-448px-V2_5|[🤗 link](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) |
27
- |InternViT-6B-448px-V2_5|[🤗 link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5)|
28
 
29
- </div>
30
 
31
- ## Model Details
32
-
33
- - **Model Type:** vision foundation model, feature backbone
34
- - **Model Stats:**
35
- - Params (M): 5540 (the last 3 blocks are discarded)
36
- - Image size: 448 x 448, training with 1 - 12 tiles
37
- - **Pretrain Dataset:** LAION-en, LAION-zh, COYO, GRIT, COCO, TextCaps, Objects365, OpenImages, All-Seeing, Wukong-OCR, LaionCOCO-OCR, and other OCR-related datasets.
38
- To enhance the OCR capability of the model, we have incorporated additional OCR data alongside the general caption datasets. Specifically, we utilized PaddleOCR to perform Chinese OCR on images from Wukong and English OCR on images from LAION-COCO.
39
- - **Note:** InternViT-6B originally had 48 blocks, and we found that using the output after the fourth-to-last block worked best for MLLM. For ease of use and to save GPU memory, we simply discarded the last 3 blocks. Now, the model has only 45 blocks and the number of parameters has been reduced from 5.9B to 5.5B. Therefore, if you want to build a MLLM based on this model, **please make use of the features from the last layer.**
40
-
41
- ## Performance
42
-
43
-
44
- ### Image Classification
45
-
46
- <table style="width: 100%; text-align: center; border-collapse: collapse;">
47
- <!-- <caption><strong>Image classification performance across different versions of InternViT.</strong><br>
48
- We use IN-1K for training and evaluate on the IN-1K validation set as well as multiple ImageNet variants, including IN-ReaL, IN-V2, IN-A, IN-R, and IN-Sketch. Results are reported for both linear probing and attention pooling probing methods, with average accuracy for each method. $\Delta$ represents the performance gap between attention pooling probing and linear probing, where a larger $\Delta$ suggests a shift from learning simple linear features to capturing more complex, nonlinear semantic representations.
49
- </caption> -->
50
- <thead>
51
- <tr>
52
- <th rowspan="2">Model Name</th>
53
- <th rowspan="2">res.</th>
54
- <th colspan="7">Linear Probing</th>
55
- <th colspan="7">Attention Pooling Probing</th>
56
- <th rowspan="2">Δ</th>
57
- </tr>
58
- <tr>
59
- <th>IN-1K</th>
60
- <th>IN-ReaL</th>
61
- <th>IN-V2</th>
62
- <th>IN-A</th>
63
- <th>IN-R</th>
64
- <th>IN-Ske</th>
65
- <th>avg.</th>
66
- <th>IN-1K</th>
67
- <th>IN-ReaL</th>
68
- <th>IN-V2</th>
69
- <th>IN-A</th>
70
- <th>IN-R</th>
71
- <th>IN-Ske</th>
72
- <th>avg.</th>
73
- </tr>
74
- </thead>
75
- <tbody>
76
- <tr style="color: gray;">
77
- <td>InternViT-6B-224px</td>
78
- <td>224</td>
79
- <td>88.2</td>
80
- <td>90.4</td>
81
- <td>79.9</td>
82
- <td>77.5</td>
83
- <td>89.8</td>
84
- <td>69.1</td>
85
- <td style="background-color: #d3d3d3;">82.5</td>
86
- <td>89.2</td>
87
- <td>91.1</td>
88
- <td>82.3</td>
89
- <td>84.7</td>
90
- <td>93.1</td>
91
- <td>72.7</td>
92
- <td style="background-color: #d3d3d3;">85.5</td>
93
- <td>3.0</td>
94
- </tr>
95
- <tr>
96
- <td>InternViT-6B-224px</td>
97
- <td>448</td>
98
- <td>87.8</td>
99
- <td>90.2</td>
100
- <td>79.8</td>
101
- <td>77.2</td>
102
- <td>87.1</td>
103
- <td>65.8</td>
104
- <td style="background-color: #d3d3d3;">81.3</td>
105
- <td>88.8</td>
106
- <td>91.0</td>
107
- <td>82.0</td>
108
- <td>85.4</td>
109
- <td>91.3</td>
110
- <td>70.5</td>
111
- <td style="background-color: #d3d3d3;">84.8</td>
112
- <td>3.5</td>
113
- </tr>
114
- <tr>
115
- <td>InternViT-6B-448px-V1.0</td>
116
- <td>448</td>
117
- <td>87.0</td>
118
- <td>90.0</td>
119
- <td>78.8</td>
120
- <td>77.2</td>
121
- <td>85.5</td>
122
- <td>65.1</td>
123
- <td style="background-color: #ffcccc;">80.6</td>
124
- <td>88.7</td>
125
- <td>91.0</td>
126
- <td>82.0</td>
127
- <td>88.7</td>
128
- <td>92.8</td>
129
- <td>72.0</td>
130
- <td style="background-color: #90ee90;">85.9</td>
131
- <td>5.3</td>
132
- </tr>
133
- <tr>
134
- <td>InternViT-6B-448px-V1.2</td>
135
- <td>448</td>
136
- <td>87.0</td>
137
- <td>89.9</td>
138
- <td>78.5</td>
139
- <td>77.1</td>
140
- <td>83.9</td>
141
- <td>59.7</td>
142
- <td style="background-color: #ffcccc;">79.4</td>
143
- <td>88.6</td>
144
- <td>91.1</td>
145
- <td>82.0</td>
146
- <td>88.7</td>
147
- <td>92.7</td>
148
- <td>71.6</td>
149
- <td style="background-color: #90ee90;">85.8</td>
150
- <td>6.4</td>
151
- </tr>
152
- <tr>
153
- <td>InternViT-6B-448px-V1.5</td>
154
- <td>448</td>
155
- <td>86.5</td>
156
- <td>89.9</td>
157
- <td>78.1</td>
158
- <td>69.8</td>
159
- <td>82.9</td>
160
- <td>60.1</td>
161
- <td style="background-color: #ffcccc;">77.9</td>
162
- <td>88.4</td>
163
- <td>91.2</td>
164
- <td>81.6</td>
165
- <td>86.0</td>
166
- <td>92.2</td>
167
- <td>70.9</td>
168
- <td style="background-color: #90ee90;">85.1</td>
169
- <td>7.2</td>
170
- </tr>
171
- <tr>
172
- <td>InternViT-6B-448px-V2.5</td>
173
- <td>448</td>
174
- <td>86.6</td>
175
- <td>90.1</td>
176
- <td>77.8</td>
177
- <td>73.7</td>
178
- <td>82.7</td>
179
- <td>60.0</td>
180
- <td style="background-color: #ffcccc;">78.5</td>
181
- <td>88.3</td>
182
- <td>91.2</td>
183
- <td>81.3</td>
184
- <td>86.9</td>
185
- <td>92.4</td>
186
- <td>70.8</td>
187
- <td style="background-color: #90ee90;">85.2</td>
188
- <td>6.7</td>
189
- </tr>
190
- </tbody>
191
- </table>
192
-
193
- **Note:** ∆ represents the performance gap between attention pooling probing and linear probing, where a larger ∆ suggests a shift from learning simple linear features to capture more complex, nonlinear semantic representations
194
-
195
- ### Semantic Segmentation
196
-
197
- <table style="width:100%; border-collapse: collapse; text-align: center;">
198
- <thead>
199
- <tr>
200
- <th rowspan="2">Model Name</th>
201
- <th colspan="3" >Linear Probing</th>
202
- <th colspan="3" >Head Tuning (UperNet)</th>
203
- <th colspan="3" >Full Tuning (UperNet)</th>
204
- <th rowspan="2">Δ<sub>1</sub></th>
205
- <th rowspan="2">Δ<sub>2</sub></th>
206
- </tr>
207
- <tr>
208
- <th>ADE20K</th>
209
- <th>COCO</th>
210
- <th>avg.</th>
211
- <th>ADE20K</th>
212
- <th>COCO</th>
213
- <th>avg.</th>
214
- <th>ADE20K</th>
215
- <th>COCO</th>
216
- <th>avg.</th>
217
- </tr>
218
- </thead>
219
- <tbody>
220
- <tr>
221
- <td>InternViT-6B-224px</td>
222
- <td>47.2</td>
223
- <td>42.8</td>
224
- <td >45.0</td>
225
- <td>54.9</td>
226
- <td>48.9</td>
227
- <td >51.9</td>
228
- <td>58.9</td>
229
- <td>51.6</td>
230
- <td>55.3</td>
231
- <td>6.9</td>
232
- <td>10.2</td>
233
- </tr>
234
- <tr>
235
- <td>InternViT-6B-448px-V1.0</td>
236
- <td>43.6</td>
237
- <td>38.5</td>
238
- <td >41.0</td>
239
- <td>55.4</td>
240
- <td>49.4</td>
241
- <td>52.4</td>
242
- <td>58.1</td>
243
- <td>51.7</td>
244
- <td>54.9</td>
245
- <td>11.3</td>
246
- <td>13.9</td>
247
- </tr>
248
- <tr>
249
- <td>InternViT-6B-448px-V1.2</td>
250
- <td>40.7</td>
251
- <td>36.1</td>
252
- <td >38.4</td>
253
- <td>55.2</td>
254
- <td>48.8</td>
255
- <td>52.0</td>
256
- <td>58.8</td>
257
- <td>51.7</td>
258
- <td>55.2</td>
259
- <td>13.6</td>
260
- <td>16.8</td>
261
- </tr>
262
- <tr>
263
- <td>InternViT-6B-448px-V1.5</td>
264
- <td>40.9</td>
265
- <td>36.3</td>
266
- <td >38.6</td>
267
- <td>55.0</td>
268
- <td>49.1</td>
269
- <td>52.0</td>
270
- <td>58.8</td>
271
- <td>51.5</td>
272
- <td>55.2</td>
273
- <td>13.4</td>
274
- <td>16.6</td>
275
- </tr>
276
- <tr>
277
- <td>InternViT-6B-448px-V2.5</td>
278
- <td>39.4</td>
279
- <td>35.6</td>
280
- <td >37.5</td>
281
- <td>55.4</td>
282
- <td>49.7</td>
283
- <td>52.6</td>
284
- <td>58.6</td>
285
- <td>51.8</td>
286
- <td>55.2</td>
287
- <td>15.1</td>
288
- <td>17.7</td>
289
- </tr>
290
- </tbody>
291
- </table>
292
-
293
- **Note:** Δ<sub>1</sub> represents the gap between head tuning and linear probing, while Δ<sub>2</sub> shows the gap between full tuning and linear probing. A larger Δ value indicates a shift from simple linear features to more complex, nonlinear representations.
294
-
295
- ## Model Usage (Image Embeddings)
296
 
297
  ```python
298
  import torch
@@ -315,17 +106,20 @@ pixel_values = pixel_values.to(torch.bfloat16).cuda()
315
  outputs = model(pixel_values)
316
  ```
317
 
 
 
 
 
318
  ## Citation
319
 
320
  If you find this project useful in your research, please consider citing:
321
 
322
  ```BibTeX
323
-
324
- @article{chen2023internvl,
325
- title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
326
- author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
327
- journal={arXiv preprint arXiv:2312.14238},
328
- year={2023}
329
  }
330
  @article{chen2024far,
331
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
@@ -333,5 +127,10 @@ If you find this project useful in your research, please consider citing:
333
  journal={arXiv preprint arXiv:2404.16821},
334
  year={2024}
335
  }
336
-
 
 
 
 
 
337
  ```
 
1
  ---
2
  license: mit
 
 
 
 
 
 
 
3
  pipeline_tag: image-feature-extraction
4
+ base_model: OpenGVLab/InternViT-6B-448px-V1-5
5
+ base_model_relation: finetune
6
  ---
7
 
8
  # InternViT-6B-448px-V2_5
9
 
10
+ [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5\]](https://arxiv.org/abs/2404.16821) [\[📜 InternVL 2.5\]](https://github.com/OpenGVLab/InternVL/blob/main/InternVL2_5_report.pdf)
11
 
12
  [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
13
 
 
 
 
14
  <div align="center">
15
+ <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
16
+ </div>
17
 
18
+ ## Introduction
 
 
 
19
 
20
+ We are excited to announce the release of `InternViT-6B-448px-V2_5`, a significant enhancement built on the foundation of `InternViT-6B-448px-V1-5`. By employing **ViT incremental learning** with NTP loss (Stage 1.5), the vision encoder has improved its ability to extract visual features, enabling it to capture more comprehensive information. This improvement is particularly noticeable in domains that are underrepresented in large-scale web datasets such as LAION-5B, including multilingual OCR data and mathematical charts, among others.
21
+
22
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/o9_FX5D8_NOS1gfnebp5s.png)
23
+
24
+ ## InternViT 2.5 Family
25
+
26
+ In the following table, we provide an overview of the InternViT 2.5 series.
27
+
28
+ | Model Name | HF Link |
29
+ | :-----------------------: | :-------------------------------------------------------------------: |
30
+ | InternViT-300M-448px-V2_5 | [🤗 link](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) |
31
+ | InternViT-6B-448px-V2_5 | [🤗 link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) |
32
+
33
+ ## Model Architecture
34
+
35
+ As shown in the following figure, InternVL 2.5 retains the same model architecture as its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 2.5 and Qwen 2.5, using a randomly initialized MLP projector.
36
+
37
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/BiiyXN6NOk0p-3rl3ueyL.png)
38
+
39
+ As in the previous version, we apply a pixel unshuffle operation, reducing the number of visual tokens to one-quarter of the original. We also adopt a dynamic resolution strategy similar to that of InternVL 1.5, dividing images into tiles of 448×448 pixels. The key difference, starting from InternVL 2.0, is added support for multi-image and video data.
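
The token arithmetic described above can be sketched as follows. This is only an illustration: the 448-pixel tile size and the one-quarter reduction come from the text, the patch size of 14 comes from `config.json` in this commit, and the function names and the 2×2 merge factor (`unshuffle_factor=2`, which is one way to arrive at one-quarter) are assumptions.

```python
def visual_tokens_per_tile(tile_size: int = 448,
                           patch_size: int = 14,
                           unshuffle_factor: int = 2) -> int:
    """Count the visual tokens one tile yields after pixel unshuffle.

    unshuffle_factor=2 is an assumption: merging each 2x2 group of
    neighbouring patch tokens leaves one quarter of the tokens.
    """
    if tile_size % patch_size != 0:
        raise ValueError("tile_size must be a multiple of patch_size")
    patches_per_side = tile_size // patch_size          # 448 // 14 = 32
    if patches_per_side % unshuffle_factor != 0:
        raise ValueError("patch grid must be divisible by unshuffle_factor")
    merged_per_side = patches_per_side // unshuffle_factor  # 32 // 2 = 16
    return merged_per_side * merged_per_side


def visual_tokens_for_tiles(num_tiles: int) -> int:
    """Total visual tokens for an input split into num_tiles tiles."""
    return num_tiles * visual_tokens_per_tile()
```

Under these assumptions a single 448×448 tile produces 1024 patch tokens before the unshuffle and 256 after it.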
40
+
41
+ ## Training Strategy
42
+
43
+ ### Dynamic High-Resolution for Multimodal Data
44
+
45
+ In InternVL 2.0 and 2.5, we extend the dynamic high-resolution training approach, enhancing its capabilities to handle multi-image and video datasets.
46
+
47
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/xoMY6rwRrNxbAGYPNyU8g.png)
48
+
49
+ - For single-image datasets, all `n_max` tiles are allocated to a single image for maximum resolution. Visual tokens are enclosed in `<img>` and `</img>` tags.
50
+
51
+ - For multi-image datasets, the `n_max` tiles are distributed across all images in a sample. Each image is labeled with auxiliary tags like `Image-1` and enclosed in `<img>` and `</img>` tags.
52
+
53
+ - For videos, each frame is resized to 448×448. Frames are labeled with tags like `Frame-1` and enclosed in `<img>` and `</img>` tags, similar to images.
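
As a rough illustration of the tagging scheme in the list above, the sketch below wraps a placeholder for the visual tokens in `<img>` and `</img>` and adds `Image-N` or `Frame-N` labels. The exact prompt layout is not specified here, so the `label: <img>...</img>` format, the `<TOKENS>` placeholder, and the function name `wrap_visual_items` are all assumptions made for illustration.

```python
from typing import List


def wrap_visual_items(kind: str, num_items: int,
                      placeholder: str = "<TOKENS>") -> List[str]:
    """Return one tagged string per image or frame.

    kind is "single", "multi", or "video". The separator between the
    label and the <img> tag is an assumption, not a documented format.
    """
    if kind not in {"single", "multi", "video"}:
        raise ValueError(f"unknown kind: {kind}")
    if kind == "single":
        if num_items != 1:
            raise ValueError("a single-image sample has exactly one image")
        # Single images carry no auxiliary label, only the <img> tags.
        return [f"<img>{placeholder}</img>"]
    prefix = "Image" if kind == "multi" else "Frame"
    return [f"{prefix}-{i}: <img>{placeholder}</img>"
            for i in range(1, num_items + 1)]
```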
54
+
55
+ ### Single Model Training Pipeline
56
+
57
+ The training pipeline for a single model in InternVL 2.5 is structured across three stages, designed to enhance the model's visual perception and multimodal capabilities.
58
+
59
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/5NduZeCPLgPJTFr0RGTq3.png)
60
+
61
+ - **Stage 1: MLP Warmup.** In this stage, only the MLP projector is trained while the vision encoder and language model are frozen. A dynamic high-resolution training strategy is applied for better performance, despite increased cost. This phase ensures robust cross-modal alignment and prepares the model for stable multimodal training.
62
+
63
+ - **Stage 1.5: ViT Incremental Learning (Optional).** This stage allows incremental training of the vision encoder and MLP projector using the same data as Stage 1. It enhances the encoder’s ability to handle rare domains like multilingual OCR and mathematical charts. Once trained, the encoder can be reused across LLMs without retraining, making this stage optional unless new domains are introduced.
64
+
65
+ - **Stage 2: Full Model Instruction Tuning.** The entire model is trained on high-quality multimodal instruction datasets. Strict data quality controls are enforced to prevent degradation of the LLM, as noisy data can cause issues like repetitive or incorrect outputs. After this stage, the training process is complete.
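
The three stages above differ mainly in which components receive gradient updates. A minimal sketch of that schedule, assuming the component names `vit`, `mlp`, and `llm` (hypothetical labels, not identifiers from the InternVL codebase) and reading Stage 1.5 as leaving the language model frozen:

```python
# Which components are trained in each stage, as described in the list above.
TRAINABLE_BY_STAGE = {
    "stage1": {"mlp"},                   # MLP warmup: ViT and LLM frozen
    "stage1.5": {"vit", "mlp"},          # ViT incremental learning (optional)
    "stage2": {"vit", "mlp", "llm"},     # full model instruction tuning
}

ALL_COMPONENTS = {"vit", "mlp", "llm"}


def is_frozen(stage: str, component: str) -> bool:
    """Return True if the component is not updated during the given stage."""
    if component not in ALL_COMPONENTS:
        raise ValueError(f"unknown component: {component}")
    return component not in TRAINABLE_BY_STAGE[stage]
```

In a real training script this kind of table would typically drive which parameters have `requires_grad` enabled.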
66
+
67
+ ## Evaluation on Vision Capability
68
 
69
+ We present a comprehensive evaluation of the vision encoder’s performance across various domains and tasks. The evaluation is divided into two key categories: (1) image classification, representing global-view semantic quality, and (2) semantic segmentation, capturing local-view semantic quality. This approach allows us to assess the representation quality of InternViT across its successive version updates. Please refer to our technical report for more details.
70
+
71
+ ## Image Classification
72
+
73
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/0Zx1JWB-2kHEfLbboiVy1.png)
74
+
75
+ **Image classification performance across different versions of InternViT.** We use IN-1K for training and evaluate on the IN-1K validation set as well as multiple ImageNet variants, including IN-ReaL, IN-V2, IN-A, IN-R, and IN-Sketch. Results are reported for both linear probing and attention pooling probing methods, with average accuracy for each method. ∆ represents the performance gap between attention pooling probing and linear probing, where a larger ∆ suggests a shift from learning simple linear features to capturing more complex, nonlinear semantic representations.
76
+
77
+ ## Semantic Segmentation Performance
78
+
79
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/XjJx5WSIXjsaQGLPCsQuP.png)
80
+
81
+ **Semantic segmentation performance across different versions of InternViT.** The models are evaluated on ADE20K and COCO-Stuff-164K using three configurations: linear probing, head tuning, and full tuning. The table shows the mIoU scores for each configuration and their averages. ∆1 represents the gap between head tuning and linear probing, while ∆2 shows the gap between full tuning and linear probing. A larger ∆ value indicates a shift from simple linear features to more complex, nonlinear representations.
82
+
83
+ ## Quick Start
84
+
85
+ > \[!Warning\]
86
+ > 🚨 Note: In our experience, the InternViT V2.5 series is better suited for building MLLMs than traditional computer vision tasks.
87
 
88
  ```python
89
  import torch
 
106
  outputs = model(pixel_values)
107
  ```
108
 
109
+ ## License
110
+
111
+ This project is released under the MIT License.
112
+
113
  ## Citation
114
 
115
  If you find this project useful in your research, please consider citing:
116
 
117
  ```BibTeX
118
+ @article{gao2024mini,
119
+ title={Mini-internvl: A flexible-transfer pocket multimodal model with 5\% parameters and 90\% performance},
120
+ author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others},
121
+ journal={arXiv preprint arXiv:2410.16261},
122
+ year={2024}
 
123
  }
124
  @article{chen2024far,
125
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
 
127
  journal={arXiv preprint arXiv:2404.16821},
128
  year={2024}
129
  }
130
+ @article{chen2023internvl,
131
+ title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
132
+ author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
133
+ journal={arXiv preprint arXiv:2312.14238},
134
+ year={2023}
135
+ }
136
  ```
config.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "_name_or_path": "OpenGVLab/InternViT-6B-448px-V2_5",
3
+ "architectures": [
4
+ "InternVisionModel"
5
+ ],
6
+ "attention_dropout": 0.0,
7
+ "auto_map": {
8
+ "AutoConfig": "configuration_intern_vit.InternVisionConfig",
9
+ "AutoModel": "modeling_intern_vit.InternVisionModel"
10
+ },
11
+ "drop_path_rate": 0.0,
12
+ "dropout": 0.0,
13
+ "hidden_act": "gelu",
14
+ "hidden_size": 3200,
15
+ "image_size": 448,
16
+ "initializer_factor": 0.1,
17
+ "initializer_range": 1e-10,
18
+ "intermediate_size": 12800,
19
+ "layer_norm_eps": 1e-06,
20
+ "model_type": "intern_vit_6b",
21
+ "num_attention_heads": 25,
22
+ "num_channels": 3,
23
+ "num_hidden_layers": 45,
24
+ "patch_size": 14,
25
+ "qk_normalization": true,
26
+ "qkv_bias": false,
27
+ "torch_dtype": "bfloat16",
28
+ "transformers_version": "4.37.2",
29
+ "use_bfloat16": true,
30
+ "use_flash_attn": true
31
+ }
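
The values in this config are enough for a back-of-the-envelope parameter count. The sketch below is an estimate, not the model's own accounting: the tensor shapes are inferred from the parameter names in `model.safetensors.index.json` (for example, `qkv` without bias, weight-only `norm1`/`norm2`, and `q_norm`/`k_norm` spanning the full hidden size), and the position embedding is assumed to cover one class token plus one token per 14×14 patch.

```python
def estimate_param_count(hidden_size: int = 3200,
                         intermediate_size: int = 12800,
                         num_hidden_layers: int = 45,
                         image_size: int = 448,
                         patch_size: int = 14,
                         num_channels: int = 3) -> int:
    """Estimate the parameter count from config.json values (shapes assumed)."""
    h, inter = hidden_size, intermediate_size
    attn = (3 * h * h            # attn.qkv.weight (qkv_bias is false)
            + h * h + h          # attn.proj.weight and attn.proj.bias
            + 2 * h)             # attn.q_norm.weight and attn.k_norm.weight
    mlp = (h * inter + inter     # mlp.fc1 weight and bias
           + inter * h + h)      # mlp.fc2 weight and bias
    norms_and_scales = 2 * h + 2 * h   # norm1/norm2 weights, ls1/ls2
    per_layer = attn + mlp + norms_and_scales

    num_patches = (image_size // patch_size) ** 2
    embeddings = (h                                         # class_embedding
                  + h * num_channels * patch_size ** 2 + h  # patch_embedding
                  + (num_patches + 1) * h)                  # position_embedding
    return num_hidden_layers * per_layer + embeddings


params = estimate_param_count()
head_dim = 3200 // 25          # hidden_size / num_attention_heads = 128
bytes_bf16 = params * 2        # bfloat16 stores 2 bytes per parameter
```

With the defaults above the estimate is 5,536,496,000 parameters, and at 2 bytes each that is 11,072,992,000 bytes, the same figure as `total_size` in the index file.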
configuration_intern_vit.py ADDED
@@ -0,0 +1,118 @@
1
+ # --------------------------------------------------------
2
+ # InternVL
3
+ # Copyright (c) 2023 OpenGVLab
4
+ # Licensed under The MIT License [see LICENSE for details]
5
+ # --------------------------------------------------------
6
+
7
+ import os
8
+ from typing import Union
9
+
10
+ from transformers.configuration_utils import PretrainedConfig
11
+ from transformers.utils import logging
12
+
13
+ logger = logging.get_logger(__name__)
14
+
15
+
16
+ class InternVisionConfig(PretrainedConfig):
17
+ r"""
18
+ This is the configuration class to store the configuration of an [`InternVisionModel`]. It is used to
19
+ instantiate a vision encoder according to the specified arguments, defining the model architecture.
20
+
21
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
22
+ documentation from [`PretrainedConfig`] for more information.
23
+
24
+ Args:
25
+ num_channels (`int`, *optional*, defaults to 3):
26
+ Number of color channels in the input images (e.g., 3 for RGB).
27
+ patch_size (`int`, *optional*, defaults to 14):
28
+ The size (resolution) of each patch.
29
+ image_size (`int`, *optional*, defaults to 224):
30
+ The size (resolution) of each image.
31
+ qkv_bias (`bool`, *optional*, defaults to `False`):
32
+ Whether to add a bias to the queries and values in the self-attention layers.
33
+ hidden_size (`int`, *optional*, defaults to 3200):
34
+ Dimensionality of the encoder layers and the pooler layer.
35
+ num_attention_heads (`int`, *optional*, defaults to 25):
36
+ Number of attention heads for each attention layer in the Transformer encoder.
37
+ intermediate_size (`int`, *optional*, defaults to 12800):
38
+ Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
39
+ qk_normalization (`bool`, *optional*, defaults to `True`):
40
+ Whether to normalize the queries and keys in the self-attention layers.
41
+ num_hidden_layers (`int`, *optional*, defaults to 48):
42
+ Number of hidden layers in the Transformer encoder.
43
+ use_flash_attn (`bool`, *optional*, defaults to `True`):
44
+ Whether to use flash attention mechanism.
45
+ hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
46
+ The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
47
+ `"relu"`, `"selu"` and `"gelu_new"` are supported.
48
+ layer_norm_eps (`float`, *optional*, defaults to 1e-6):
49
+ The epsilon used by the layer normalization layers.
50
+ dropout (`float`, *optional*, defaults to 0.0):
51
+ The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
52
+ drop_path_rate (`float`, *optional*, defaults to 0.0):
53
+ Dropout rate for stochastic depth.
54
+ attention_dropout (`float`, *optional*, defaults to 0.0):
55
+ The dropout ratio for the attention probabilities.
56
+ initializer_range (`float`, *optional*, defaults to 0.02):
57
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
58
+ initializer_factor (`float`, *optional*, defaults to 0.1):
59
+ A factor for layer scale.
60
+ """
61
+
62
+ model_type = 'intern_vit_6b'
63
+
64
+ def __init__(
65
+ self,
66
+ num_channels=3,
67
+ patch_size=14,
68
+ image_size=224,
69
+ qkv_bias=False,
70
+ hidden_size=3200,
71
+ num_attention_heads=25,
72
+ intermediate_size=12800,
73
+ qk_normalization=True,
74
+ num_hidden_layers=48,
75
+ use_flash_attn=True,
76
+ hidden_act='gelu',
77
+ layer_norm_eps=1e-6,
78
+ dropout=0.0,
79
+ drop_path_rate=0.0,
80
+ attention_dropout=0.0,
81
+ initializer_range=0.02,
82
+ initializer_factor=0.1,
83
+ **kwargs,
84
+ ):
85
+ super().__init__(**kwargs)
86
+
87
+ self.hidden_size = hidden_size
88
+ self.intermediate_size = intermediate_size
89
+ self.dropout = dropout
90
+ self.drop_path_rate = drop_path_rate
91
+ self.num_hidden_layers = num_hidden_layers
92
+ self.num_attention_heads = num_attention_heads
93
+ self.num_channels = num_channels
94
+ self.patch_size = patch_size
95
+ self.image_size = image_size
96
+ self.initializer_range = initializer_range
97
+ self.initializer_factor = initializer_factor
98
+ self.attention_dropout = attention_dropout
99
+ self.layer_norm_eps = layer_norm_eps
100
+ self.hidden_act = hidden_act
101
+ self.qkv_bias = qkv_bias
102
+ self.qk_normalization = qk_normalization
103
+ self.use_flash_attn = use_flash_attn
104
+
105
+ @classmethod
106
+ def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> 'PretrainedConfig':
107
+ config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
108
+
109
+ if 'vision_config' in config_dict:
110
+ config_dict = config_dict['vision_config']
111
+
112
+ if 'model_type' in config_dict and hasattr(cls, 'model_type') and config_dict['model_type'] != cls.model_type:
113
+ logger.warning(
114
+ f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
115
+ f'{cls.model_type}. This is not supported for all configurations of models and can yield errors.'
116
+ )
117
+
118
+ return cls.from_dict(config_dict, **kwargs)
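
The `from_pretrained` override above lets the same class load either a standalone vision config or a composite config that nests the vision settings under `vision_config`. The selection step can be sketched on plain dictionaries, without `transformers`; the function name `select_vision_config` and the sample dictionaries are hypothetical.

```python
from typing import Any, Dict


def select_vision_config(config_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Mirror the unwrapping step: prefer the nested vision_config if present."""
    if "vision_config" in config_dict:
        return config_dict["vision_config"]
    return config_dict


# A standalone vision config is returned unchanged.
standalone = {"model_type": "intern_vit_6b", "num_hidden_layers": 45}

# A composite config (hypothetical layout) is reduced to its vision part.
composite = {"model_type": "some_chat_model",
             "vision_config": {"model_type": "intern_vit_6b",
                               "num_hidden_layers": 45}}
```

Both `select_vision_config(standalone)` and `select_vision_config(composite)` yield the same vision settings.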
flash_attention.py ADDED
@@ -0,0 +1,75 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ from einops import rearrange
4
+
5
+ try: # v1
6
+ from flash_attn.flash_attn_interface import \
7
+ flash_attn_unpadded_qkvpacked_func
8
+ except ImportError: # v2
9
+ from flash_attn.flash_attn_interface import flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
10
+
11
+ from flash_attn.bert_padding import pad_input, unpad_input
12
+
13
+
14
+ class FlashAttention(nn.Module):
15
+ """Implement the scaled dot product attention with softmax.
16
+ Arguments
17
+ ---------
18
+ softmax_scale: The temperature to use for the softmax attention.
19
+ (default: 1/sqrt(d_keys) where d_keys is computed at
20
+ runtime)
21
+ attention_dropout: The dropout rate to apply to the attention
22
+ (default: 0.0)
23
+ """
24
+
25
+ def __init__(self, softmax_scale=None, attention_dropout=0.0, device=None, dtype=None):
26
+ super().__init__()
27
+ self.softmax_scale = softmax_scale
28
+ self.dropout_p = attention_dropout
29
+
30
+ def forward(self, qkv, key_padding_mask=None, causal=False, cu_seqlens=None,
31
+ max_s=None, need_weights=False):
32
+ """Implements the multihead softmax attention.
33
+ Arguments
34
+ ---------
35
+ qkv: The tensor containing the query, key, and value. (B, S, 3, H, D) if key_padding_mask is None
36
+ if unpadded: (nnz, 3, h, d)
37
+ key_padding_mask: a bool tensor of shape (B, S)
38
+ """
39
+ assert not need_weights
40
+ assert qkv.dtype in [torch.float16, torch.bfloat16]
41
+ assert qkv.is_cuda
42
+
43
+ if cu_seqlens is None:
44
+ batch_size = qkv.shape[0]
45
+ seqlen = qkv.shape[1]
46
+ if key_padding_mask is None:
47
+ qkv = rearrange(qkv, 'b s ... -> (b s) ...')
48
+ max_s = seqlen
49
+ cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, step=seqlen, dtype=torch.int32,
50
+ device=qkv.device)
51
+ output = flash_attn_unpadded_qkvpacked_func(
52
+ qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
53
+ softmax_scale=self.softmax_scale, causal=causal
54
+ )
55
+ output = rearrange(output, '(b s) ... -> b s ...', b=batch_size)
56
+ else:
57
+ nheads = qkv.shape[-2]
58
+ x = rearrange(qkv, 'b s three h d -> b s (three h d)')
59
+ x_unpad, indices, cu_seqlens, max_s = unpad_input(x, key_padding_mask)
60
+ x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads)
61
+ output_unpad = flash_attn_unpadded_qkvpacked_func(
62
+ x_unpad, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
63
+ softmax_scale=self.softmax_scale, causal=causal
64
+ )
65
+ output = rearrange(pad_input(rearrange(output_unpad, 'nnz h d -> nnz (h d)'),
66
+ indices, batch_size, seqlen),
67
+ 'b s (h d) -> b s h d', h=nheads)
68
+ else:
69
+ assert max_s is not None
70
+ output = flash_attn_unpadded_qkvpacked_func(
71
+ qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
72
+ softmax_scale=self.softmax_scale, causal=causal
73
+ )
74
+
75
+ return output, None
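
When no padding mask is given, `forward` flattens the batch into one long sequence and builds `cu_seqlens` so the kernel knows where each sample starts and ends. A pure-Python sketch of the same boundaries follows; the name `cumulative_seqlens` is hypothetical, and a plain `range` stands in for the `torch.arange` call used in the module.

```python
from typing import List


def cumulative_seqlens(batch_size: int, seqlen: int) -> List[int]:
    """Start offsets of each sequence in the flattened batch, plus the end.

    Same values as torch.arange(0, (batch_size + 1) * seqlen, step=seqlen).
    """
    return list(range(0, (batch_size + 1) * seqlen, seqlen))


# Three sequences of length 4 flattened into 12 rows.
offsets = cumulative_seqlens(batch_size=3, seqlen=4)
```

The result always has `batch_size + 1` entries: sample `i` occupies rows `offsets[i]` up to, but not including, `offsets[i + 1]`.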
model-00001-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9818659d13d932da8bc0c3b8ee15f5b5d68d8c94d66eb525be566066630111da
3
+ size 4988565944
model-00002-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4f0c10e72d6f6513f421baa6ec843d5508657435059c1d18b6b5fd7789f9d5b7
3
+ size 4937250176
model-00003-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d21c4fe0bc4af1425cfae1a59a8f5fbb00fde9d8e2888325a60913ac61b0494d
3
+ size 1147238088
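
Each `.safetensors` entry above is a Git LFS pointer rather than the weights themselves: three `key value` lines giving the spec version, the SHA-256 object ID, and the size in bytes. A minimal parser sketch, assuming only those three keys are present (the function name `parse_lfs_pointer` is hypothetical):

```python
from typing import Dict, Union


def parse_lfs_pointer(text: str) -> Dict[str, Union[str, int]]:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields: Dict[str, str] = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    # The oid field has the form "<hash algorithm>:<hex digest>".
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:d21c4fe0bc4af1425cfae1a59a8f5fbb00fde9d8e2888325a60913ac61b0494d\n"
    "size 1147238088\n"
)

# Sizes of the three shards listed in this commit.
shard_sizes = [4988565944, 4937250176, 1147238088]
```

The three shard sizes sum to 11,073,054,208 bytes, which is 62,208 bytes more than the `total_size` of 11,072,992,000 recorded in the index; the difference is presumably per-file header overhead rather than tensor data.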
model.safetensors.index.json ADDED
@@ -0,0 +1,596 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 11072992000
4
+ },
5
+ "weight_map": {
6
+ "embeddings.class_embedding": "model-00001-of-00003.safetensors",
7
+ "embeddings.patch_embedding.bias": "model-00001-of-00003.safetensors",
8
+ "embeddings.patch_embedding.weight": "model-00001-of-00003.safetensors",
9
+ "embeddings.position_embedding": "model-00001-of-00003.safetensors",
+ "encoder.layers.0.attn.k_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.0.attn.proj.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.0.attn.proj.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.0.attn.q_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.0.attn.qkv.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.0.ls1": "model-00001-of-00003.safetensors",
+ "encoder.layers.0.ls2": "model-00001-of-00003.safetensors",
+ "encoder.layers.0.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.0.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.0.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.0.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.0.norm1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.0.norm2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.1.attn.k_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.1.attn.proj.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.1.attn.proj.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.1.attn.q_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.1.attn.qkv.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.1.ls1": "model-00001-of-00003.safetensors",
+ "encoder.layers.1.ls2": "model-00001-of-00003.safetensors",
+ "encoder.layers.1.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.1.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.1.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.1.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.1.norm1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.1.norm2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.10.attn.k_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.10.attn.proj.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.10.attn.proj.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.10.attn.q_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.10.attn.qkv.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.10.ls1": "model-00001-of-00003.safetensors",
+ "encoder.layers.10.ls2": "model-00001-of-00003.safetensors",
+ "encoder.layers.10.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.10.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.10.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.10.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.10.norm1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.10.norm2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.11.attn.k_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.11.attn.proj.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.11.attn.proj.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.11.attn.q_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.11.attn.qkv.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.11.ls1": "model-00001-of-00003.safetensors",
+ "encoder.layers.11.ls2": "model-00001-of-00003.safetensors",
+ "encoder.layers.11.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.11.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.11.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.11.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.11.norm1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.11.norm2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.12.attn.k_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.12.attn.proj.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.12.attn.proj.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.12.attn.q_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.12.attn.qkv.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.12.ls1": "model-00001-of-00003.safetensors",
+ "encoder.layers.12.ls2": "model-00001-of-00003.safetensors",
+ "encoder.layers.12.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.12.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.12.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.12.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.12.norm1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.12.norm2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.13.attn.k_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.13.attn.proj.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.13.attn.proj.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.13.attn.q_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.13.attn.qkv.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.13.ls1": "model-00001-of-00003.safetensors",
+ "encoder.layers.13.ls2": "model-00001-of-00003.safetensors",
+ "encoder.layers.13.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.13.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.13.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.13.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.13.norm1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.13.norm2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.14.attn.k_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.14.attn.proj.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.14.attn.proj.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.14.attn.q_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.14.attn.qkv.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.14.ls1": "model-00001-of-00003.safetensors",
+ "encoder.layers.14.ls2": "model-00001-of-00003.safetensors",
+ "encoder.layers.14.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.14.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.14.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.14.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.14.norm1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.14.norm2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.15.attn.k_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.15.attn.proj.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.15.attn.proj.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.15.attn.q_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.15.attn.qkv.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.15.ls1": "model-00001-of-00003.safetensors",
+ "encoder.layers.15.ls2": "model-00001-of-00003.safetensors",
+ "encoder.layers.15.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.15.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.15.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.15.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.15.norm1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.15.norm2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.16.attn.k_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.16.attn.proj.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.16.attn.proj.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.16.attn.q_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.16.attn.qkv.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.16.ls1": "model-00001-of-00003.safetensors",
+ "encoder.layers.16.ls2": "model-00001-of-00003.safetensors",
+ "encoder.layers.16.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.16.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.16.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.16.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.16.norm1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.16.norm2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.17.attn.k_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.17.attn.proj.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.17.attn.proj.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.17.attn.q_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.17.attn.qkv.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.17.ls1": "model-00001-of-00003.safetensors",
+ "encoder.layers.17.ls2": "model-00001-of-00003.safetensors",
+ "encoder.layers.17.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.17.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.17.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.17.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.17.norm1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.17.norm2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.18.attn.k_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.18.attn.proj.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.18.attn.proj.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.18.attn.q_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.18.attn.qkv.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.18.ls1": "model-00001-of-00003.safetensors",
+ "encoder.layers.18.ls2": "model-00001-of-00003.safetensors",
+ "encoder.layers.18.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.18.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.18.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.18.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.18.norm1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.18.norm2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.19.attn.k_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.19.attn.proj.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.19.attn.proj.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.19.attn.q_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.19.attn.qkv.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.19.ls1": "model-00001-of-00003.safetensors",
+ "encoder.layers.19.ls2": "model-00001-of-00003.safetensors",
+ "encoder.layers.19.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.19.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.19.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.19.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.19.norm1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.19.norm2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.2.attn.k_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.2.attn.proj.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.2.attn.proj.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.2.attn.q_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.2.attn.qkv.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.2.ls1": "model-00001-of-00003.safetensors",
+ "encoder.layers.2.ls2": "model-00001-of-00003.safetensors",
+ "encoder.layers.2.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.2.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.2.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.2.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.2.norm1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.2.norm2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.20.attn.k_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.20.attn.proj.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.20.attn.proj.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.20.attn.q_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.20.attn.qkv.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.20.ls1": "model-00001-of-00003.safetensors",
+ "encoder.layers.20.ls2": "model-00001-of-00003.safetensors",
+ "encoder.layers.20.mlp.fc1.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.20.mlp.fc1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.20.mlp.fc2.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.20.mlp.fc2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.20.norm1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.20.norm2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.21.attn.k_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.21.attn.proj.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.21.attn.proj.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.21.attn.q_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.21.attn.qkv.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.21.ls1": "model-00002-of-00003.safetensors",
+ "encoder.layers.21.ls2": "model-00002-of-00003.safetensors",
+ "encoder.layers.21.mlp.fc1.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.21.mlp.fc1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.21.mlp.fc2.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.21.mlp.fc2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.21.norm1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.21.norm2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.22.attn.k_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.22.attn.proj.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.22.attn.proj.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.22.attn.q_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.22.attn.qkv.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.22.ls1": "model-00002-of-00003.safetensors",
+ "encoder.layers.22.ls2": "model-00002-of-00003.safetensors",
+ "encoder.layers.22.mlp.fc1.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.22.mlp.fc1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.22.mlp.fc2.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.22.mlp.fc2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.22.norm1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.22.norm2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.23.attn.k_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.23.attn.proj.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.23.attn.proj.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.23.attn.q_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.23.attn.qkv.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.23.ls1": "model-00002-of-00003.safetensors",
+ "encoder.layers.23.ls2": "model-00002-of-00003.safetensors",
+ "encoder.layers.23.mlp.fc1.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.23.mlp.fc1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.23.mlp.fc2.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.23.mlp.fc2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.23.norm1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.23.norm2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.24.attn.k_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.24.attn.proj.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.24.attn.proj.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.24.attn.q_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.24.attn.qkv.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.24.ls1": "model-00002-of-00003.safetensors",
+ "encoder.layers.24.ls2": "model-00002-of-00003.safetensors",
+ "encoder.layers.24.mlp.fc1.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.24.mlp.fc1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.24.mlp.fc2.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.24.mlp.fc2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.24.norm1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.24.norm2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.25.attn.k_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.25.attn.proj.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.25.attn.proj.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.25.attn.q_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.25.attn.qkv.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.25.ls1": "model-00002-of-00003.safetensors",
+ "encoder.layers.25.ls2": "model-00002-of-00003.safetensors",
+ "encoder.layers.25.mlp.fc1.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.25.mlp.fc1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.25.mlp.fc2.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.25.mlp.fc2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.25.norm1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.25.norm2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.26.attn.k_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.26.attn.proj.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.26.attn.proj.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.26.attn.q_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.26.attn.qkv.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.26.ls1": "model-00002-of-00003.safetensors",
+ "encoder.layers.26.ls2": "model-00002-of-00003.safetensors",
+ "encoder.layers.26.mlp.fc1.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.26.mlp.fc1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.26.mlp.fc2.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.26.mlp.fc2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.26.norm1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.26.norm2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.27.attn.k_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.27.attn.proj.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.27.attn.proj.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.27.attn.q_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.27.attn.qkv.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.27.ls1": "model-00002-of-00003.safetensors",
+ "encoder.layers.27.ls2": "model-00002-of-00003.safetensors",
+ "encoder.layers.27.mlp.fc1.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.27.mlp.fc1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.27.mlp.fc2.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.27.mlp.fc2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.27.norm1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.27.norm2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.28.attn.k_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.28.attn.proj.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.28.attn.proj.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.28.attn.q_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.28.attn.qkv.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.28.ls1": "model-00002-of-00003.safetensors",
+ "encoder.layers.28.ls2": "model-00002-of-00003.safetensors",
+ "encoder.layers.28.mlp.fc1.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.28.mlp.fc1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.28.mlp.fc2.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.28.mlp.fc2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.28.norm1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.28.norm2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.29.attn.k_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.29.attn.proj.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.29.attn.proj.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.29.attn.q_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.29.attn.qkv.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.29.ls1": "model-00002-of-00003.safetensors",
+ "encoder.layers.29.ls2": "model-00002-of-00003.safetensors",
+ "encoder.layers.29.mlp.fc1.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.29.mlp.fc1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.29.mlp.fc2.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.29.mlp.fc2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.29.norm1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.29.norm2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.3.attn.k_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.3.attn.proj.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.3.attn.proj.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.3.attn.q_norm.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.3.attn.qkv.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.3.ls1": "model-00001-of-00003.safetensors",
+ "encoder.layers.3.ls2": "model-00001-of-00003.safetensors",
+ "encoder.layers.3.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.3.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.3.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+ "encoder.layers.3.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.3.norm1.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.3.norm2.weight": "model-00001-of-00003.safetensors",
+ "encoder.layers.30.attn.k_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.30.attn.proj.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.30.attn.proj.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.30.attn.q_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.30.attn.qkv.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.30.ls1": "model-00002-of-00003.safetensors",
+ "encoder.layers.30.ls2": "model-00002-of-00003.safetensors",
+ "encoder.layers.30.mlp.fc1.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.30.mlp.fc1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.30.mlp.fc2.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.30.mlp.fc2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.30.norm1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.30.norm2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.31.attn.k_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.31.attn.proj.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.31.attn.proj.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.31.attn.q_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.31.attn.qkv.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.31.ls1": "model-00002-of-00003.safetensors",
+ "encoder.layers.31.ls2": "model-00002-of-00003.safetensors",
+ "encoder.layers.31.mlp.fc1.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.31.mlp.fc1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.31.mlp.fc2.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.31.mlp.fc2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.31.norm1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.31.norm2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.32.attn.k_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.32.attn.proj.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.32.attn.proj.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.32.attn.q_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.32.attn.qkv.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.32.ls1": "model-00002-of-00003.safetensors",
+ "encoder.layers.32.ls2": "model-00002-of-00003.safetensors",
+ "encoder.layers.32.mlp.fc1.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.32.mlp.fc1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.32.mlp.fc2.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.32.mlp.fc2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.32.norm1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.32.norm2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.33.attn.k_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.33.attn.proj.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.33.attn.proj.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.33.attn.q_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.33.attn.qkv.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.33.ls1": "model-00002-of-00003.safetensors",
+ "encoder.layers.33.ls2": "model-00002-of-00003.safetensors",
+ "encoder.layers.33.mlp.fc1.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.33.mlp.fc1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.33.mlp.fc2.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.33.mlp.fc2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.33.norm1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.33.norm2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.34.attn.k_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.34.attn.proj.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.34.attn.proj.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.34.attn.q_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.34.attn.qkv.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.34.ls1": "model-00002-of-00003.safetensors",
+ "encoder.layers.34.ls2": "model-00002-of-00003.safetensors",
+ "encoder.layers.34.mlp.fc1.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.34.mlp.fc1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.34.mlp.fc2.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.34.mlp.fc2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.34.norm1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.34.norm2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.35.attn.k_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.35.attn.proj.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.35.attn.proj.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.35.attn.q_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.35.attn.qkv.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.35.ls1": "model-00002-of-00003.safetensors",
+ "encoder.layers.35.ls2": "model-00002-of-00003.safetensors",
+ "encoder.layers.35.mlp.fc1.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.35.mlp.fc1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.35.mlp.fc2.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.35.mlp.fc2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.35.norm1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.35.norm2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.36.attn.k_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.36.attn.proj.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.36.attn.proj.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.36.attn.q_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.36.attn.qkv.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.36.ls1": "model-00002-of-00003.safetensors",
+ "encoder.layers.36.ls2": "model-00002-of-00003.safetensors",
+ "encoder.layers.36.mlp.fc1.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.36.mlp.fc1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.36.mlp.fc2.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.36.mlp.fc2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.36.norm1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.36.norm2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.37.attn.k_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.37.attn.proj.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.37.attn.proj.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.37.attn.q_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.37.attn.qkv.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.37.ls1": "model-00002-of-00003.safetensors",
+ "encoder.layers.37.ls2": "model-00002-of-00003.safetensors",
+ "encoder.layers.37.mlp.fc1.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.37.mlp.fc1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.37.mlp.fc2.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.37.mlp.fc2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.37.norm1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.37.norm2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.38.attn.k_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.38.attn.proj.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.38.attn.proj.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.38.attn.q_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.38.attn.qkv.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.38.ls1": "model-00002-of-00003.safetensors",
+ "encoder.layers.38.ls2": "model-00002-of-00003.safetensors",
+ "encoder.layers.38.mlp.fc1.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.38.mlp.fc1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.38.mlp.fc2.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.38.mlp.fc2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.38.norm1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.38.norm2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.39.attn.k_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.39.attn.proj.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.39.attn.proj.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.39.attn.q_norm.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.39.attn.qkv.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.39.ls1": "model-00002-of-00003.safetensors",
+ "encoder.layers.39.ls2": "model-00002-of-00003.safetensors",
+ "encoder.layers.39.mlp.fc1.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.39.mlp.fc1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.39.mlp.fc2.bias": "model-00002-of-00003.safetensors",
+ "encoder.layers.39.mlp.fc2.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.39.norm1.weight": "model-00002-of-00003.safetensors",
+ "encoder.layers.39.norm2.weight": "model-00002-of-00003.safetensors",
452
+ "encoder.layers.4.attn.k_norm.weight": "model-00001-of-00003.safetensors",
453
+ "encoder.layers.4.attn.proj.bias": "model-00001-of-00003.safetensors",
454
+ "encoder.layers.4.attn.proj.weight": "model-00001-of-00003.safetensors",
455
+ "encoder.layers.4.attn.q_norm.weight": "model-00001-of-00003.safetensors",
456
+ "encoder.layers.4.attn.qkv.weight": "model-00001-of-00003.safetensors",
457
+ "encoder.layers.4.ls1": "model-00001-of-00003.safetensors",
458
+ "encoder.layers.4.ls2": "model-00001-of-00003.safetensors",
459
+ "encoder.layers.4.mlp.fc1.bias": "model-00001-of-00003.safetensors",
460
+ "encoder.layers.4.mlp.fc1.weight": "model-00001-of-00003.safetensors",
461
+ "encoder.layers.4.mlp.fc2.bias": "model-00001-of-00003.safetensors",
462
+ "encoder.layers.4.mlp.fc2.weight": "model-00001-of-00003.safetensors",
463
+ "encoder.layers.4.norm1.weight": "model-00001-of-00003.safetensors",
464
+ "encoder.layers.4.norm2.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.40.attn.k_norm.weight": "model-00002-of-00003.safetensors",
+     "encoder.layers.40.attn.proj.bias": "model-00002-of-00003.safetensors",
+     "encoder.layers.40.attn.proj.weight": "model-00002-of-00003.safetensors",
+     "encoder.layers.40.attn.q_norm.weight": "model-00002-of-00003.safetensors",
+     "encoder.layers.40.attn.qkv.weight": "model-00002-of-00003.safetensors",
+     "encoder.layers.40.ls1": "model-00002-of-00003.safetensors",
+     "encoder.layers.40.ls2": "model-00002-of-00003.safetensors",
+     "encoder.layers.40.mlp.fc1.bias": "model-00003-of-00003.safetensors",
+     "encoder.layers.40.mlp.fc1.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.40.mlp.fc2.bias": "model-00003-of-00003.safetensors",
+     "encoder.layers.40.mlp.fc2.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.40.norm1.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.40.norm2.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.41.attn.k_norm.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.41.attn.proj.bias": "model-00003-of-00003.safetensors",
+     "encoder.layers.41.attn.proj.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.41.attn.q_norm.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.41.attn.qkv.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.41.ls1": "model-00003-of-00003.safetensors",
+     "encoder.layers.41.ls2": "model-00003-of-00003.safetensors",
+     "encoder.layers.41.mlp.fc1.bias": "model-00003-of-00003.safetensors",
+     "encoder.layers.41.mlp.fc1.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.41.mlp.fc2.bias": "model-00003-of-00003.safetensors",
+     "encoder.layers.41.mlp.fc2.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.41.norm1.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.41.norm2.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.42.attn.k_norm.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.42.attn.proj.bias": "model-00003-of-00003.safetensors",
+     "encoder.layers.42.attn.proj.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.42.attn.q_norm.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.42.attn.qkv.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.42.ls1": "model-00003-of-00003.safetensors",
+     "encoder.layers.42.ls2": "model-00003-of-00003.safetensors",
+     "encoder.layers.42.mlp.fc1.bias": "model-00003-of-00003.safetensors",
+     "encoder.layers.42.mlp.fc1.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.42.mlp.fc2.bias": "model-00003-of-00003.safetensors",
+     "encoder.layers.42.mlp.fc2.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.42.norm1.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.42.norm2.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.43.attn.k_norm.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.43.attn.proj.bias": "model-00003-of-00003.safetensors",
+     "encoder.layers.43.attn.proj.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.43.attn.q_norm.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.43.attn.qkv.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.43.ls1": "model-00003-of-00003.safetensors",
+     "encoder.layers.43.ls2": "model-00003-of-00003.safetensors",
+     "encoder.layers.43.mlp.fc1.bias": "model-00003-of-00003.safetensors",
+     "encoder.layers.43.mlp.fc1.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.43.mlp.fc2.bias": "model-00003-of-00003.safetensors",
+     "encoder.layers.43.mlp.fc2.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.43.norm1.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.43.norm2.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.44.attn.k_norm.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.44.attn.proj.bias": "model-00003-of-00003.safetensors",
+     "encoder.layers.44.attn.proj.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.44.attn.q_norm.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.44.attn.qkv.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.44.ls1": "model-00003-of-00003.safetensors",
+     "encoder.layers.44.ls2": "model-00003-of-00003.safetensors",
+     "encoder.layers.44.mlp.fc1.bias": "model-00003-of-00003.safetensors",
+     "encoder.layers.44.mlp.fc1.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.44.mlp.fc2.bias": "model-00003-of-00003.safetensors",
+     "encoder.layers.44.mlp.fc2.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.44.norm1.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.44.norm2.weight": "model-00003-of-00003.safetensors",
+     "encoder.layers.5.attn.k_norm.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.5.attn.proj.bias": "model-00001-of-00003.safetensors",
+     "encoder.layers.5.attn.proj.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.5.attn.q_norm.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.5.attn.qkv.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.5.ls1": "model-00001-of-00003.safetensors",
+     "encoder.layers.5.ls2": "model-00001-of-00003.safetensors",
+     "encoder.layers.5.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+     "encoder.layers.5.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.5.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+     "encoder.layers.5.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.5.norm1.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.5.norm2.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.6.attn.k_norm.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.6.attn.proj.bias": "model-00001-of-00003.safetensors",
+     "encoder.layers.6.attn.proj.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.6.attn.q_norm.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.6.attn.qkv.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.6.ls1": "model-00001-of-00003.safetensors",
+     "encoder.layers.6.ls2": "model-00001-of-00003.safetensors",
+     "encoder.layers.6.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+     "encoder.layers.6.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.6.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+     "encoder.layers.6.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.6.norm1.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.6.norm2.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.7.attn.k_norm.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.7.attn.proj.bias": "model-00001-of-00003.safetensors",
+     "encoder.layers.7.attn.proj.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.7.attn.q_norm.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.7.attn.qkv.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.7.ls1": "model-00001-of-00003.safetensors",
+     "encoder.layers.7.ls2": "model-00001-of-00003.safetensors",
+     "encoder.layers.7.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+     "encoder.layers.7.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.7.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+     "encoder.layers.7.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.7.norm1.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.7.norm2.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.8.attn.k_norm.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.8.attn.proj.bias": "model-00001-of-00003.safetensors",
+     "encoder.layers.8.attn.proj.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.8.attn.q_norm.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.8.attn.qkv.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.8.ls1": "model-00001-of-00003.safetensors",
+     "encoder.layers.8.ls2": "model-00001-of-00003.safetensors",
+     "encoder.layers.8.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+     "encoder.layers.8.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.8.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+     "encoder.layers.8.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.8.norm1.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.8.norm2.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.9.attn.k_norm.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.9.attn.proj.bias": "model-00001-of-00003.safetensors",
+     "encoder.layers.9.attn.proj.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.9.attn.q_norm.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.9.attn.qkv.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.9.ls1": "model-00001-of-00003.safetensors",
+     "encoder.layers.9.ls2": "model-00001-of-00003.safetensors",
+     "encoder.layers.9.mlp.fc1.bias": "model-00001-of-00003.safetensors",
+     "encoder.layers.9.mlp.fc1.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.9.mlp.fc2.bias": "model-00001-of-00003.safetensors",
+     "encoder.layers.9.mlp.fc2.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.9.norm1.weight": "model-00001-of-00003.safetensors",
+     "encoder.layers.9.norm2.weight": "model-00001-of-00003.safetensors"
+   }
+ }
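The weight map above assigns each parameter name to one of three shard files, and the split does not always fall on a layer boundary: layer 40 has its attention and `ls` tensors in shard 2 but its MLP and norm tensors in shard 3. A minimal sketch of how one might detect such layers, using a few entries copied from the map; the helper names `shards_per_layer` and `split_layers` are hypothetical and not part of the repository:

```python
from collections import defaultdict

# A handful of entries copied from the weight map above.
weight_map = {
    "encoder.layers.39.norm2.weight": "model-00002-of-00003.safetensors",
    "encoder.layers.40.attn.qkv.weight": "model-00002-of-00003.safetensors",
    "encoder.layers.40.ls2": "model-00002-of-00003.safetensors",
    "encoder.layers.40.mlp.fc1.weight": "model-00003-of-00003.safetensors",
    "encoder.layers.40.norm1.weight": "model-00003-of-00003.safetensors",
    "encoder.layers.41.attn.qkv.weight": "model-00003-of-00003.safetensors",
}


def shards_per_layer(weight_map):
    """Return {layer_index: set of shard files holding that layer's tensors}."""
    layers = defaultdict(set)
    for name, shard in weight_map.items():
        parts = name.split(".")
        # Only names of the form encoder.layers.<idx>.<...> belong to a layer.
        if parts[:2] == ["encoder", "layers"]:
            layers[int(parts[2])].add(shard)
    return dict(layers)


def split_layers(weight_map):
    """Return the sorted indices of layers stored in more than one shard."""
    return sorted(i for i, s in shards_per_layer(weight_map).items() if len(s) > 1)


print(split_layers(weight_map))
```

A loader that places shards on different devices would need all shards listed for such a layer before it can materialize it.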
modeling_intern_vit.py ADDED
@@ -0,0 +1,344 @@
+ # --------------------------------------------------------
+ # InternVL
+ # Copyright (c) 2023 OpenGVLab
+ # Licensed under The MIT License [see LICENSE for details]
+ # --------------------------------------------------------
+
+ from typing import Optional, Tuple, Union
+
+ import torch
+ import torch.nn.functional as F
+ import torch.utils.checkpoint
+ from einops import rearrange
+ from timm.models.layers import DropPath
+ from torch import nn
+ from transformers.activations import ACT2FN
+ from transformers.modeling_outputs import (BaseModelOutput,
+                                            BaseModelOutputWithPooling)
+ from transformers.modeling_utils import PreTrainedModel
+ from transformers.utils import logging
+
+ from .configuration_intern_vit import InternVisionConfig
+
+ try:
+     from .flash_attention import FlashAttention
+     has_flash_attn = True
+ except Exception:
+     print('FlashAttention is not installed.')
+     has_flash_attn = False
+
+
+ logger = logging.get_logger(__name__)
+
+
+ class InternRMSNorm(nn.Module):
+     def __init__(self, hidden_size, eps=1e-6):
+         super().__init__()
+         self.weight = nn.Parameter(torch.ones(hidden_size))
+         self.variance_epsilon = eps
+
+     def forward(self, hidden_states):
+         input_dtype = hidden_states.dtype
+         hidden_states = hidden_states.to(torch.float32)
+         variance = hidden_states.pow(2).mean(-1, keepdim=True)
+         hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+         return self.weight * hidden_states.to(input_dtype)
+
+
+ try:
+     from apex.normalization import FusedRMSNorm
+
+     InternRMSNorm = FusedRMSNorm  # noqa
+
+     logger.info('Discovered apex.normalization.FusedRMSNorm - will use it instead of InternRMSNorm')
+ except ImportError:
+     # using the normal InternRMSNorm
+     pass
+ except Exception:
+     logger.warning('discovered apex but it failed to load, falling back to InternRMSNorm')
+     pass
+
+
+ class InternVisionEmbeddings(nn.Module):
+     def __init__(self, config: InternVisionConfig):
+         super().__init__()
+         self.config = config
+         self.embed_dim = config.hidden_size
+         self.image_size = config.image_size
+         self.patch_size = config.patch_size
+
+         self.class_embedding = nn.Parameter(
+             torch.randn(1, 1, self.embed_dim),
+         )
+
+         self.patch_embedding = nn.Conv2d(
+             in_channels=3, out_channels=self.embed_dim, kernel_size=self.patch_size, stride=self.patch_size
+         )
+
+         self.num_patches = (self.image_size // self.patch_size) ** 2
+         self.num_positions = self.num_patches + 1
+
+         self.position_embedding = nn.Parameter(torch.randn(1, self.num_positions, self.embed_dim))
+
+     def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
+         batch_size = pixel_values.shape[0]
+         target_dtype = self.patch_embedding.weight.dtype
+         patch_embeds = self.patch_embedding(pixel_values)  # shape = [*, width, grid, grid]
+         patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
+         class_embeds = self.class_embedding.expand(batch_size, 1, -1).to(target_dtype)
+         embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
+         embeddings = embeddings + self.position_embedding.to(target_dtype)
+         return embeddings
+
+
+ class InternAttention(nn.Module):
+     """Multi-headed attention from 'Attention Is All You Need' paper"""
+
+     def __init__(self, config: InternVisionConfig):
+         super().__init__()
+         self.config = config
+         self.embed_dim = config.hidden_size
+         self.num_heads = config.num_attention_heads
+         self.use_flash_attn = config.use_flash_attn and has_flash_attn
+         if config.use_flash_attn and not has_flash_attn:
+             print('Warning: Flash Attention is not available, use_flash_attn is set to False.')
+         self.head_dim = self.embed_dim // self.num_heads
+         if self.head_dim * self.num_heads != self.embed_dim:
+             raise ValueError(
+                 f'embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:'
+                 f' {self.num_heads}).'
+             )
+
+         self.scale = self.head_dim ** -0.5
+         self.qkv = nn.Linear(self.embed_dim, 3 * self.embed_dim, bias=config.qkv_bias)
+         self.attn_drop = nn.Dropout(config.attention_dropout)
+         self.proj_drop = nn.Dropout(config.dropout)
+
+         self.qk_normalization = config.qk_normalization
+
+         if self.qk_normalization:
+             self.q_norm = InternRMSNorm(self.embed_dim, eps=config.layer_norm_eps)
+             self.k_norm = InternRMSNorm(self.embed_dim, eps=config.layer_norm_eps)
+
+         if self.use_flash_attn:
+             self.inner_attn = FlashAttention(attention_dropout=config.attention_dropout)
+         self.proj = nn.Linear(self.embed_dim, self.embed_dim)
+
+     def _naive_attn(self, x):
+         B, N, C = x.shape
+         qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
+         q, k, v = qkv.unbind(0)  # make torchscript happy (cannot use tensor as tuple)
+
+         if self.qk_normalization:
+             B_, H_, N_, D_ = q.shape
+             q = self.q_norm(q.transpose(1, 2).flatten(-2, -1)).view(B_, N_, H_, D_).transpose(1, 2)
+             k = self.k_norm(k.transpose(1, 2).flatten(-2, -1)).view(B_, N_, H_, D_).transpose(1, 2)
+
+         attn = ((q * self.scale) @ k.transpose(-2, -1))
+         attn = attn.softmax(dim=-1)
+         attn = self.attn_drop(attn)
+
+         x = (attn @ v).transpose(1, 2).reshape(B, N, C)
+         x = self.proj(x)
+         x = self.proj_drop(x)
+         return x
+
+     def _flash_attn(self, x, key_padding_mask=None, need_weights=False):
+         qkv = self.qkv(x)
+         qkv = rearrange(qkv, 'b s (three h d) -> b s three h d', three=3, h=self.num_heads)
+
+         if self.qk_normalization:
+             q, k, v = qkv.unbind(2)
+             q = self.q_norm(q.flatten(-2, -1)).view(q.shape)
+             k = self.k_norm(k.flatten(-2, -1)).view(k.shape)
+             qkv = torch.stack([q, k, v], dim=2)
+
+         context, _ = self.inner_attn(
+             qkv, key_padding_mask=key_padding_mask, need_weights=need_weights, causal=False
+         )
+         outs = self.proj(rearrange(context, 'b s h d -> b s (h d)'))
+         outs = self.proj_drop(outs)
+         return outs
+
+     def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+         x = self._naive_attn(hidden_states) if not self.use_flash_attn else self._flash_attn(hidden_states)
+         return x
+
+
+ class InternMLP(nn.Module):
+     def __init__(self, config: InternVisionConfig):
+         super().__init__()
+         self.config = config
+         self.act = ACT2FN[config.hidden_act]
+         self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
+         self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)
+
+     def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+         hidden_states = self.fc1(hidden_states)
+         hidden_states = self.act(hidden_states)
+         hidden_states = self.fc2(hidden_states)
+         return hidden_states
+
+
+ class InternVisionEncoderLayer(nn.Module):
+     def __init__(self, config: InternVisionConfig, drop_path_rate: float):
+         super().__init__()
+         self.embed_dim = config.hidden_size
+         self.intermediate_size = config.intermediate_size
+
+         self.attn = InternAttention(config)
+         self.mlp = InternMLP(config)
+         self.norm1 = InternRMSNorm(self.embed_dim, eps=config.layer_norm_eps)
+         self.norm2 = InternRMSNorm(self.embed_dim, eps=config.layer_norm_eps)
+
+         self.ls1 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
+         self.ls2 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
+         self.drop_path1 = DropPath(drop_path_rate) if drop_path_rate > 0. else nn.Identity()
+         self.drop_path2 = DropPath(drop_path_rate) if drop_path_rate > 0. else nn.Identity()
+
+     def forward(
+             self,
+             hidden_states: torch.Tensor,
+     ) -> torch.Tensor:
+         """
+         Args:
+             hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
+         """
+         hidden_states = hidden_states + self.drop_path1(self.attn(self.norm1(hidden_states)) * self.ls1)
+
+         hidden_states = hidden_states + self.drop_path2(self.mlp(self.norm2(hidden_states)) * self.ls2)
+
+         return hidden_states
+
+
+ class InternVisionEncoder(nn.Module):
+     """
+     Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a
+     [`InternVisionEncoderLayer`].
+
+     Args:
+         config (`InternVisionConfig`):
+             The corresponding vision configuration for the `InternVisionEncoder`.
+     """
+
+     def __init__(self, config: InternVisionConfig):
+         super().__init__()
+         self.config = config
+         # stochastic depth decay rule
+         dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers)]
+         self.layers = nn.ModuleList([
+             InternVisionEncoderLayer(config, dpr[idx]) for idx in range(config.num_hidden_layers)])
+         self.gradient_checkpointing = True
+
+     def forward(
+             self,
+             inputs_embeds,
+             output_hidden_states: Optional[bool] = None,
+             return_dict: Optional[bool] = None,
+     ) -> Union[Tuple, BaseModelOutput]:
+         r"""
+         Args:
+             inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
+                 Embedded representation of the inputs. Should be float, not int tokens.
+             output_hidden_states (`bool`, *optional*):
+                 Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
+                 for more detail.
+             return_dict (`bool`, *optional*):
+                 Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+         """
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+         )
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         encoder_states = () if output_hidden_states else None
+         hidden_states = inputs_embeds
+
+         for idx, encoder_layer in enumerate(self.layers):
+             if output_hidden_states:
+                 encoder_states = encoder_states + (hidden_states,)
+             if self.gradient_checkpointing and self.training:
+                 layer_outputs = torch.utils.checkpoint.checkpoint(
+                     encoder_layer,
+                     hidden_states)
+             else:
+                 layer_outputs = encoder_layer(
+                     hidden_states,
+                 )
+             hidden_states = layer_outputs
+
+         if output_hidden_states:
+             encoder_states = encoder_states + (hidden_states,)
+
+         if not return_dict:
+             return tuple(v for v in [hidden_states, encoder_states] if v is not None)
+         return BaseModelOutput(
+             last_hidden_state=hidden_states, hidden_states=encoder_states
+         )
+
+
+ class InternVisionModel(PreTrainedModel):
+     main_input_name = 'pixel_values'
+     config_class = InternVisionConfig
+     _no_split_modules = ['InternVisionEncoderLayer']
+
+     def __init__(self, config: InternVisionConfig):
+         super().__init__(config)
+         self.config = config
+
+         self.embeddings = InternVisionEmbeddings(config)
+         self.encoder = InternVisionEncoder(config)
+
+     def resize_pos_embeddings(self, old_size, new_size, patch_size):
+         pos_emb = self.embeddings.position_embedding
+         _, num_positions, embed_dim = pos_emb.shape
+         cls_emb = pos_emb[:, :1, :]
+         pos_emb = pos_emb[:, 1:, :].reshape(1, old_size // patch_size, old_size // patch_size, -1).permute(0, 3, 1, 2)
+         pos_emb = F.interpolate(pos_emb.float(), size=new_size // patch_size, mode='bicubic', align_corners=False)
+         pos_emb = pos_emb.to(cls_emb.dtype).reshape(1, embed_dim, -1).permute(0, 2, 1)
+         pos_emb = torch.cat([cls_emb, pos_emb], dim=1)
+         self.embeddings.position_embedding = nn.Parameter(pos_emb)
+         logger.info('Resized position embeddings from {} to {}'.format(old_size, new_size))
+
+     def get_input_embeddings(self):
+         return self.embeddings
+
+     def forward(
+             self,
+             pixel_values: Optional[torch.FloatTensor] = None,
+             output_hidden_states: Optional[bool] = None,
+             return_dict: Optional[bool] = None,
+             pixel_embeds: Optional[torch.FloatTensor] = None,
+     ) -> Union[Tuple, BaseModelOutputWithPooling]:
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+         )
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         if pixel_values is None and pixel_embeds is None:
+             raise ValueError('You have to specify pixel_values or pixel_embeds')
+
+         if pixel_embeds is not None:
+             hidden_states = pixel_embeds
+         else:
+             if len(pixel_values.shape) == 4:
+                 hidden_states = self.embeddings(pixel_values)
+             else:
+                 raise ValueError(f'wrong pixel_values size: {pixel_values.shape}')
+         encoder_outputs = self.encoder(
+             inputs_embeds=hidden_states,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+         )
+         last_hidden_state = encoder_outputs[0]
+         pooled_output = last_hidden_state[:, 0, :]
+
+         if not return_dict:
+             return (last_hidden_state, pooled_output) + encoder_outputs[1:]
+
+         return BaseModelOutputWithPooling(
+             last_hidden_state=last_hidden_state,
+             pooler_output=pooled_output,
+             hidden_states=encoder_outputs.hidden_states,
+             attentions=encoder_outputs.attentions,
+         )
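`InternRMSNorm.forward` in the file above divides each vector by the root mean square of its entries (with `eps` added under the square root) and then multiplies by a learned per-channel weight. The same arithmetic can be sketched in plain Python without torch; `rms_norm` is a hypothetical helper written for illustration only and is not part of this commit:

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal root mean square of its entries, then by weight."""
    variance = sum(v * v for v in x) / len(x)  # mean of squares over the last axis
    inv_rms = 1.0 / math.sqrt(variance + eps)  # same role as torch.rsqrt(variance + eps)
    return [w * v * inv_rms for w, v in zip(weight, x)]


normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
print(normed)
```

With unit weights the output has a mean square very close to 1, which is the property the q/k normalization in `InternAttention` relies on to keep attention logits in a bounded range.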
preprocessor_config.json ADDED
@@ -0,0 +1,19 @@
+ {
+   "crop_size": 448,
+   "do_center_crop": true,
+   "do_normalize": true,
+   "do_resize": true,
+   "feature_extractor_type": "CLIPFeatureExtractor",
+   "image_mean": [
+     0.485,
+     0.456,
+     0.406
+   ],
+   "image_std": [
+     0.229,
+     0.224,
+     0.225
+   ],
+   "resample": 3,
+   "size": 448
+ }
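The `image_mean` and `image_std` values in this config are applied per channel when `do_normalize` is true. A minimal sketch of that step for a single pixel, assuming 8-bit input is first rescaled to the range 0 to 1 by dividing by 255 (the config shown here does not state the rescale factor, so that part is an assumption); `normalize_pixel` is a hypothetical helper, not an API of any library:

```python
# Per-channel statistics copied from preprocessor_config.json above.
IMAGE_MEAN = (0.485, 0.456, 0.406)
IMAGE_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Map an 8-bit (R, G, B) triple to normalized floats.

    Assumes values are rescaled by 1/255 before mean/std normalization.
    """
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, IMAGE_MEAN, IMAGE_STD))


print(normalize_pixel((255, 255, 255)))
print(normalize_pixel((0, 0, 0)))
```

In practice the resize to 448 and the center crop would run before this step; the sketch covers only the normalization arithmetic.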