zR committed on
Commit
db99411
·
1 Parent(s): 9049557

update gpu memory cost

Files changed (2)
  1. README.md +55 -23
  2. README_zh.md +21 -8
README.md CHANGED
@@ -1,12 +1,12 @@
1
  ---
2
  license: apache-2.0
3
  language:
4
- - en
5
  tags:
6
- - cogvideox
7
- - video-generation
8
- - thudm
9
- - text-to-video
10
  inference: false
11
  ---
12
 
@@ -22,6 +22,9 @@ inference: false
22
  <a href="https://github.com/THUDM/CogVideo">🌐 Github </a> |
23
  <a href="https://arxiv.org/pdf/2408.06072">📜 arxiv </a>
24
  </p>
 
 
 
25
 
26
  ## Demo Show
27
 
@@ -84,7 +87,9 @@ inference: false
84
 
85
  ## Model Introduction
86
 
87
- CogVideoX is an open-source version of the video generation model originating from [QingYing](https://chatglm.cn/video?lang=en?fr=osm_cogvideo). The table below displays the list of video generation models we currently offer, along with their foundational information.
 
 
88
 
89
  <table style="border-collapse: collapse; width: 100%;">
90
  <tr>
@@ -103,9 +108,9 @@ CogVideoX is an open-source version of the video generation model originating fr
103
  <td style="text-align: center;"><b>BF16 (Recommended)</b>, FP16, FP32, FP8*, INT8, no support for INT4</td>
104
  </tr>
105
  <tr>
106
- <td style="text-align: center;">Single GPU VRAM Consumption</td>
107
- <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers with torchao</b></td>
108
- <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers with torchao</b></td>
109
  </tr>
110
  <tr>
111
  <td style="text-align: center;">Multi-GPU Inference VRAM Consumption</td>
@@ -156,20 +161,40 @@ CogVideoX is an open-source version of the video generation model originating fr
156
 
157
  **Data Explanation**
158
 
159
- + When testing with the diffusers library, the `enable_model_cpu_offload()` option and `pipe.vae.enable_tiling()` optimization were enabled. This solution has not been tested on devices other than **NVIDIA A100 / H100**. Typically, this solution is adaptable to all devices above the **NVIDIA Ampere architecture**. If the optimization is disabled, memory usage will increase significantly, with peak memory being about 3 times the table value.
160
- + The CogVideoX-2B model was trained using `FP16` precision, so it is recommended to use `FP16` for inference.
161
- + For multi-GPU inference, the `enable_model_cpu_offload()` optimization needs to be disabled.
162
- + Using the INT8 model will lead to reduced inference speed. This is done to allow low-memory GPUs to perform inference while maintaining minimal video quality loss, though the inference speed will be significantly reduced.
163
- + Inference speed tests also used the memory optimization mentioned above. Without memory optimization, inference speed increases by approximately 10%. Only the `diffusers` version of the model supports quantization.
164
- + The model only supports English input; other languages can be translated to English for refinement by large models.
165
 
166
  **Note**
167
 
168
  + Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of SAT version
169
  models. Feel free to visit our GitHub for more information.
170
 
171
-
172
-
173
  ## Quick Start 🤗
174
 
175
  This model can be deployed with the Hugging Face `diffusers` library by following these steps.
@@ -202,8 +227,9 @@ pipe = CogVideoXPipeline.from_pretrained(
202
  )
203
 
204
  pipe.enable_model_cpu_offload()
 
 
205
  pipe.vae.enable_tiling()
206
-
207
  video = pipe(
208
  prompt=prompt,
209
  num_videos_per_prompt=1,
@@ -218,7 +244,10 @@ export_to_video(video, "output.mp4", fps=8)
218
 
219
  ## Quantized Inference
220
 
221
- [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the Text Encoder, Transformer and VAE modules to lower the memory requirement of CogVideoX. This makes it possible to run the model on free-tier T4 Colab or smaller VRAM GPUs as well! It is also worth noting that TorchAO quantization is fully compatible with `torch.compile`, which allows for much faster inference speed.
 
 
 
222
 
223
  ```diff
224
  # To get started, PytorchAO needs to be installed from the GitHub source and PyTorch Nightly.
@@ -266,11 +295,12 @@ video = pipe(
266
  export_to_video(video, "output.mp4", fps=8)
267
  ```
268
 
269
- Additionally, the models can be serialized and stored in a quantized datatype to save disk space when using PytorchAO. Find examples and benchmarks at these links:
 
 
270
  - [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897)
271
  - [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)
272
 
273
-
274
  ## Explore the Model
275
 
276
  Welcome to our [github](https://github.com/THUDM/CogVideo), where you will find:
@@ -284,9 +314,11 @@ Welcome to our [github](https://github.com/THUDM/CogVideo), where you will find:
284
 
285
  ## Model License
286
 
287
- The CogVideoX-2B model (including its corresponding Transformers module and VAE module) is released under the [Apache 2.0 License](LICENSE).
 
288
 
289
- The CogVideoX-5B model (Transformers module) is released under the [CogVideoX LICENSE](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE).
 
290
 
291
  ## Citation
292
 
 
1
  ---
2
  license: apache-2.0
3
  language:
4
+ - en
5
  tags:
6
+ - cogvideox
7
+ - video-generation
8
+ - thudm
9
+ - text-to-video
10
  inference: false
11
  ---
12
 
 
22
  <a href="https://github.com/THUDM/CogVideo">🌐 Github </a> |
23
  <a href="https://arxiv.org/pdf/2408.06072">📜 arxiv </a>
24
  </p>
25
+ <p align="center">
26
+ 📍 Visit <a href="https://chatglm.cn/video?lang=en?fr=osm_cogvideo">QingYing</a> and <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience commercial video generation models.
27
+ </p>
28
 
29
  ## Demo Show
30
 
 
87
 
88
  ## Model Introduction
89
 
90
+ CogVideoX is an open-source version of the video generation model originating
91
+ from [QingYing](https://chatglm.cn/video?lang=en?fr=osm_cogvideo). The table below displays the list of video generation
92
+ models we currently offer, along with their foundational information.
93
 
94
  <table style="border-collapse: collapse; width: 100%;">
95
  <tr>
 
108
  <td style="text-align: center;"><b>BF16 (Recommended)</b>, FP16, FP32, FP8*, INT8, no support for INT4</td>
109
  </tr>
110
  <tr>
111
+ <td style="text-align: center;">Single GPU VRAM Consumption<br></td>
112
+ <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers BF16: starting from 4GB*</b><br><b>diffusers INT8(torchao): starting from 3.6GB*</b></td>
113
+ <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16: starting from 5GB*</b><br><b>diffusers INT8(torchao): starting from 4.4GB*</b></td>
114
  </tr>
115
  <tr>
116
  <td style="text-align: center;">Multi-GPU Inference VRAM Consumption</td>
 
161
 
162
  **Data Explanation**
163
 
164
+ + When testing using the `diffusers` library, all optimizations provided by the `diffusers` library were enabled. This
165
+ solution has not been tested for actual VRAM/memory usage on devices other than **NVIDIA A100 / H100**. Generally,
166
+ this solution can be adapted to all devices with **NVIDIA Ampere architecture** and above. If the optimizations are
167
+ disabled, VRAM usage will increase significantly, with peak VRAM usage being about 3 times higher than the table
168
+ shows. However, speed will increase by 3-4 times. You can selectively disable some optimizations, including:
169
+
170
+ ```
171
+ pipe.enable_model_cpu_offload()
172
+ pipe.enable_sequential_cpu_offload()
173
+ pipe.vae.enable_slicing()
174
+ pipe.vae.enable_tiling()
175
+ ```
176
+
177
+ + When performing multi-GPU inference, the `enable_model_cpu_offload()` optimization needs to be disabled.
178
+ + INT8 quantization lets GPUs with less VRAM run inference with minimal loss of video quality, but it reduces
179
+ inference speed significantly.
180
+ + The 2B model is trained with `FP16` precision, and the 5B model is trained with `BF16` precision. We recommend using
181
+ the precision the model was trained with for inference.
182
+ + [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be
183
+ used to quantize the text encoder, Transformer, and VAE modules to reduce CogVideoX's memory requirements. This makes
184
+ it possible to run the model on a free T4 Colab or GPUs with smaller VRAM! It is also worth noting that TorchAO
185
+ quantization is fully compatible with `torch.compile`, which can significantly improve inference speed. `FP8`
186
+ precision can only be used on devices with `NVIDIA H100` or above, and requires installing
187
+ the `torch`, `torchao`, `diffusers`, and `accelerate` Python packages from source. `CUDA 12.4` is recommended.
188
+ + The inference speed test also used the above VRAM optimization scheme. Without VRAM optimization, inference speed
189
+ increases by about 10%. Only the `diffusers` version of the model supports quantization.
190
+ + The model only supports English input; other languages can be translated into English during refinement by a large
191
+ model.
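
The optimizations listed above can be toggled individually. The helper below is a minimal sketch of that idea, not code from the README: `apply_memory_optimizations` and its parameters are hypothetical names, and picking a single offload mode per call is an assumption of this sketch (the README snippet simply calls both offload methods). It only uses the four pipeline methods named in the list and enforces the multi-GPU rule stated in the notes.

```python
from typing import List, Optional


def apply_memory_optimizations(
    pipe,
    offload: Optional[str] = "sequential",
    vae_slicing: bool = True,
    vae_tiling: bool = True,
    multi_gpu: bool = False,
) -> List[str]:
    """Enable a chosen subset of memory optimizations on a diffusers-style pipeline.

    `offload` is "sequential", "model", or None (hypothetical parameter).
    Returns the names of the optimizations that were enabled, in call order.
    """
    if offload not in ("sequential", "model", None):
        raise ValueError(f"unknown offload mode: {offload!r}")
    # The notes state that model CPU offload must be disabled for multi-GPU inference.
    if multi_gpu and offload == "model":
        raise ValueError("enable_model_cpu_offload() must be disabled for multi-GPU inference")

    applied: List[str] = []
    if offload == "sequential":
        pipe.enable_sequential_cpu_offload()
        applied.append("enable_sequential_cpu_offload")
    elif offload == "model":
        pipe.enable_model_cpu_offload()
        applied.append("enable_model_cpu_offload")
    if vae_slicing:
        pipe.vae.enable_slicing()
        applied.append("vae.enable_slicing")
    if vae_tiling:
        pipe.vae.enable_tiling()
        applied.append("vae.enable_tiling")
    return applied
```

With a loaded `CogVideoXPipeline`, calling `apply_memory_optimizations(pipe, offload=None)` would keep the whole model on the GPU and enable only the two VAE optimizations, trading higher VRAM usage for speed as described in the first note.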
192
 
193
  **Note**
194
 
195
  + Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of SAT version
196
  models. Feel free to visit our GitHub for more information.
197
 
 
 
198
  ## Quick Start 🤗
199
 
200
  This model can be deployed with the Hugging Face `diffusers` library by following these steps.
 
227
  )
228
 
229
  pipe.enable_model_cpu_offload()
230
+ pipe.enable_sequential_cpu_offload()
231
+ pipe.vae.enable_slicing()
232
  pipe.vae.enable_tiling()
 
233
  video = pipe(
234
  prompt=prompt,
235
  num_videos_per_prompt=1,
 
244
 
245
  ## Quantized Inference
246
 
247
+ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be
248
+ used to quantize the Text Encoder, Transformer and VAE modules to lower the memory requirement of CogVideoX. This makes
249
+ it possible to run the model on free-tier T4 Colab or smaller VRAM GPUs as well! It is also worth noting that TorchAO
250
+ quantization is fully compatible with `torch.compile`, which allows for much faster inference speed.
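
To relate quantization to the VRAM figures updated in this commit, here is a small illustrative sketch that picks a precision from a VRAM budget. The thresholds are the "starting from" values in the table (measured with all `diffusers` optimizations enabled). The mapping of the first table column to CogVideoX-2B and the second to CogVideoX-5B is an assumption, since the diff does not show the column headers, and the names `MIN_VRAM_GB`, `pick_precision`, and `peak_without_optimizations_gb` are hypothetical.

```python
from typing import Optional

# "Starting from" single-GPU VRAM values (GB) from the updated table.
# Assumption: first column is CogVideoX-2B, second is CogVideoX-5B.
MIN_VRAM_GB = {
    "CogVideoX-2B": {"bf16": 4.0, "int8-torchao": 3.6},
    "CogVideoX-5B": {"bf16": 5.0, "int8-torchao": 4.4},
}


def pick_precision(model: str, free_vram_gb: float) -> Optional[str]:
    """Return the highest-precision option whose table minimum fits the budget."""
    floors = MIN_VRAM_GB[model]
    if free_vram_gb >= floors["bf16"]:
        return "bf16"
    if free_vram_gb >= floors["int8-torchao"]:
        return "int8-torchao"
    return None  # below every value listed in the table


def peak_without_optimizations_gb(table_value_gb: float) -> float:
    """Rough peak VRAM with optimizations disabled (about 3x the table value per the notes)."""
    return 3 * table_value_gb
```

These numbers are lower bounds rather than guarantees, so a real deployment should still measure peak memory on the target GPU.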
251
 
252
  ```diff
253
  # To get started, PytorchAO needs to be installed from the GitHub source and PyTorch Nightly.
 
295
  export_to_video(video, "output.mp4", fps=8)
296
  ```
297
 
298
+ Additionally, the models can be serialized and stored in a quantized datatype to save disk space when using PytorchAO.
299
+ Find examples and benchmarks at these links:
300
+
301
  - [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897)
302
  - [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)
303
 
 
304
  ## Explore the Model
305
 
306
  Welcome to our [github](https://github.com/THUDM/CogVideo), where you will find:
 
314
 
315
  ## Model License
316
 
317
+ The CogVideoX-2B model (including its corresponding Transformers module and VAE module) is released under
318
+ the [Apache 2.0 License](LICENSE).
319
 
320
+ The CogVideoX-5B model (Transformers module) is released under
321
+ the [CogVideoX LICENSE](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE).
322
 
323
  ## Citation
324
 
README_zh.md CHANGED
@@ -10,6 +10,9 @@
10
  <a href="https://github.com/THUDM/CogVideo">🌐 Github </a> |
11
  <a href="https://arxiv.org/pdf/2408.06072">📜 arxiv </a>
12
  </p>
 
 
 
13
 
14
  ## Demo Show
15
 
@@ -92,8 +95,8 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideo) 同源的开源
92
  </tr>
93
  <tr>
94
  <td style="text-align: center;">Single GPU VRAM Consumption<br></td>
95
- <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers</b></td>
96
- <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers</b></td>
97
  </tr>
98
  <tr>
99
  <td style="text-align: center;">Multi-GPU Inference VRAM Consumption</td>
@@ -144,14 +147,23 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideo) 同源的开源
144
 
145
  **Data Explanation**
146
 
147
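- + When testing with the diffusers library, the `enable_model_cpu_offload()` option and the `pipe.vae.enable_tiling()` optimization were enabled. Actual VRAM / memory usage has not been tested on
148
- devices other than **NVIDIA A100 / H100**. Generally, this solution can be adapted to all devices with the **NVIDIA Ampere architecture**
149
- and above. If the optimizations are disabled, VRAM usage increases severalfold, with peak VRAM about 3 times the table value.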
 
 
 
 
 
 
150
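  + For multi-GPU inference, the `enable_model_cpu_offload()` optimization needs to be disabled.
151
  + Using an INT8 model reduces inference speed. This allows GPUs with less VRAM to run inference normally with little loss of video quality, at the cost of a significant drop in inference speed.
152
  + The 2B model was trained with `FP16` precision and the 5B model with `BF16` precision. We recommend running inference in the precision the model was trained with.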
153
- + `FP8` precision can only be used on `NVIDIA H100` or newer devices and requires installing the `torch`, `torchao`, `diffusers`, and `accelerate`
154
- Python packages from source. `CUDA 12.4` is recommended.
 
 
 
155
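  + The inference speed tests also used the VRAM optimizations described above. Without them, inference speed increases by about 10%. Only the `diffusers` version of the model supports quantization.
156
  + The model only supports English input; other languages can be translated into English during prompt refinement by a large model.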
157
 
@@ -190,8 +202,9 @@ pipe = CogVideoXPipeline.from_pretrained(
190
  )
191
 
192
  pipe.enable_model_cpu_offload()
 
 
193
  pipe.vae.enable_tiling()
194
-
195
  video = pipe(
196
  prompt=prompt,
197
  num_videos_per_prompt=1,
 
10
  <a href="https://github.com/THUDM/CogVideo">🌐 Github </a> |
11
  <a href="https://arxiv.org/pdf/2408.06072">📜 arxiv </a>
12
  </p>
13
+ <p align="center">
14
+ 📍 Visit <a href="https://chatglm.cn/video?fr=osm_cogvideox">QingYing</a> and <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience commercial video generation models.
15
+ </p>
16
 
17
  ## Demo Show
18
 
 
95
  </tr>
96
  <tr>
97
  <td style="text-align: center;">Single GPU VRAM Consumption<br></td>
98
+ <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers BF16: starting from 4GB*</b><br><b>diffusers INT8(torchao): starting from 3.6GB*</b></td>
99
+ <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16: starting from 5GB*</b><br><b>diffusers INT8(torchao): starting from 4.4GB*</b></td>
100
  </tr>
101
  <tr>
102
  <td style="text-align: center;">Multi-GPU Inference VRAM Consumption</td>
 
147
 
148
  **Data Explanation**
149
 
150
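+ + When testing with the diffusers library, all optimizations built into the `diffusers` library were enabled. Actual VRAM / memory usage has not been tested on devices other than **NVIDIA A100 / H100**. Generally, this solution can be adapted to all devices with the **NVIDIA Ampere architecture**
151
+ and above. If the optimizations are disabled, VRAM usage increases severalfold, with peak VRAM about 3 times the table value, but speed increases by about 3-4 times. You can selectively disable some of these optimizations, which include: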
152
+ ```
153
+ pipe.enable_model_cpu_offload()
154
+ pipe.enable_sequential_cpu_offload()
155
+ pipe.vae.enable_slicing()
156
+ pipe.vae.enable_tiling()
157
+ ```
158
+
159
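  + For multi-GPU inference, the `enable_model_cpu_offload()` optimization needs to be disabled.
160
  + Using an INT8 model reduces inference speed. This allows GPUs with less VRAM to run inference normally with little loss of video quality, at the cost of a significant drop in inference speed.
161
  + The 2B model was trained with `FP16` precision and the 5B model with `BF16` precision. We recommend running inference in the precision the model was trained with.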
162
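+ + [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/)
163
+ can be used to quantize the text encoder, Transformer, and VAE modules to reduce CogVideoX's memory requirements. This makes it possible to run the model on a free T4 Colab or on GPUs
164
+ with less VRAM! It is also worth noting that TorchAO quantization is fully compatible with `torch.compile`, which can significantly improve inference speed. `FP8` precision
165
+ can only be used on `NVIDIA H100` or newer devices, and requires installing the `torch`, `torchao`, `diffusers`, and `accelerate` Python
166
+ packages from source. `CUDA 12.4` is recommended.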
167
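  + The inference speed tests also used the VRAM optimizations described above. Without them, inference speed increases by about 10%. Only the `diffusers` version of the model supports quantization.
168
  + The model only supports English input; other languages can be translated into English during prompt refinement by a large model.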
169
 
 
202
  )
203
 
204
  pipe.enable_model_cpu_offload()
205
+ pipe.enable_sequential_cpu_offload()
206
+ pipe.vae.enable_slicing()
207
  pipe.vae.enable_tiling()
 
208
  video = pipe(
209
  prompt=prompt,
210
  num_videos_per_prompt=1,