zR committed on
Commit d2b1ebe · 1 Parent(s): d16a569

update gpu memory cost

Files changed (2)
  1. README.md +44 -14
  2. README_zh.md +19 -6
README.md CHANGED
@@ -23,6 +23,9 @@ inference: false
  <a href="https://github.com/THUDM/CogVideo">🌐 Github </a> |
  <a href="https://arxiv.org/pdf/2408.06072">📜 arxiv </a>
  </p>

  ## Demo Show

@@ -109,7 +112,9 @@ inference: false

  ## Model Introduction

- CogVideoX is an open-source version of the video generation model originating from [QingYing](https://chatglm.cn/video?lang=en?fr=osm_cogvideo). The table below displays the list of video generation models we currently offer, along with their foundational information.

  <table style="border-collapse: collapse; width: 100%;">
  <tr>
@@ -128,9 +133,9 @@ CogVideoX is an open-source version of the video generation model originating fr
  <td style="text-align: center;"><b>BF16 (Recommended)</b>, FP16, FP32, FP8*, INT8, no support for INT4</td>
  </tr>
  <tr>
- <td style="text-align: center;">Single GPU VRAM Consumption</td>
- <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers with torchao</b></td>
- <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers with torchao</b></td>
  </tr>
  <tr>
  <td style="text-align: center;">Multi-GPU Inference VRAM Consumption</td>
@@ -181,13 +186,34 @@ CogVideoX is an open-source version of the video generation model originating fr

  **Data Explanation**

- - When testing with the diffusers library, the `enable_model_cpu_offload()` option and `pipe.vae.enable_tiling()` optimization were enabled. This solution has not been tested for actual VRAM/memory usage on devices other than **NVIDIA A100/H100**. Generally, this solution can be adapted to all devices with **NVIDIA Ampere architecture** and above. If optimization is disabled, VRAM usage will increase significantly, with peak VRAM approximately 3 times the value in the table.
- - When performing multi-GPU inference, the `enable_model_cpu_offload()` optimization needs to be disabled.
- - Using an INT8 model will result in reduced inference speed. This is done to accommodate GPUs with lower VRAM, allowing inference to run properly with minimal video quality loss, though the inference speed will be significantly reduced.
- - The 2B model is trained using `FP16` precision, while the 5B model is trained using `BF16` precision. It is recommended to use the precision used in model training for inference.
- - `FP8` precision must be used on `NVIDIA H100` and above devices, requiring source installation of the `torch`, `torchao`, `diffusers`, and `accelerate` Python packages. `CUDA 12.4` is recommended.
- - Inference speed testing also used the aforementioned VRAM optimization scheme. Without VRAM optimization, inference speed increases by about 10%. Only models using `diffusers` support quantization.
- - The model only supports English input; other languages can be translated to English during large model refinements.

  **Note**

@@ -242,7 +268,10 @@ export_to_video(video, "output.mp4", fps=8)

  ## Quantized Inference

- [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the Text Encoder, Transformer and VAE modules to lower the memory requirement of CogVideoX. This makes it possible to run the model on free-tier T4 Colab or smaller VRAM GPUs as well! It is also worth noting that TorchAO quantization is fully compatible with `torch.compile`, which allows for much faster inference speed.

  ```diff
  # To get started, PytorchAO needs to be installed from the GitHub source and PyTorch Nightly.
@@ -290,11 +319,12 @@ video = pipe(
  export_to_video(video, "output.mp4", fps=8)
  ```

- Additionally, the models can be serialized and stored in a quantized datatype to save disk space when using PytorchAO. Find examples and benchmarks at these links:
  - [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897)
  - [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)

-
  ## Explore the Model

  Welcome to our [github](https://github.com/THUDM/CogVideo), where you will find:
 
@@ -23,6 +23,9 @@ inference: false
  <a href="https://github.com/THUDM/CogVideo">🌐 Github </a> |
  <a href="https://arxiv.org/pdf/2408.06072">📜 arxiv </a>
  </p>
+ <p align="center">
+ 📍 Visit <a href="https://chatglm.cn/video?lang=en?fr=osm_cogvideo">QingYing</a> and <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience commercial video generation models.
+ </p>

  ## Demo Show

 
@@ -109,7 +112,9 @@ inference: false

  ## Model Introduction

+ CogVideoX is an open-source version of the video generation model originating
+ from [QingYing](https://chatglm.cn/video?lang=en?fr=osm_cogvideo). The table below displays the list of video generation
+ models we currently offer, along with their foundational information.

  <table style="border-collapse: collapse; width: 100%;">
  <tr>
 
@@ -128,9 +133,9 @@ CogVideoX is an open-source version of the video generation model originating fr
  <td style="text-align: center;"><b>BF16 (Recommended)</b>, FP16, FP32, FP8*, INT8, no support for INT4</td>
  </tr>
  <tr>
+ <td style="text-align: center;">Single GPU VRAM Consumption<br></td>
+ <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers BF16: starting from 4GB*</b><br><b>diffusers INT8(torchao): starting from 3.6GB*</b></td>
+ <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16: starting from 5GB*</b><br><b>diffusers INT8(torchao): starting from 4.4GB*</b></td>
  </tr>
  <tr>
  <td style="text-align: center;">Multi-GPU Inference VRAM Consumption</td>
 
@@ -181,13 +186,34 @@ CogVideoX is an open-source version of the video generation model originating fr

  **Data Explanation**

+ + When testing using the `diffusers` library, all optimizations provided by the `diffusers` library were enabled. This
+ solution has not been tested for actual VRAM/memory usage on devices other than **NVIDIA A100 / H100**. Generally,
+ this solution can be adapted to all devices with **NVIDIA Ampere architecture** and above. If the optimizations are
+ disabled, VRAM usage will increase significantly, with peak VRAM usage being about 3 times higher than the table
+ shows. However, speed will increase by 3-4 times. You can selectively disable some optimizations, including:
+
+ ```
+ pipe.enable_model_cpu_offload()
+ pipe.enable_sequential_cpu_offload()
+ pipe.vae.enable_slicing()
+ pipe.vae.enable_tiling()
+ ```
+
+ + When performing multi-GPU inference, the `enable_model_cpu_offload()` optimization needs to be disabled.
+ + Using INT8 models will reduce inference speed. This is to ensure that GPUs with lower VRAM can perform inference
+ normally while maintaining minimal video quality loss, though inference speed will decrease significantly.
+ + The 2B model is trained with `FP16` precision, and the 5B model is trained with `BF16` precision. We recommend using
+ the precision the model was trained with for inference.
+ + [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be
+ used to quantize the text encoder, Transformer, and VAE modules to reduce CogVideoX's memory requirements. This makes
+ it possible to run the model on a free T4 Colab or GPUs with smaller VRAM! It is also worth noting that TorchAO
+ quantization is fully compatible with `torch.compile`, which can significantly improve inference speed. `FP8`
+ precision must be used on devices with `NVIDIA H100` or above, which requires installing
+ the `torch`, `torchao`, `diffusers`, and `accelerate` Python packages from source. `CUDA 12.4` is recommended.
+ + The inference speed test also used the above VRAM optimization scheme. Without VRAM optimization, inference speed
+ increases by about 10%. Only the `diffusers` version of the model supports quantization.
+ + The model only supports English input; other languages can be translated into English during refinement by a large
+ model.
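
As a rough illustration of the optimization list above, the sketch below wraps the four `diffusers` calls named in this section in one helper so that each can be switched on or off independently. The helper name `apply_memory_optimizations` and its keyword arguments are hypothetical and not part of `diffusers`. The helper also assumes that the two CPU offload modes are alternatives, so it applies at most one of them.

```python
def apply_memory_optimizations(
    pipe,
    offload="sequential",  # hypothetical switch: "sequential", "model", or None
    vae_slicing=True,
    vae_tiling=True,
):
    """Apply the VRAM optimizations named above to a diffusers-style pipeline.

    Returns the names of the optimizations that were enabled, in call order.
    """
    enabled = []
    if offload == "sequential":
        # Lowest VRAM use of the two offload modes, at the largest speed cost.
        pipe.enable_sequential_cpu_offload()
        enabled.append("sequential_cpu_offload")
    elif offload == "model":
        pipe.enable_model_cpu_offload()
        enabled.append("model_cpu_offload")
    elif offload is not None:
        raise ValueError(f"unknown offload mode: {offload!r}")
    if vae_slicing:
        pipe.vae.enable_slicing()
        enabled.append("vae_slicing")
    if vae_tiling:
        pipe.vae.enable_tiling()
        enabled.append("vae_tiling")
    return enabled
```

With a loaded pipeline, `apply_memory_optimizations(pipe)` targets the lowest VRAM use, while `apply_memory_optimizations(pipe, offload=None, vae_slicing=False, vae_tiling=False)` leaves all four optimizations off in exchange for speed. For multi-GPU inference, pass `offload=None`, in line with the note above.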

  **Note**

 
@@ -242,7 +268,10 @@ export_to_video(video, "output.mp4", fps=8)

  ## Quantized Inference

+ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be
+ used to quantize the Text Encoder, Transformer and VAE modules to lower the memory requirement of CogVideoX. This makes
+ it possible to run the model on free-tier T4 Colab or smaller VRAM GPUs as well! It is also worth noting that TorchAO
+ quantization is fully compatible with `torch.compile`, which allows for much faster inference speed.

  ```diff
  # To get started, PytorchAO needs to be installed from the GitHub source and PyTorch Nightly.
 
@@ -290,11 +319,12 @@ video = pipe(
  export_to_video(video, "output.mp4", fps=8)
  ```

+ Additionally, the models can be serialized and stored in a quantized datatype to save disk space when using PytorchAO.
+ Find examples and benchmarks at these links:
+
  - [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897)
  - [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)
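
To make the disk-space saving concrete, here is a back-of-the-envelope sketch of weight storage by datatype. It assumes nominal parameter counts of 2 billion and 5 billion taken from the model names, counts weights only (no text encoder, VAE, activations, or quantization scales), and uses the standard widths of 4, 2, and 1 bytes per parameter. The function name `weight_gigabytes` is hypothetical, and the results are estimates rather than measurements from this repository.

```python
# Bytes needed to store one parameter in each datatype.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "fp8": 1}


def weight_gigabytes(num_params, dtype):
    """Estimate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Nominal sizes assumed from the model names, not measured.
    for name, params in {"2B": 2e9, "5B": 5e9}.items():
        for dtype in ("fp16", "int8"):
            print(f"{name} {dtype}: ~{weight_gigabytes(params, dtype):.1f} GB")
```

Under these assumptions, storing weights as INT8 halves the footprint of the FP16 or BF16 checkpoint; the linked gists are the place to look for measured numbers.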

  ## Explore the Model

  Welcome to our [github](https://github.com/THUDM/CogVideo), where you will find:
README_zh.md CHANGED
@@ -10,6 +10,9 @@
  <a href="https://github.com/THUDM/CogVideo">🌐 Github </a> |
  <a href="https://arxiv.org/pdf/2408.06072">📜 arxiv </a>
  </p>

  ## Demo Show

@@ -116,8 +119,8 @@ CogVideoX is an open-source, [QingYing](https://chatglm.cn/video?fr=osm_cogvideo)-derived
  </tr>
  <tr>
  <td style="text-align: center;">Single GPU VRAM Consumption<br></td>
- <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers with torchao</b></td>
- <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers with torchao</b></td>
  </tr>
  <tr>
  <td style="text-align: center;">Multi-GPU Inference VRAM Consumption</td>
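@@ -168,13 +171,23 @@ CogVideoX is an open-source, [QingYing](https://chatglm.cn/video?fr=osm_cogvideo)-derived

  **Data Explanation**

- + When testing with the diffusers library, the `enable_model_cpu_offload()` option and the `pipe.vae.enable_tiling()` optimization were enabled. This setup has not been tested for actual VRAM / memory usage on devices other than
- **NVIDIA A100 / H100**. Generally, it can be adapted to all devices with **NVIDIA Ampere architecture**
- and above. If the optimizations are disabled, VRAM usage multiplies, with peak VRAM about 3 times the value in the table.
  + When performing multi-GPU inference, the `enable_model_cpu_offload()` optimization needs to be disabled.
  + Using an INT8 model reduces inference speed. This trade-off lets GPUs with less VRAM run inference normally with little loss in video quality, at the cost of significantly slower inference.
  + The 2B model is trained with `FP16` precision and the 5B model with `BF16` precision. We recommend running inference in the precision the model was trained with.
- + `FP8` precision can only be used on `NVIDIA H100` or newer devices and requires installing the `torch`, `torchao`, `diffusers`, and `accelerate` Python packages from source. `CUDA 12.4` is recommended.
  + The inference speed tests also used the VRAM optimizations above. Without them, inference is about 10% faster. Only the `diffusers` version of the model supports quantization.
  + The model only supports English input; other languages can be translated into English while a large model refines the prompt.
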
 
@@ -10,6 +10,9 @@
  <a href="https://github.com/THUDM/CogVideo">🌐 Github </a> |
  <a href="https://arxiv.org/pdf/2408.06072">📜 arxiv </a>
  </p>
+ <p align="center">
+ 📍 Visit <a href="https://chatglm.cn/video?fr=osm_cogvideox"> QingYing</a> and the <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9"> API Platform</a> to experience commercial video generation models
+ </p>

  ## Demo Show

 
@@ -116,8 +119,8 @@ CogVideoX is an open-source, [QingYing](https://chatglm.cn/video?fr=osm_cogvideo)-derived
  </tr>
  <tr>
  <td style="text-align: center;">Single GPU VRAM Consumption<br></td>
+ <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers BF16: starting from 4GB* </b><br><b>diffusers INT8(torchao): starting from 3.6GB*</b></td>
+ <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16 : starting from 5GB* </b><br><b>diffusers INT8(torchao): starting from 4.4GB* </b></td>
  </tr>
  <tr>
  <td style="text-align: center;">Multi-GPU Inference VRAM Consumption</td>
 
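@@ -168,13 +171,23 @@ CogVideoX is an open-source, [QingYing](https://chatglm.cn/video?fr=osm_cogvideo)-derived

  **Data Explanation**

+ + When testing with the diffusers library, all optimizations built into the `diffusers` library were enabled. This setup has not been tested for actual VRAM / memory usage on devices other than **NVIDIA A100 / H100**. Generally, it can be adapted to all devices with **NVIDIA Ampere architecture**
+ and above. If the optimizations are disabled, VRAM usage multiplies, with peak VRAM about 3 times the value in the table, but speed increases about 3-4 times. You can selectively disable some of the optimizations, which include:
+ ```
+ pipe.enable_model_cpu_offload()
+ pipe.enable_sequential_cpu_offload()
+ pipe.vae.enable_slicing()
+ pipe.vae.enable_tiling()
+ ```
+
  + When performing multi-GPU inference, the `enable_model_cpu_offload()` optimization needs to be disabled.
  + Using an INT8 model reduces inference speed. This trade-off lets GPUs with less VRAM run inference normally with little loss in video quality, at the cost of significantly slower inference.
  + The 2B model is trained with `FP16` precision and the 5B model with `BF16` precision. We recommend running inference in the precision the model was trained with.
+ + [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/)
+ can be used to quantize the text encoder, Transformer, and VAE modules to reduce CogVideoX's memory requirements. This makes it possible to run the model on a free T4 Colab or on GPUs
+ with less VRAM! It is also worth noting that TorchAO quantization is fully compatible with `torch.compile`, which can significantly improve inference speed. `FP8` precision requires an `NVIDIA H100`
+ or newer device and installing the `torch`, `torchao`, `diffusers`, and `accelerate` Python
+ packages from source. `CUDA 12.4` is recommended.
  + The inference speed tests also used the VRAM optimizations above. Without them, inference is about 10% faster. Only the `diffusers` version of the model supports quantization.
  + The model only supports English input; other languages can be translated into English while a large model refines the prompt.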
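
The precision notes above can be turned into a small pre-flight check. The sketch below is an illustration under stated assumptions: it takes a CUDA compute capability tuple, such as the one returned by `torch.cuda.get_device_capability()`, and treats capability 8.0 as the start of the Ampere generation and 9.0 as H100. The function name `precision_report` and the keys it returns are hypothetical, and the `FP8` rule simply follows this document's H100-and-above statement rather than any library's support matrix.

```python
def precision_report(capability, model_size):
    """Summarize the precision guidance above for one GPU.

    capability: (major, minor) compute capability, e.g. (8, 0) for A100.
    model_size: "2B" (trained in FP16) or "5B" (trained in BF16).
    """
    trained_precision = {"2B": "FP16", "5B": "BF16"}[model_size]
    return {
        # The notes recommend running inference in the training precision.
        "recommended": trained_precision,
        # FP8 is described as requiring H100 or newer (assumed capability >= 9.0).
        "fp8_ok": tuple(capability) >= (9, 0),
        # The optimization path is described as adaptable to Ampere and above.
        "ampere_or_newer": tuple(capability) >= (8, 0),
    }
```

On a machine with PyTorch and a CUDA device, a call such as `precision_report(torch.cuda.get_device_capability(), "5B")` would give a quick summary before choosing a dtype.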