|
--- |
|
library_name: hunyuan-dit |
|
license: other |
|
license_name: tencent-hunyuan-community |
|
license_link: https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/blob/main/LICENSE.txt |
|
language: |
|
- en |
|
- zh |
|
--- |
|
|
|
## Hunyuan-Captioner |
|
Hunyuan-Captioner meets the needs of text-to-image techniques by generating image descriptions with a high degree of image-text consistency. It produces high-quality descriptions from a variety of angles, including object description, object relationships, background information, and image style. Our code is based on the [LLaVA](https://github.com/haotian-liu/LLaVA) implementation.
|
|
|
### Instructions |
|
a. Install dependencies |
|
|
|
The dependencies and installation are basically the same as the [**base model**](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.1). |
|
|
|
b. Data download |
|
```shell |
|
cd HunyuanDiT |
|
wget -O ./dataset/data_demo.zip https://dit.hunyuan.tencent.com/download/HunyuanDiT/data_demo.zip |
|
unzip ./dataset/data_demo.zip -d ./dataset |
|
mkdir ./dataset/porcelain/arrows ./dataset/porcelain/jsons |
|
``` |
|
|
|
c. Model download |
|
```shell |
|
# Use the huggingface-cli tool to download the model. |
|
huggingface-cli download Tencent-Hunyuan/HunyuanCaptioner --local-dir ./ckpts/captioner |
|
``` |
|
|
|
|
|
### Inference |
|
|
|
Currently supported prompt templates:
|
|
|
|Mode | Prompt template |Description | |
|
| --- | --- | --- | |
|
|caption_zh | 描述这张图片 |Caption in Chinese | |
|
|insert_content | 根据提示词“{}”,描述这张图片 |Insert specific knowledge into caption| |
|
|caption_en | Please describe the content of this image |Caption in English | |
|
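As a sketch of how the `insert_content` mode works, the `{}` placeholder in the template is filled with the `--content` string before the prompt is sent to the model. The template string below is copied from the table above; the helper name is hypothetical, not part of the Hunyuan-Captioner API:

```python
# Fill the insert_content prompt template with a user-provided keyword.
# INSERT_CONTENT_TEMPLATE is the template from the table above;
# build_prompt is a hypothetical helper for illustration only.
INSERT_CONTENT_TEMPLATE = "根据提示词“{}”,描述这张图片"

def build_prompt(content: str) -> str:
    """Return the caption prompt with the keyword inserted."""
    return INSERT_CONTENT_TEMPLATE.format(content)

print(build_prompt("宫保鸡丁"))
# 根据提示词“宫保鸡丁”,描述这张图片
```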
|
|
|
|
|
a. Single-image inference in Chinese
|
|
|
```bash |
|
python mllm/caption_demo.py --mode "caption_zh" --image_file "mllm/images/demo1.png" --model_path "./ckpts/captioner" |
|
``` |
|
|
|
b. Insert specific knowledge into caption |
|
|
|
```bash |
|
python mllm/caption_demo.py --mode "insert_content" --content "宫保鸡丁" --image_file "mllm/images/demo2.png" --model_path "./ckpts/captioner" |
|
``` |
|
|
|
c. Single-image inference in English
|
|
|
```bash |
|
python mllm/caption_demo.py --mode "caption_en" --image_file "mllm/images/demo3.png" --model_path "./ckpts/captioner" |
|
``` |
|
|
|
d. Multi-image inference in Chinese
|
|
|
```bash |
|
### Convert multiple images to a CSV file.
|
python mllm/make_csv.py --img_dir "mllm/images" --input_file "mllm/images/demo.csv" |
|
|
|
### Multi-image inference
|
python mllm/caption_demo.py --mode "caption_zh" --input_file "mllm/images/demo.csv" --output_file "mllm/images/demo_res.csv" --model_path "./ckpts/captioner" |
|
``` |
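For reference, a minimal stdlib-only sketch of what the CSV-building step does — collecting image paths from a directory into a single-column CSV — might look like the following. This is a hypothetical stand-in; `mllm/make_csv.py` in the repo is the authoritative implementation, and the `img_path` column name here is an assumption:

```python
import csv
import os
import tempfile

def make_image_csv(img_dir: str, output_csv: str) -> int:
    """Write one row per image file found in img_dir; return the row count.

    Hypothetical stand-in for mllm/make_csv.py; the 'img_path' column
    name is an assumption, not taken from the repo.
    """
    exts = {".png", ".jpg", ".jpeg", ".webp"}
    paths = sorted(
        os.path.join(img_dir, name)
        for name in os.listdir(img_dir)
        if os.path.splitext(name)[1].lower() in exts
    )
    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["img_path"])
        writer.writerows([p] for p in paths)
    return len(paths)

# Demo on a throwaway directory with two empty placeholder files.
demo_dir = tempfile.mkdtemp()
for name in ("demo1.png", "demo2.png"):
    open(os.path.join(demo_dir, name), "wb").close()
out_csv = os.path.join(demo_dir, "demo.csv")
print(make_image_csv(demo_dir, out_csv))
```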
|
|
|
(Optional) To convert the output CSV file to Arrow format, please refer to [Data Preparation #3](https://github.com/Tencent/HunyuanDiT?tab=readme-ov-file#data-preparation) for detailed instructions.
|
|
|
|
|
### Gradio |
|
To launch a Gradio demo locally, run the following commands in sequence, keeping each one running in the background (for example, in separate terminal sessions). For more detailed instructions, please refer to [LLaVA](https://github.com/haotian-liu/LLaVA).
|
```bash |
|
cd mllm |
|
python -m llava.serve.controller --host 0.0.0.0 --port 10000 |
|
python -m llava.serve.gradio_web_server --controller http://0.0.0.0:10000 --model-list-mode reload --port 443 |
|
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://0.0.0.0:10000 --port 40000 --worker http://0.0.0.0:40000 --model-path "../ckpts/captioner" --model-name LlavaMistral |
|
``` |
|
The demo can then be accessed at http://0.0.0.0:443. Note that `0.0.0.0` here should be replaced with your server's IP address.
|
|