|
--- |
|
library_name: hunyuan-dit |
|
license: other |
|
license_name: tencent-hunyuan-community |
|
license_link: https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/blob/main/LICENSE.txt |
|
language: |
|
- en |
|
- zh |
|
--- |
|
|
|
## Hunyuan-Captioner |
|
Hunyuan-Captioner meets the needs of text-to-image techniques by generating image descriptions with a high degree of image-text consistency. It produces high-quality descriptions from a variety of angles, including object description, object relationships, background information, and image style. Our code is based on the [LLaVA](https://github.com/haotian-liu/LLaVA) implementation.
|
|
|
### Instructions |
|
a. Install dependencies |
|
|
|
The dependencies and installation are basically the same as the [**base model**](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.1). |
|
|
|
b. Data download |
|
```shell |
|
cd HunyuanDiT |
|
wget -O ./dataset/data_demo.zip https://dit.hunyuan.tencent.com/download/HunyuanDiT/data_demo.zip |
|
unzip ./dataset/data_demo.zip -d ./dataset |
|
mkdir ./dataset/porcelain/arrows ./dataset/porcelain/jsons |
|
``` |
|
|
|
c. Model download |
|
```shell |
|
# Use the huggingface-cli tool to download the model. |
|
huggingface-cli download Tencent-Hunyuan/HunyuanCaptioner --local-dir ./ckpts/captioner |
|
``` |
|
|
|
|
|
### Inference |
|
|
|
Currently supported prompt templates:
|
|
|
|Mode | Prompt template |Description | |
|
| --- | --- | --- | |
|
|caption_zh | 描述这张图片 |Caption in Chinese | |
|
|insert_content | 根据提示词“{}”,描述这张图片 |Insert specific knowledge into caption| |
|
|caption_en | Please describe the content of this image |Caption in English | |
|
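As a sketch of how the `insert_content` mode works, the `{}` placeholder in the template is filled with the `--content` string before the prompt is sent to the model. The template string below is copied from the table above; the helper name is hypothetical, not part of the Hunyuan-Captioner API:

```python
# Fill the insert_content prompt template with a user-provided keyword.
# INSERT_CONTENT_TEMPLATE is the template from the table above;
# build_prompt is a hypothetical helper for illustration only.
INSERT_CONTENT_TEMPLATE = "根据提示词“{}”,描述这张图片"

def build_prompt(content: str) -> str:
    """Return the caption prompt with the keyword inserted."""
    return INSERT_CONTENT_TEMPLATE.format(content)

print(build_prompt("宫保鸡丁"))
# 根据提示词“宫保鸡丁”,描述这张图片
```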
|
|
|
|
|
a. Single-image inference in Chinese
|
|
|
```bash |
|
python mllm/caption_demo.py --mode "caption_zh" --image_file "mllm/images/demo1.png" --model_path "./ckpts/captioner" |
|
``` |
|
|
|
b. Insert specific knowledge into caption |
|
|
|
```bash |
|
python mllm/caption_demo.py --mode "insert_content" --content "宫保鸡丁" --image_file "mllm/images/demo2.png" --model_path "./ckpts/captioner" |
|
``` |
|
|
|
c. Single-image inference in English
|
|
|
```bash |
|
python mllm/caption_demo.py --mode "caption_en" --image_file "mllm/images/demo3.png" --model_path "./ckpts/captioner" |
|
``` |
|
|
|
d. Multi-image inference in Chinese
|
|
|
```bash |
|
### Convert multiple images to a CSV file.
|
python mllm/make_csv.py --img_dir "mllm/images" --input_file "mllm/images/demo.csv" |
|
|
|
### Multi-image inference
|
python mllm/caption_demo.py --mode "caption_zh" --input_file "mllm/images/demo.csv" --output_file "mllm/images/demo_res.csv" --model_path "./ckpts/captioner" |
|
``` |
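For reference, a minimal stdlib-only sketch of what the CSV-building step does — collecting image paths from a directory into a single-column CSV — might look like the following. This is a hypothetical stand-in; `mllm/make_csv.py` in the repo is the authoritative implementation, and the `img_path` column name here is an assumption:

```python
import csv
import os
import tempfile

def make_image_csv(img_dir: str, output_csv: str) -> int:
    """Write one row per image file found in img_dir; return the row count.

    Hypothetical stand-in for mllm/make_csv.py; the 'img_path' column
    name is an assumption, not taken from the repo.
    """
    exts = {".png", ".jpg", ".jpeg", ".webp"}
    paths = sorted(
        os.path.join(img_dir, name)
        for name in os.listdir(img_dir)
        if os.path.splitext(name)[1].lower() in exts
    )
    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["img_path"])
        writer.writerows([p] for p in paths)
    return len(paths)

# Demo on a throwaway directory with two empty placeholder files.
demo_dir = tempfile.mkdtemp()
for name in ("demo1.png", "demo2.png"):
    open(os.path.join(demo_dir, name), "wb").close()
out_csv = os.path.join(demo_dir, "demo.csv")
print(make_image_csv(demo_dir, out_csv))
```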
|
|
|
(Optional) To convert the output CSV file to Arrow format, please refer to [Data Preparation #3](https://github.com/Tencent/HunyuanDiT?tab=readme-ov-file#data-preparation) for detailed instructions.
|
|
|
|
|
### Gradio |
|
To launch a Gradio demo locally, run the following commands in sequence, keeping each one running in the background (for example, in separate terminal sessions). For more detailed instructions, please refer to [LLaVA](https://github.com/haotian-liu/LLaVA).
|
```bash |
|
cd mllm |
|
python -m llava.serve.controller --host 0.0.0.0 --port 10000 |
|
python -m llava.serve.gradio_web_server --controller http://0.0.0.0:10000 --model-list-mode reload --port 443 |
|
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://0.0.0.0:10000 --port 40000 --worker http://0.0.0.0:40000 --model-path "../ckpts/captioner" --model-name LlavaMistral |
|
``` |
|
The demo can then be accessed at http://0.0.0.0:443. Note that `0.0.0.0` here should be replaced with your server's IP address.
|
|