---
license: apache-2.0
datasets:
- google/docci
- gokaygokay/random_instruct_docci
language:
- en
pipeline_tag: image-text-to-text
---

A fine-tuned version of the [moondream2](https://huggingface.co/vikhyatk/moondream2) model, trained on the [gokaygokay/random_instruct_docci](https://huggingface.co/datasets/gokaygokay/random_instruct_docci) dataset. It produces extremely detailed captions of images.

```
pip install transformers timm einops bitsandbytes accelerate flash-attn
```

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from PIL import Image

DEVICE = "cuda"
DTYPE = (
    torch.float32 if DEVICE == "cpu" else torch.float16
)  # CPU doesn't support float16

# Pin the model revision so the remote code and weights stay fixed.
revision = "3ec40c7b6b5d87bc0c51edee45e21f5f29b449d8"

tokenizer = AutoTokenizer.from_pretrained(
    "fal-ai/moondream2-docci-instruct",
    trust_remote_code=True,
    revision=revision,
)

moondream = AutoModelForCausalLM.from_pretrained(
    "fal-ai/moondream2-docci-instruct",
    trust_remote_code=True,
    torch_dtype=DTYPE,
    device_map={"": DEVICE},
    attn_implementation="flash_attention_2",  # requires a CUDA device and the flash-attn package
    revision=revision,
)
moondream.eval()

image_path = ""  # set this to the path of your image
image = Image.open(image_path).convert("RGB")

# Encode the image once, then ask a question about it.
md_answer = moondream.answer_question(
    moondream.encode_image(image),
    "what is this picture about",
    tokenizer=tokenizer,
)

print(md_answer)
```
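
To caption several images instead of one, the single-image call above can be wrapped in a small loop. The following is a minimal sketch, not part of the model's API: `find_images`, `caption_folder`, `IMAGE_SUFFIXES`, and the `images/` path are hypothetical names introduced for illustration, and the captioning step is passed in as a plain callable so the helper itself does not depend on the model.

```python
from pathlib import Path
from typing import Callable, Dict, List, Union

# Hypothetical set of file extensions to treat as images; adjust as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def find_images(folder: Union[str, Path]) -> List[Path]:
    """Return image files directly inside `folder`, sorted by path."""
    return sorted(
        p
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def caption_folder(
    folder: Union[str, Path], caption_fn: Callable[[Path], str]
) -> Dict[str, str]:
    """Map each image file name in `folder` to the caption from `caption_fn`."""
    return {path.name: caption_fn(path) for path in find_images(folder)}


# Example wiring, assuming `moondream`, `tokenizer`, and `Image` are already
# set up as in the snippet above:
#
# def caption_fn(path):
#     image = Image.open(path).convert("RGB")
#     return moondream.answer_question(
#         moondream.encode_image(image),
#         "what is this picture about",
#         tokenizer=tokenizer,
#     )
#
# captions = caption_folder("images/", caption_fn)
```

Passing the captioning step as a callable keeps the file handling separate from the model, so the prompt or the model can be changed without touching the loop.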