---
base_model: unsloth/Llama-3.2-11B-Vision-Instruct
tags:
  - text-generation-inference
  - transformers
  - unsloth
  - mllama
license: apache-2.0
language:
  - en
model-index:
  - name: DocumentCogito
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: IFEval (0-Shot)
          type: wis-k/instruction-following-eval
          split: train
          args:
            num_few_shot: 0
        metrics:
          - type: inst_level_strict_acc and prompt_level_strict_acc
            value: 50.64
            name: averaged accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Daemontatox%2FDocumentCogito
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BBH (3-Shot)
          type: SaylorTwift/bbh
          split: test
          args:
            num_few_shot: 3
        metrics:
          - type: acc_norm
            value: 29.79
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Daemontatox%2FDocumentCogito
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MATH Lvl 5 (4-Shot)
          type: lighteval/MATH-Hard
          split: test
          args:
            num_few_shot: 4
        metrics:
          - type: exact_match
            value: 16.24
            name: exact match
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Daemontatox%2FDocumentCogito
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GPQA (0-shot)
          type: Idavidrein/gpqa
          split: train
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 8.84
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Daemontatox%2FDocumentCogito
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MuSR (0-shot)
          type: TAUR-Lab/MuSR
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 8.6
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Daemontatox%2FDocumentCogito
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU-PRO (5-shot)
          type: TIGER-Lab/MMLU-Pro
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 31.14
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Daemontatox%2FDocumentCogito
          name: Open LLM Leaderboard
---

# unsloth/Llama-3.2-11B-Vision-Instruct (Fine-Tuned)

## Model Overview

This model is fine-tuned from the unsloth/Llama-3.2-11B-Vision-Instruct base and is optimized for vision-language tasks with enhanced instruction following. Fine-tuning ran about 2x faster than a standard Transformers training loop by combining the Unsloth framework with Hugging Face's TRL library, keeping training efficient while preserving output quality.
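
The exact training recipe for this checkpoint is not published here, but a minimal sketch of the Unsloth-accelerated LoRA setup described above could look like the following. The `FastVisionModel` calls follow Unsloth's documented vision API, and the LoRA hyperparameters are illustrative assumptions rather than the settings used for DocumentCogito:

```python
from unsloth import FastVisionModel

# Load the base vision model with Unsloth's optimized kernels (4-bit keeps VRAM low).
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct",
    load_in_4bit=True,
)

# Attach LoRA adapters to both the vision and language stacks.
# r/lora_alpha are common defaults, not the published values for this model.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    r=16,
    lora_alpha=16,
)

# The PEFT-wrapped model is then trained with TRL's SFTTrainer on an
# image-plus-text instruction-following dataset.
```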

## Key Information

- **Developed by:** Daemontatox
- **Base Model:** unsloth/Llama-3.2-11B-Vision-Instruct
- **License:** Apache-2.0
- **Language:** English (en)
- **Frameworks Used:** Hugging Face Transformers, Unsloth, and TRL

## Performance and Use Cases

This model is ideal for applications involving:

- Vision-based text generation and description tasks
- Instruction-following in multimodal contexts
- General-purpose text generation with enhanced reasoning

## Features

- **2x Faster Training:** leverages the Unsloth framework for accelerated fine-tuning.
- **Multimodal Capabilities:** enhanced to handle vision-language interactions.
- **Instruction Optimization:** tailored for improved comprehension and execution of instructions.

## How to Use

### Inference Example (Hugging Face Transformers)

```python
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Llama 3.2 Vision (mllama) checkpoints are loaded with a processor and the
# conditional-generation class rather than AutoTokenizer/AutoModelForCausalLM.
model_id = "Daemontatox/DocumentCogito"
processor = AutoProcessor.from_pretrained(model_id)
model = MllamaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

image = Image.open("sunset_over_mountains.jpg")  # replace with your own image file
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe the image."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
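
If the full bf16 checkpoint does not fit on your GPU, loading in 4-bit through bitsandbytes is a common alternative. This is a sketch assuming `bitsandbytes` is installed; the repository itself does not ship pre-quantized weights:

```python
import torch
from transformers import BitsAndBytesConfig, MllamaForConditionalGeneration

# 4-bit NF4 quantization keeps the 11B model within a single consumer GPU's memory budget.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = MllamaForConditionalGeneration.from_pretrained(
    "Daemontatox/DocumentCogito",
    quantization_config=bnb_config,
    device_map="auto",
)
```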

## [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/Daemontatox__DocumentCogito-details)!
Summarized results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/contents/viewer/default/train?q=Daemontatox%2FDocumentCogito&sort[column]=Average%20%E2%AC%86%EF%B8%8F&sort[direction]=desc)!

|      Metric       |Value (%)|
|-------------------|--------:|
|**Average**        |    24.21|
|IFEval (0-Shot)    |    50.64|
|BBH (3-Shot)       |    29.79|
|MATH Lvl 5 (4-Shot)|    16.24|
|GPQA (0-shot)      |     8.84|
|MuSR (0-shot)      |     8.60|
|MMLU-PRO (5-shot)  |    31.14|
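
The **Average** row matches the unweighted mean of the six benchmark scores, which can be checked directly:

```python
# Benchmark scores in the order listed above: IFEval, BBH, MATH Lvl 5, GPQA, MuSR, MMLU-PRO.
scores = [50.64, 29.79, 16.24, 8.84, 8.60, 31.14]
print(round(sum(scores) / len(scores), 2))  # 24.21
```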