---
license: apache-2.0
model-index:
  - name: LMCocktail-Mistral-7B-v1
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 66.21
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Yhyu13/LMCocktail-Mistral-7B-v1
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 85.69
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Yhyu13/LMCocktail-Mistral-7B-v1
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 61.64
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Yhyu13/LMCocktail-Mistral-7B-v1
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 61.37
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Yhyu13/LMCocktail-Mistral-7B-v1
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 77.35
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Yhyu13/LMCocktail-Mistral-7B-v1
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 47.23
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Yhyu13/LMCocktail-Mistral-7B-v1
          name: Open LLM Leaderboard
---

# LM-cocktail Mistral 7B v1

This is a 50%/50% merge of two of the best Mistral models:

- https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
- https://huggingface.co/xDAN-AI/xDAN-L1-Chat-RL-v1

Both are claimed to beat chatgpt-3.5-turbo on almost all metrics.
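As a rough illustration of what a 50%/50% merge means, here is a minimal sketch that linearly averages the two parents' weights in plain PyTorch. This is not the actual LM-cocktail script used to build this model (see the Code section below), and the output directory name is hypothetical.

```python
import torch
from transformers import AutoModelForCausalLM

# Both parents share the Mistral-7B architecture, so their state
# dicts have identical keys and tensor shapes.
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2", torch_dtype=torch.float16)
other = AutoModelForCausalLM.from_pretrained(
    "xDAN-AI/xDAN-L1-Chat-RL-v1", torch_dtype=torch.float16)

other_state = other.state_dict()

# Equal-weight linear interpolation of every parameter tensor.
merged = {
    name: 0.5 * tensor + 0.5 * other_state[name]
    for name, tensor in base.state_dict().items()
}

base.load_state_dict(merged)
base.save_pretrained("./merged-model")  # hypothetical output dir
```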

## Alpaca Eval

I am thrilled to announce that ChatGPT has ranked LMCocktail 7B as the second-best model, next only to GPT-4, on AlpacaEval in my local community run, even higher than my previously best LMCocktail-10.7B-v1 model. You can also check the leaderboard at ./Alpaca_eval/chatgpt_fn_--LMCocktail-Mistral-7B-v1/.

```
                        win_rate  standard_error  n_total  avg_length
gpt4                       73.79            1.54      805        1365
LMCocktail-7B-v1(new)      73.54            1.55      805        1870
LMCocktail-10.7B-v1(new)   73.45            1.56      804        1203
claude                     70.37            1.60      805        1082
chatgpt                    66.09            1.66      805         811
wizardlm-13b               65.16            1.67      805         985
vicuna-13b                 64.10            1.69      805        1037
guanaco-65b                62.36            1.71      805        1249
oasst-rlhf-llama-33b       62.05            1.71      805        1079
alpaca-farm-ppo-human      60.25            1.72      805         803
falcon-40b-instruct        56.52            1.74      805         662
text_davinci_003           50.00            0.00      805         307
alpaca-7b                  45.22            1.74      805         396
text_davinci_001           28.07            1.56      805         296
```
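For anyone who wants to reproduce a similar local run, here is a hedged sketch using the alpaca_eval package. The `evaluate` entry point is the function behind the package's CLI, `chatgpt_fn` is the annotator config implied by the results folder name above, and the model-outputs path is hypothetical; treat the exact import and arguments as assumptions to verify against the alpaca_eval docs.

```python
# Hedged sketch of a local AlpacaEval run with ChatGPT as the judge.
# Assumes `evaluate` is exposed by the alpaca_eval package
# (pip install alpaca-eval); verify the import path before running.
from alpaca_eval import evaluate

evaluate(
    model_outputs="outputs/LMCocktail-Mistral-7B-v1.json",  # hypothetical path
    annotators_config="chatgpt_fn",  # matches the ./Alpaca_eval folder name above
)
```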

## Code

LM-cocktail is a novel technique for merging multiple models: https://arxiv.org/abs/2311.13534

The code is backed by this repo: https://github.com/FlagOpen/FlagEmbedding.git

The merging scripts are available under the ./scripts folder.
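For illustration, the LM_Cocktail package in that repo exposes a `mix_models` helper; the sketch below shows how a 50/50 mix of the two parents would be expressed with it. The argument names follow the package README as I recall it, so verify them against the repo before running.

```python
# Sketch using the LM_Cocktail package from the FlagOpen/FlagEmbedding
# repo (pip install -U LM_Cocktail). Argument names are taken from the
# package README and may differ across versions.
from LM_Cocktail import mix_models

model = mix_models(
    model_names_or_paths=[
        "mistralai/Mistral-7B-Instruct-v0.2",
        "xDAN-AI/xDAN-L1-Chat-RL-v1",
    ],
    model_type="decoder",                      # causal LM merge
    weights=[0.5, 0.5],                        # the 50%/50% cocktail
    output_path="./LMCocktail-Mistral-7B-v1",  # hypothetical output dir
)
```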

## Open LLM Leaderboard Evaluation Results

Detailed results can be found on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Yhyu13/LMCocktail-Mistral-7B-v1).

| Metric                            | Value |
|-----------------------------------|------:|
| Avg.                              | 66.58 |
| AI2 Reasoning Challenge (25-Shot) | 66.21 |
| HellaSwag (10-Shot)               | 85.69 |
| MMLU (5-Shot)                     | 61.64 |
| TruthfulQA (0-shot)               | 61.37 |
| Winogrande (5-shot)               | 77.35 |
| GSM8k (5-shot)                    | 47.23 |
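These scores come from the leaderboard's standardized harness runs. As a hedged sketch, one of them (the 25-shot ARC-Challenge number) could be reproduced locally with EleutherAI's lm-evaluation-harness roughly as follows; `simple_evaluate` and its argument names are assumptions based on recent lm-eval versions.

```python
# Hedged sketch: reproducing the 25-shot ARC-Challenge score locally
# with lm-evaluation-harness (pip install lm-eval). The API below
# follows lm-eval >= 0.4; verify against the installed version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Yhyu13/LMCocktail-Mistral-7B-v1,dtype=float16",
    tasks=["arc_challenge"],
    num_fewshot=25,
)
print(results["results"]["arc_challenge"])  # reports acc_norm among others
```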