---
license: apache-2.0
model-index:
- name: LMCocktail-Mistral-7B-v1
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 66.21
      name: normalized accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Yhyu13/LMCocktail-Mistral-7B-v1
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 85.69
      name: normalized accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Yhyu13/LMCocktail-Mistral-7B-v1
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 61.64
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Yhyu13/LMCocktail-Mistral-7B-v1
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 61.37
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Yhyu13/LMCocktail-Mistral-7B-v1
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 77.35
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Yhyu13/LMCocktail-Mistral-7B-v1
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 47.23
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Yhyu13/LMCocktail-Mistral-7B-v1
      name: Open LLM Leaderboard
---
# LM-cocktail Mistral 7B v1

This is a 50%-50% merge of two of the best Mistral models:

- https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
- https://huggingface.co/xDAN-AI/xDAN-L1-Chat-RL-v1

Both are claimed to be better than chatgpt-3.5-turbo on almost all metrics.
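Conceptually, a 50%-50% cocktail of two same-architecture checkpoints reduces to an element-wise average of their weights. The snippet below is a minimal plain-PyTorch sketch of that averaging step; the released model was produced with the LM-Cocktail scripts referenced in the Code section, so treat this only as an illustration of the idea, and note that the output path is hypothetical.

```python
# Minimal sketch: 50/50 parameter average of two same-architecture checkpoints.
# The released model was built with the LM-Cocktail scripts (see the Code
# section below); this plain-PyTorch version only illustrates the idea.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

a = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2", torch_dtype=torch.bfloat16)
b = AutoModelForCausalLM.from_pretrained(
    "xDAN-AI/xDAN-L1-Chat-RL-v1", torch_dtype=torch.bfloat16)

state_a, state_b = a.state_dict(), b.state_dict()
# Element-wise average: 0.5 * A + 0.5 * B for every parameter tensor.
merged = {k: 0.5 * state_a[k] + 0.5 * state_b[k] for k in state_a}

a.load_state_dict(merged)
a.save_pretrained("./merged-cocktail")  # hypothetical output path
AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2").save_pretrained("./merged-cocktail")
```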
## Alpaca Eval

I am thrilled to announce that ChatGPT ranked LMCocktail 7B as the second-best model, just behind GPT-4, on AlpacaEval in my local community run, even ahead of my previously best LMCocktail-10.7B-v1 model. You can also check the leaderboard at ./Alpaca_eval/chatgpt_fn_--LMCocktail-Mistral-7B-v1/. A sketch of reproducing such a run follows the table below.
| Model                    | win_rate | standard_error | n_total | avg_length |
|--------------------------|---------:|---------------:|--------:|-----------:|
| gpt4                     |    73.79 |           1.54 |     805 |       1365 |
| LMCocktail-7B-v1(new)    |    73.54 |           1.55 |     805 |       1870 |
| LMCocktail-10.7B-v1(new) |    73.45 |           1.56 |     804 |       1203 |
| claude                   |    70.37 |           1.60 |     805 |       1082 |
| chatgpt                  |    66.09 |           1.66 |     805 |        811 |
| wizardlm-13b             |    65.16 |           1.67 |     805 |        985 |
| vicuna-13b               |    64.10 |           1.69 |     805 |       1037 |
| guanaco-65b              |    62.36 |           1.71 |     805 |       1249 |
| oasst-rlhf-llama-33b     |    62.05 |           1.71 |     805 |       1079 |
| alpaca-farm-ppo-human    |    60.25 |           1.72 |     805 |        803 |
| falcon-40b-instruct      |    56.52 |           1.74 |     805 |        662 |
| text_davinci_003         |    50.00 |           0.00 |     805 |        307 |
| alpaca-7b                |    45.22 |           1.74 |     805 |        396 |
| text_davinci_001         |    28.07 |           1.56 |     805 |        296 |
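The folder name above suggests the run used AlpacaEval's cheaper `chatgpt_fn` annotator config. Below is a hedged sketch of such a run; the `evaluate` entry point and its argument names are taken from the tatsu-lab/alpaca_eval README rather than from this repo, and the outputs path is hypothetical.

```python
# Hedged sketch of a local AlpacaEval run with the chatgpt_fn annotator.
# `evaluate` and its arguments follow the tatsu-lab/alpaca_eval README
# (pip install alpaca-eval); they are assumptions, not this repo's code.
from alpaca_eval import evaluate

evaluate(
    model_outputs="model_outputs.json",  # hypothetical: generations for the 805 eval prompts
    annotators_config="chatgpt_fn",      # ChatGPT-as-judge config matching the folder name above
)
```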
## Code

LM-Cocktail is a novel technique for merging multiple models: https://arxiv.org/abs/2311.13534

The code is backed by this repo: https://github.com/FlagOpen/FlagEmbedding.git

Merging scripts are available under the ./scripts folder; a sketch of the merging interface follows.
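For reference, a 50%-50% merge along the lines of this model would look roughly like the snippet below, which follows the `LM_Cocktail` package interface documented in the FlagEmbedding repo. The argument names are taken from that README and are assumptions, not a copy of this repo's ./scripts.

```python
# Rough sketch following the LM_Cocktail package from FlagEmbedding
# (pip install -U LM_Cocktail); argument names are taken from its README
# and are assumptions rather than a copy of this repo's ./scripts.
from LM_Cocktail import mix_models

model = mix_models(
    model_names_or_paths=[
        "mistralai/Mistral-7B-Instruct-v0.2",
        "xDAN-AI/xDAN-L1-Chat-RL-v1",
    ],
    model_type="decoder",   # decoder-only LLMs
    weights=[0.5, 0.5],     # the 50%-50% cocktail described above
    output_path="./LMCocktail-Mistral-7B-v1",  # hypothetical output path
)
```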
## Open LLM Leaderboard Evaluation Results

Detailed results can be found here.

| Metric                            | Value |
|-----------------------------------|------:|
| Avg.                              | 66.58 |
| AI2 Reasoning Challenge (25-Shot) | 66.21 |
| HellaSwag (10-Shot)               | 85.69 |
| MMLU (5-Shot)                     | 61.64 |
| TruthfulQA (0-shot)               | 61.37 |
| Winogrande (5-shot)               | 77.35 |
| GSM8k (5-shot)                    | 47.23 |