---
language:
- en
license: mit
library_name: transformers
model-index:
- name: free-evo-qwen72b-v0.8-re
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 79.86
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=freewheelin/free-evo-qwen72b-v0.8-re
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 91.34
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=freewheelin/free-evo-qwen72b-v0.8-re
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 78.00
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=freewheelin/free-evo-qwen72b-v0.8-re
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 74.85
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=freewheelin/free-evo-qwen72b-v0.8-re
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 87.77
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=freewheelin/free-evo-qwen72b-v0.8-re
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 75.89
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=freewheelin/free-evo-qwen72b-v0.8-re
      name: Open LLM Leaderboard
---

# Model Card for free-evo-qwen72b-v0.8

## Developed by: [Freewheelin](https://freewheelin-recruit.oopy.io/) AI Technical Team

## 4 May 2024: avg. 81.28 on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)

| Metric              | Value |
|---------------------|------:|
| Avg.                | 81.28 |
| ARC (25-Shot)       | 79.86 |
| HellaSwag (10-Shot) | 91.32 |
| MMLU (5-Shot)       | 78.00 |
| TruthfulQA (0-shot) | 74.85 |
| Winogrande (5-shot) | 87.77 |
| GSM8k (5-shot)      | 75.89 |

## Method

- We were inspired by the [Sakana project](https://sakana.ai/evolutionary-model-merge/) on evolutionary model merging.

## Process

You need two models with the same architecture.

- Choose one model and fine-tune it to create a gap between the original model and the fine-tuned one. It doesn't matter whether the fine-tuned model's evaluation score is higher or lower.
- Merge the two models.
- Evaluate the merged model.
- If you need a higher score on a specific benchmark, fine-tune the model for that benchmark. (It is unlikely to work as you expect, but you can try.)
- Merge the models again.
- Evaluate again.
- Keep going until the average evaluation score is higher than the original model's.

That's it. Simple. You can create a framework to automate this process.

## Base Architecture

- QWEN2

## Base Models

- several QWEN2 based models
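
## Example: automating the process

The loop described under Process can be sketched as follows. This is a minimal illustration, not the team's actual pipeline: the card does not say which merge operator was used, so plain linear weight interpolation is an assumption here. The names `merge_linear`, `evolve`, `toy_fine_tune`, and `toy_evaluate` are hypothetical, and the "models" are small Python lists rather than real checkpoints.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# Toy stand-in for a model state dict: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def merge_linear(a: Weights, b: Weights, alpha: float) -> Weights:
    """Merge two models parameter-wise as alpha * a + (1 - alpha) * b.

    Assumption: linear interpolation; the card does not name its merge operator.
    """
    if a.keys() != b.keys():
        raise ValueError("models must share the same architecture (parameter names)")
    merged: Weights = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * x + (1 - alpha) * y for x, y in zip(a[name], b[name])]
    return merged


def evolve(
    base: Weights,
    fine_tune: Callable[[Weights], Weights],
    evaluate: Callable[[Weights], float],
    alphas: Sequence[float] = (0.25, 0.5, 0.75),
    max_rounds: int = 20,
) -> Tuple[Weights, float]:
    """Fine-tune, merge, evaluate; repeat until the score beats the original."""
    baseline = evaluate(base)
    best, best_score = base, baseline
    for _ in range(max_rounds):
        variant = fine_tune(best)  # create a gap; its own score does not matter
        for alpha in alphas:
            candidate = merge_linear(best, variant, alpha)
            score = evaluate(candidate)
            if score > best_score:
                best, best_score = candidate, score
        if best_score > baseline:
            break  # the merged model now beats the original
    return best, best_score


# Hypothetical toy task: the "benchmark" rewards weights close to a hidden target.
TARGET = [1.0, -2.0, 0.5]


def toy_evaluate(weights: Weights) -> float:
    return -sum((x - t) ** 2 for x, t in zip(weights["w"], TARGET))


def toy_fine_tune(weights: Weights, rng: random.Random) -> Weights:
    # Random perturbation standing in for a real fine-tuning run.
    return {name: [x + rng.gauss(0.0, 1.0) for x in vals] for name, vals in weights.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    base = {"w": [0.0, 0.0, 0.0]}
    best, score = evolve(base, lambda w: toy_fine_tune(w, rng), toy_evaluate)
    print("baseline:", toy_evaluate(base), "merged:", score)
```

In a real setup, `fine_tune` would launch a training run, `evaluate` would average the benchmark scores, and the merge would operate on checkpoint tensors. The loop itself stays the same.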