JordiBayarri
committed on
Update README.md
README.md
CHANGED
@@ -47,11 +47,13 @@ Aloe is trained in 20 medical tasks, resulting in a robust and versatile healthc
- Aloe-70B-Beta is the latest iteration in the Aloe family
- Beta more than triples the training data used by Alpha, for a total of 1.8B tokens
@@ -73,14 +75,12 @@ Complete training details, model merging configurations, and all training data (
- Aloe Beta has been tested on the most popular healthcare QA datasets, with and without Medprompt inference technique. Results show competitive performance, achieving SOTA within models of the same size.
- More evaluations coming soon!
@@ -89,12 +89,18 @@ The Beta model has been developed to excel in several different medical tasks. F
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6620f941eba5274b5c12f83d/imK19fzyMUvIJaAbSVnGE.png)
- -->
@@ -243,20 +249,19 @@ We used Deepspeed's Zero-3 distributed training using the following hardware:
- - Medical domain datasets
- - Synthetic data
- - General data:
@@ -280,7 +285,7 @@ The model trained was merged with the Llama-3.1-Instruct model using the DARE_TI
- 2. Red-Teaming Alignment: This step further fine-tunes the model to resist a variety of potential attacks, enhancing its robustness and security.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/62f7a16192950415b637e201/VUYw4IdANKGrH2VOedwH0.png)
+ **Aloe-70B-Beta** is the latest iteration in the **Aloe family**, building on the success of its predecessor, [Aloe-8B-Alpha](https://huggingface.co/HPAI-BSC/Llama3-Aloe-8B-Alpha), at a larger model size.
+ Beta more than **triples** the training data used by Alpha, for a total of **1.8B tokens**, including a wider variety of medical tasks and instructions (e.g., text summarization, explanation, diagnosis, text classification, treatment recommendation, ...).
![image/png](https://cdn-uploads.huggingface.co/production/uploads/62f7a16192950415b637e201/bCuV5kZUT9H9UECAOWDRc.png)
+ To mitigate catastrophic forgetting and enable the model to effectively learn new capabilities like **function calling**, we incorporated a diverse set of high-quality general-purpose data constituting 20% of the total training set. The curated data includes some of the highest-quality content available across a range of topics, including mathematics, programming, STEM, and very long instructions (> 8k tokens), to enrich the model's adaptability and comprehension across diverse domains.
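As a rough illustration of that 80/20 mixture (not the actual Aloe data pipeline), the Hugging Face `datasets` library can interleave a medical and a general source at the stated proportions; the two toy in-memory datasets below are placeholders.

```python
# Toy sketch of the ~80% medical / ~20% general sampling ratio described above.
# This is not the actual Aloe pipeline; both in-memory datasets are placeholders.
from datasets import Dataset, interleave_datasets

medical = Dataset.from_dict({"text": [f"medical example {i}" for i in range(800)]})
general = Dataset.from_dict({"text": [f"general example {i}" for i in range(200)]})

# Draw from each source with the stated probabilities (fixed seed for reproducibility).
mixed = interleave_datasets([medical, general], probabilities=[0.8, 0.2], seed=42)
print(mixed.num_rows, mixed[0])
```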
Beta also boosts the alignment and safety stages with respect to Alpha. This includes a [medical preference dataset](https://huggingface.co/datasets/TsinghuaC3I/UltraMedical-Preference), as well as the red-teaming dataset (available soon).
Complete training details, model merging configurations, and all training data (including synthetically generated data) can be found below. This includes [the RAG system](https://github.com/HPAI-BSC/prompt_engine) that was developed to test Aloe Beta in a deployment setup. Aloe comes with a healthcare-specific risk assessment to facilitate the safe use and deployment of such systems.
## Model Performance
+ Aloe Beta has been tested on the most popular healthcare QA datasets, with and without the **Medprompt** inference technique. Results show competitive performance, achieving SOTA within models of the same size.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6620f941eba5274b5c12f83d/s8rWbwpYTkar5_X_LnOhb.png)
<!---
The Beta model has been developed to excel in several different medical tasks. For this reason, we evaluated it across a wide range of them:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6620f941eba5274b5c12f83d/2NW3im0aH2u6RKp969sjx.png)
+ -->
We also compared the performance of the model in the general domain using the OpenLLM Leaderboard benchmark. Aloe-Beta achieves results competitive with the current SOTA general models on the most widely used general benchmarks and outperforms the medical models:
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6620f941eba5274b5c12f83d/UKW36y9yjqn3Q5OfrCuIc.png)
+ More evaluations coming soon!
## Uses
### Direct Use
The training set consists of around 1.8B tokens, comprising three types of data (a minimal loading sketch follows the list):
+ - Medical domain datasets. Includes data from 20 different medical tasks.
- [HPAI-BSC/Aloe-Beta-General-Collection](https://huggingface.co/datasets/HPAI-BSC/Aloe-Beta-General-Collection)
- [HPAI-BSC/chain-of-diagnosis](https://huggingface.co/datasets/HPAI-BSC/chain-of-diagnosis)
- [HPAI-BSC/MedS-Ins](https://huggingface.co/datasets/HPAI-BSC/MedS-Ins)
- [HPAI-BSC/ultramedical](https://huggingface.co/datasets/HPAI-BSC/ultramedical)
+ - Synthetic data. We expanded our training data by generating high-quality answers using Llama3.1-70B:
- [HPAI-BSC/pubmedqa-cot-llama31](https://huggingface.co/datasets/HPAI-BSC/pubmedqa-cot-llama31)
- [HPAI-BSC/medqa-cot-llama31](https://huggingface.co/datasets/HPAI-BSC/medqa-cot-llama31)
- [HPAI-BSC/medmcqa-cot-llama31](https://huggingface.co/datasets/HPAI-BSC/medmcqa-cot-llama31)
- [HPAI-BSC/headqa-cot-llama31](https://huggingface.co/datasets/HPAI-BSC/headqa-cot-llama31)
- [HPAI-BSC/MMLU-medical-cot-llama31](https://huggingface.co/datasets/HPAI-BSC/MMLU-medical-cot-llama31)
- [HPAI-BSC/Polymed-QA](https://huggingface.co/datasets/HPAI-BSC/Polymed-QA)
+ - General data. It includes maths, STEM, code, function calling, and very long instructions.
- [HPAI-BSC/Aloe-Beta-General-Collection](https://huggingface.co/datasets/HPAI-BSC/Aloe-Beta-General-Collection)
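All of the collections above are public on the Hugging Face Hub and can be inspected with the `datasets` library. The sketch below is a minimal example; the `train` split name is an assumption, so check each dataset card for its exact configuration.

```python
# Minimal sketch for inspecting parts of the Aloe-Beta training mixture.
# The "train" split is an assumption; verify it on each dataset card.
from datasets import load_dataset

synthetic_cot = load_dataset("HPAI-BSC/medqa-cot-llama31", split="train")        # synthetic CoT answers
general = load_dataset("HPAI-BSC/Aloe-Beta-General-Collection", split="train")   # general-purpose data

print(synthetic_cot)     # number of rows and column names
print(synthetic_cot[0])  # a single example record
```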
#### Training parameters
The model is aligned using the Direct Preference Optimization (DPO) technique through a two-step process (a minimal training sketch follows the list):
1. General DPO Alignment: This step uses a dataset combining medical, general preference, and safety data. We used our dataset [HPAI-BSC/Aloe-Beta-DPO](https://huggingface.co/datasets/HPAI-BSC/Aloe-Beta-DPO). We split the dataset into five parts, and the model was trained iteratively for one epoch on each chunk. We used a learning rate of 2e-7.
+ 2. Red-Teaming Alignment: This step further fine-tunes the model to resist a variety of potential attacks, enhancing its robustness and security. The dataset will be shared soon. In this stage, we set the learning rate to 1e-7.
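For orientation, the two-stage recipe above maps naturally onto TRL's `DPOTrainer`. The sketch below is a minimal single-process approximation, not the exact recipe: the starting checkpoint is a placeholder for the merged SFT model (the real runs used DeepSpeed ZeRO-3 across multiple nodes), the dataset split and column names are assumptions, and the red-teaming dataset id is hypothetical because that data is not yet public.

```python
# Minimal sketch of the two-stage DPO alignment described above (not the exact recipe).
# Assumptions: the DPO dataset exposes "prompt"/"chosen"/"rejected" columns and a "train"
# split, and a recent TRL release is installed (older versions pass the tokenizer via
# `tokenizer=` instead of `processing_class=`).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Llama-3.1-70B-Instruct"  # placeholder for the merged Aloe SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Stage 1: general DPO alignment over five chunks, one epoch each, at lr = 2e-7.
dpo_data = load_dataset("HPAI-BSC/Aloe-Beta-DPO", split="train")
for i in range(5):
    chunk = dpo_data.shard(num_shards=5, index=i)
    args = DPOConfig(output_dir=f"aloe-dpo-stage1-chunk{i}",
                     num_train_epochs=1, learning_rate=2e-7)
    DPOTrainer(model=model, args=args, train_dataset=chunk,
               processing_class=tokenizer).train()

# Stage 2: red-teaming alignment at lr = 1e-7. The dataset is not yet released, so the
# id below is hypothetical and the stage is left commented out.
# red_team = load_dataset("HPAI-BSC/Aloe-Beta-RedTeaming-DPO", split="train")
# args = DPOConfig(output_dir="aloe-dpo-stage2", num_train_epochs=1, learning_rate=1e-7)
# DPOTrainer(model=model, args=args, train_dataset=red_team,
#            processing_class=tokenizer).train()
```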
<!---
^^^ LINKS TO DPO DATA (DPO added, missing the RT^^^