danielsteinigen
committed on
Update README.md
README.md
CHANGED
@@ -32,7 +32,7 @@ library_name: transformers
base_model:
- openGPT-X/Teuken-7B-base-v0.4
---
- # Model Card for Teuken-7B-instruct-v0.4
+ # Model Card for Teuken-7B-instruct-research-v0.4


[Teuken-7B-base-v0.4](https://huggingface.co/openGPT-X/Teuken-7B-base-v0.4) is a 7B parameter multilingual large language model (LLM) pre-trained with 4T tokens within the research project OpenGPT-X.

@@ -52,7 +52,7 @@ Teuken-7B-instruct-v0.4 is an instruction-tuned version of [Teuken-7B-base-v0.4]
## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
- Teuken-7B-instruct-v0.4 is intended for
+ Teuken-7B-instruct-research-v0.4 is intended for research use in all official 24 European languages. Since Teuken-7B-instruct-research-v0.4 focuses on covering all 24 EU languages, it renders more stable results across these languages and better reflects European values in its answers than English-centric models. It is therefore specialized for use in multilingual tasks.

## Disclaimer Toxic Content:

@@ -69,7 +69,7 @@ The model is not intended for use in math and coding tasks.
<!-- This section is meant to convey both technical and sociotechnical limitations. -->

- Teuken-7B-instruct-v0.4 is an instruction-tuned version of [Teuken-7B-base-v0.4](https://huggingface.co/openGPT-X/Teuken-7B-base-v0.4) that is not completely free from biases and hallucinations.
+ Teuken-7B-instruct-research-v0.4 is an instruction-tuned version of [Teuken-7B-base-v0.4](https://huggingface.co/openGPT-X/Teuken-7B-base-v0.4) that is not completely free from biases and hallucinations.

## How to Get Started with the Model

@@ -142,37 +142,48 @@ More information are available in our [preprint](http://arxiv.org/abs/2410.08800
### Instruction-Tuning Data

+ For the dataset composition, we used a selection of English and German datasets from which we sampled our final dataset with an equal distribution between German and English, as shown in the following tables.

### English

+ * We only included a subsample of the OpenOrca dataset.
+ * For the LMSYS-Chat dataset, we selected only the conversations that satisfy the high-quality criteria of [LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset](https://arxiv.org/abs/2309.11998), i.e., the model answer stems from one of "GPT-3.5-turbo", "GPT-4", "Claude-1", "Claude-instant-1" or "Claude-2" and the conversation is in English.
+ * To select instruction-tuning examples based on their quality, we calculated reward scores for all English examples using [Starling-RM-7B-alpha](https://huggingface.co/berkeley-nest/Starling-RM-7B-alpha) (Apache-2.0 license).

+ For the English data, we performed the following steps for sample selection:
+ 1. Add all multi-turn examples.
+ 2. Add the entire `code_alpaca` dataset subset.
+ 3. Add the entire `lmsys_chat_1m_high_quality_train_en` dataset subset.
+ 4. For the remaining dataset subsets (`open_orca`, `evol_instruct_143k`, `evol_instruct_70k`, `sharegpt_v3`, `ultrachat_200k`, `bactrianx_EN`), add the samples with the highest reward scores so that each dataset subset contributes an equal number of high-quality examples.

+ | Dataset | Sample Count |
| ----------------------------------------------------- | ------------ |
- | en/open_orca_fastchat_ag.jsonl | 599968 |
- | en/open_orca_fastchat_ah.jsonl | 33891 |
- | en/sharegpt_v3_unfiltered_fastchat.jsonl | 93880 |
- | en/ultrachat_200k_fastchat.jsonl | 11525 |
- | **total** | **3457698** |
+ | anon8231489123/ShareGPT_Vicuna_unfiltered | 37.6K |
+ | MBZUAI/Bactrian-X | 26.9K |
+ | Open-Orca/OpenOrca | 26.9K |
+ | WizardLM/WizardLM_evol_instruct_70k | 26.9K |
+ | WizardLM/WizardLM_evol_instruct_V2_196k | 26.8K |
+ | sahil2801/CodeAlpaca-20k | 12.1K |
+ | lmsys/lmsys-chat-1m | 11.2K |
+ | HuggingFaceH4/ultrachat_200k | 7.0K |
+ | **total** | **175.5K** |

### German

+ For the German data, we include the complete datasets listed in the following table:

+ | Dataset | Sample Count |
| ----------------------------------------------------------- | ------------ |
- | **total** | **
+ | MBZUAI/Bactrian-X DE | 63.7K |
+ | FreedomIntelligence/evol-instruct-deutsch | 55.9K |
+ | FreedomIntelligence/alpaca-gpt4-deutsch | 47.5K |
+ | FreedomIntelligence/sharegpt-deutsch | 5.8K |
+ | LeoLM/German_Songs | 943 |
+ | LeoLM/German_Poems | 378 |
+ | bjoernp/ultrachat_de | 909 |
+ | **total** | **175.13K** |

### Training Procedure

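The LMSYS-Chat filtering rule added above (keep a conversation only if it was answered by one of the listed GPT/Claude models and is in English) can be sketched in a few lines. This is an editorial illustration, not the OpenGPT-X preprocessing code; the field names follow the public `lmsys/lmsys-chat-1m` schema (`model`, `language`), and the exact model-name spellings used for matching are assumptions.

```python
# Illustrative sketch (not the OpenGPT-X pipeline): keep only LMSYS-Chat-1M
# conversations that meet the high-quality criteria described above.
# Field names follow the public lmsys/lmsys-chat-1m schema; the model-name
# spellings in the allow-list are an assumption here.
HIGH_QUALITY_MODELS = {
    "gpt-3.5-turbo", "gpt-4", "claude-1", "claude-instant-1", "claude-2",
}

def is_high_quality(record: dict) -> bool:
    """Keep a record only if it was answered by a strong model and is English."""
    return (
        record.get("model", "").lower() in HIGH_QUALITY_MODELS
        and record.get("language") == "English"
    )

def filter_lmsys(records: list[dict]) -> list[dict]:
    return [r for r in records if is_high_quality(r)]
```

A filter along these lines would produce the `lmsys_chat_1m_high_quality_train_en` subset referenced in the selection steps.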
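The four English selection steps can likewise be sketched. The snippet below is a simplified editorial illustration: it assumes every example already carries a subset label, a multi-turn flag, and a precomputed Starling-RM-7B-alpha reward score; the field names and the `per_subset_quota` parameter are hypothetical.

```python
# Illustrative sketch of the English sample-selection steps described above.
# Each example is assumed to be a dict with hypothetical fields:
#   "subset"        - name of the source dataset subset
#   "reward"        - precomputed Starling-RM-7B-alpha reward score
#   "is_multi_turn" - True if the conversation has more than one turn
REMAINING_SUBSETS = [
    "open_orca", "evol_instruct_143k", "evol_instruct_70k",
    "sharegpt_v3", "ultrachat_200k", "bactrianx_EN",
]

def select_english(examples: list[dict], per_subset_quota: int) -> list[dict]:
    selected = []

    # Steps 1-3: take all multi-turn examples plus the complete
    # code_alpaca and lmsys_chat_1m_high_quality_train_en subsets.
    for ex in examples:
        if ex["is_multi_turn"] or ex["subset"] in (
            "code_alpaca",
            "lmsys_chat_1m_high_quality_train_en",
        ):
            selected.append(ex)

    # Step 4: from each remaining subset, add the highest-reward examples so
    # that every subset contributes an equal number of high-quality samples.
    already_chosen = {id(ex) for ex in selected}
    for name in REMAINING_SUBSETS:
        pool = [
            ex for ex in examples
            if ex["subset"] == name and id(ex) not in already_chosen
        ]
        pool.sort(key=lambda ex: ex["reward"], reverse=True)
        selected.extend(pool[:per_subset_quota])

    return selected
```

With a quota of roughly 27K per remaining subset, a procedure of this kind yields per-subset counts in the range reported in the English table above.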
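Finally, the added composition paragraph states that the final instruction-tuning mix is sampled with an equal split between English and German (both pools end up at roughly 175K examples according to the tables). A minimal sketch of such a balancing step, with hypothetical pool variables, could look like this:

```python
import random

# Illustrative sketch: combine the English and German pools with an equal
# share per language, as described in the dataset-composition paragraph.
# The pool contents and the seed are assumptions; the actual OpenGPT-X
# sampling code is not part of this model card.
def balance_pools(english: list[dict], german: list[dict], seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    per_language = min(len(english), len(german))  # equal contribution per language
    mixed = rng.sample(english, per_language) + rng.sample(german, per_language)
    rng.shuffle(mixed)
    return mixed
```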