Update README.md
README.md CHANGED
@@ -139,12 +139,40 @@ Teuken-7B-base-v0.4 was pre-trained on 4 trillion tokens of data from publicly a
The pretraining data has a cutoff of September 2023.
More information is available in our [preprint](http://arxiv.org/abs/2410.08800).

-For composing the final instruction-tuning dataset, termed "Honey", we first include all German examples. We then aim to include roughly the same number of English examples as we have German examples:
-1. Add all multi-turn examples
-2. Add the entire code_alpaca dataset subset
-3. Add the entire lmsys_chat_1m_high_quality_train_en dataset subset
-4. For the remaining dataset subsets ("open_orca", "evol_instruct_143k", "evol_instruct_70k", "bactrianx_EN"), add the examples with the highest reward scores ("quality score") so that each dataset subset contributes an equal number of high-quality examples
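The removed steps above amount to a selection rule: keep every German example, keep the multi-turn, code_alpaca, and lmsys_chat_1m_high_quality_train_en subsets in full, and then top up the remaining English subsets by reward score so that each contributes an equal share. Below is a minimal sketch of that last top-up step, assuming each example is a dict carrying illustrative `dataset` and `quality_score` fields (neither field name is taken from the model card):

```python
from collections import defaultdict

def top_up_equally(examples, subsets, budget):
    """Pick the highest-scoring examples from each named subset so that
    every subset contributes an equal share of the overall budget."""
    per_subset = budget // len(subsets)  # equal contribution per subset
    by_subset = defaultdict(list)
    for ex in examples:
        if ex["dataset"] in subsets:
            by_subset[ex["dataset"]].append(ex)

    selected = []
    for name in subsets:
        # rank by the precomputed reward ("quality") score, best first
        ranked = sorted(by_subset[name], key=lambda ex: ex["quality_score"], reverse=True)
        selected.extend(ranked[:per_subset])
    return selected

# Illustrative call: fill the English side up to roughly the German example count.
# extra_en = top_up_equally(
#     english_pool,
#     ["open_orca", "evol_instruct_143k", "evol_instruct_70k", "bactrianx_EN"],
#     budget=target_en_count - len(already_selected_en),
# )
```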
+### Instruction-Tuning Data
+
+### English
+
+| Dataset file                                          | Sample Count |
+| ----------------------------------------------------- | ------------ |
+| en/bactrianx_EN_fastchat.jsonl                        | 66985        |
+| en/code_alpaca_fastchat.jsonl                         | 19990        |
+| en/evol_instruct_143k_fastchat.jsonl                  | 142968       |
+| en/evol_instruct_70k_fastchat.jsonl                   | 69968        |
+| en/lmsys_chat_1m_high_quality_train_en_fastchat.jsonl | 18651        |
+| en/open_orca_fastchat_aa.jsonl                        | 599968       |
+| en/open_orca_fastchat_ab.jsonl                        | 599968       |
+| en/open_orca_fastchat_ac.jsonl                        | 599968       |
+| en/open_orca_fastchat_ad.jsonl                        | 599968       |
+| en/open_orca_fastchat_ag.jsonl                        | 599968       |
+| en/open_orca_fastchat_ah.jsonl                        | 33891        |
+| en/sharegpt_v3_unfiltered_fastchat.jsonl              | 93880        |
+| en/ultrachat_200k_fastchat.jsonl                      | 11525        |
+| **total**                                             | **3457698**  |
+
+### German
+
+| Dataset file                                                | Sample Count |
+| ----------------------------------------------------------- | ------------ |
+| de/bactrianx_DE_fastchat.jsonl                              | 67017        |
+| de/freedomintelligence_alpaca_gpt4_deutsch_fastchat.jsonl   | 49969        |
+| de/freedomintelligence_evol_instruct_deutsch_fastchat.jsonl | 59022        |
+| de/freedomintelligence_sharegpt_deutsch_fastchat.jsonl      | 6101         |
+| de/german_poems_fastchat.jsonl                              | 400          |
+| de/german_songs_fastchat.jsonl                              | 1000         |
+| de/ultrachat_de_1k_fastchat.jsonl                           | 959          |
+| **total**                                                   | **184468**   |
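For reference, the "Sample Count" values in the tables added above are per-file record counts. Assuming the usual JSON Lines layout (one fastchat-formatted conversation per line) and an illustrative directory root, they can be recomputed with a short script like the following; neither the root path nor the script is part of the model card:

```python
from pathlib import Path

def count_samples(root="instruction_data"):
    """Count records per *_fastchat.jsonl file and sum them per language split."""
    counts = {}
    for path in sorted(Path(root).glob("*/*_fastchat.jsonl")):  # en/... and de/...
        with path.open(encoding="utf-8") as f:
            counts[f"{path.parent.name}/{path.name}"] = sum(1 for line in f if line.strip())

    for split in ("en", "de"):
        total = sum(n for name, n in counts.items() if name.startswith(f"{split}/"))
        print(f"{split} total: {total}")
    return counts
```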
### Training Procedure