Update README.md
README.md CHANGED
@@ -139,12 +139,40 @@ Teuken-7B-base-v0.4 was pre-trained on 4 trillion tokens of data from publicly a
The pretraining data has a cutoff of September 2023.
More information is available in our [preprint](http://arxiv.org/abs/2410.08800).

-For composing the final instruction-tuning dataset, termed "Honey", we first include all German examples. We then aim to include roughly the same number of English examples as we have German examples:
-1. Add all multi-turn examples
-2. Add the entire code_alpaca dataset subset
-3. Add the entire lmsys_chat_1m_high_quality_train_en dataset subset
-4. For the remaining dataset subsets ("open_orca", "evol_instruct_143k", "evol_instruct_70k", "bactrianx_EN"), add the examples with the highest reward scores ("quality score") so that each dataset subset contributes an equal number of high-quality examples
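The removed steps above amount to a selection rule: keep every German example, keep the multi-turn, code_alpaca, and lmsys_chat_1m_high_quality_train_en subsets in full, and then top up the remaining English subsets by reward score so that each contributes an equal share. Below is a minimal sketch of that last top-up step, assuming each example is a dict carrying illustrative `dataset` and `quality_score` fields (neither field name is taken from the model card):

```python
from collections import defaultdict

def top_up_equally(examples, subsets, budget):
    """Pick the highest-scoring examples from each named subset so that
    every subset contributes an equal share of the overall budget."""
    per_subset = budget // len(subsets)  # equal contribution per subset
    by_subset = defaultdict(list)
    for ex in examples:
        if ex["dataset"] in subsets:
            by_subset[ex["dataset"]].append(ex)

    selected = []
    for name in subsets:
        # rank by the precomputed reward ("quality") score, best first
        ranked = sorted(by_subset[name], key=lambda ex: ex["quality_score"], reverse=True)
        selected.extend(ranked[:per_subset])
    return selected

# Illustrative call: fill the English side up to roughly the German example count.
# extra_en = top_up_equally(
#     english_pool,
#     ["open_orca", "evol_instruct_143k", "evol_instruct_70k", "bactrianx_EN"],
#     budget=target_en_count - len(already_selected_en),
# )
```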
+### Instruction-Tuning Data
+
+### English
+
+| Dataset file                                          | Sample Count |
+| ----------------------------------------------------- | ------------ |
+| en/bactrianx_EN_fastchat.jsonl                        | 66985        |
+| en/code_alpaca_fastchat.jsonl                         | 19990        |
+| en/evol_instruct_143k_fastchat.jsonl                  | 142968       |
+| en/evol_instruct_70k_fastchat.jsonl                   | 69968        |
+| en/lmsys_chat_1m_high_quality_train_en_fastchat.jsonl | 18651        |
+| en/open_orca_fastchat_aa.jsonl                        | 599968       |
+| en/open_orca_fastchat_ab.jsonl                        | 599968       |
+| en/open_orca_fastchat_ac.jsonl                        | 599968       |
+| en/open_orca_fastchat_ad.jsonl                        | 599968       |
+| en/open_orca_fastchat_ag.jsonl                        | 599968       |
+| en/open_orca_fastchat_ah.jsonl                        | 33891        |
+| en/sharegpt_v3_unfiltered_fastchat.jsonl              | 93880        |
+| en/ultrachat_200k_fastchat.jsonl                      | 11525        |
+| **total**                                             | **3457698**  |
+
+### German
+
+| Dataset file                                                | Sample Count |
+| ----------------------------------------------------------- | ------------ |
+| de/bactrianx_DE_fastchat.jsonl                              | 67017        |
+| de/freedomintelligence_alpaca_gpt4_deutsch_fastchat.jsonl   | 49969        |
+| de/freedomintelligence_evol_instruct_deutsch_fastchat.jsonl | 59022        |
+| de/freedomintelligence_sharegpt_deutsch_fastchat.jsonl      | 6101         |
+| de/german_poems_fastchat.jsonl                              | 400          |
+| de/german_songs_fastchat.jsonl                              | 1000         |
+| de/ultrachat_de_1k_fastchat.jsonl                           | 959          |
+| **total**                                                   | **184468**   |
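For reference, the "Sample Count" values in the tables added above are per-file record counts. Assuming the usual JSON Lines layout (one fastchat-formatted conversation per line) and an illustrative directory root, they can be recomputed with a short script like the following; neither the root path nor the script is part of the model card:

```python
from pathlib import Path

def count_samples(root="instruction_data"):
    """Count records per *_fastchat.jsonl file and sum them per language split."""
    counts = {}
    for path in sorted(Path(root).glob("*/*_fastchat.jsonl")):  # en/... and de/...
        with path.open(encoding="utf-8") as f:
            counts[f"{path.parent.name}/{path.name}"] = sum(1 for line in f if line.strip())

    for split in ("en", "de"):
        total = sum(n for name, n in counts.items() if name.startswith(f"{split}/"))
        print(f"{split} total: {total}")
    return counts
```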
### Training Procedure