mfromm committed on
Commit
8204548
·
verified ·
1 Parent(s): 4acfc1e

Update README.md

  1. README.md +33 -5
README.md CHANGED
@@ -139,12 +139,40 @@ Teuken-7B-base-v0.4 was pre-trained on 4 trillion tokens of data from publicly a
139
  The pretraining data has a cutoff of September 2023.
140
  More information is available in our [preprint](http://arxiv.org/abs/2410.08800).
141
 
142
- To compose the final instruction-tuning dataset, termed "Honey", we first include all German examples. We then aim to include roughly as many English examples as German examples:
143
- 1. Add all multi-turn examples
144
- 2. Add the entire code_alpaca dataset subset
145
- 3. Add the entire lmsys_chat_1m_high_quality_train_en dataset subset
146
- 4. For the remaining dataset subsets ("open_orca", "evol_instruct_143k", "evol_instruct_70k", "bactrianx_EN") add the examples with the highest reward scores ("quality score") so that each dataset subset contributes an equal amount of high-quality examples
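The four selection steps in the removed text can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' pipeline: the record keys `subset`, `num_turns`, and `reward` are hypothetical, `target_size` stands in for the number of German examples, and the equal per-subset quota is implemented with plain integer division.

```python
from collections import defaultdict

# Subsets the text says are added in full (steps 2 and 3).
FULL_SUBSETS = {"code_alpaca", "lmsys_chat_1m_high_quality_train_en"}
# Subsets that are filled up by reward score (step 4).
RANKED_SUBSETS = ["open_orca", "evol_instruct_143k", "evol_instruct_70k", "bactrianx_EN"]


def select_english(examples, target_size):
    """Pick English examples following the four steps described above.

    `examples` is a list of dicts with the hypothetical keys
    'subset', 'num_turns', and 'reward'.
    """
    selected = []
    candidates = defaultdict(list)
    for ex in examples:
        # Step 1: all multi-turn examples; steps 2 and 3: entire subsets.
        if ex["num_turns"] > 1 or ex["subset"] in FULL_SUBSETS:
            selected.append(ex)
        elif ex["subset"] in RANKED_SUBSETS:
            candidates[ex["subset"]].append(ex)

    # Step 4: split the remaining budget equally across the ranked subsets
    # and take the highest-reward examples from each one.
    remaining = max(target_size - len(selected), 0)
    per_subset = remaining // len(RANKED_SUBSETS)
    for name in RANKED_SUBSETS:
        ranked = sorted(candidates[name], key=lambda ex: ex["reward"], reverse=True)
        selected.extend(ranked[:per_subset])
    return selected
```

In this sketch a subset with fewer candidates than its quota simply contributes fewer examples; how the original procedure handled that case is not stated in the text.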
147
 
148
 
149
  ### Training Procedure
150
 
 
139
  The pretraining data has a cutoff of September 2023.
140
  More information is available in our [preprint](http://arxiv.org/abs/2410.08800).
141
 
142
 
143
+ ### Instruction-Tuning Data
144
+
145
+ ### English
146
+
147
+ | Dataset file | Sample Count |
148
+ | ----------------------------------------------------- | ------------ |
149
+ | en/bactrianx_EN_fastchat.jsonl | 66985 |
150
+ | en/code_alpaca_fastchat.jsonl | 19990 |
151
+ | en/evol_instruct_143k_fastchat.jsonl | 142968 |
152
+ | en/evol_instruct_70k_fastchat.jsonl | 69968 |
153
+ | en/lmsys_chat_1m_high_quality_train_en_fastchat.jsonl | 18651 |
154
+ | en/open_orca_fastchat_aa.jsonl | 599968 |
155
+ | en/open_orca_fastchat_ab.jsonl | 599968 |
156
+ | en/open_orca_fastchat_ac.jsonl | 599968 |
157
+ | en/open_orca_fastchat_ad.jsonl | 599968 |
158
+ | en/open_orca_fastchat_ag.jsonl | 599968 |
159
+ | en/open_orca_fastchat_ah.jsonl | 33891 |
160
+ | en/sharegpt_v3_unfiltered_fastchat.jsonl | 93880 |
161
+ | en/ultrachat_200k_fastchat.jsonl | 11525 |
162
+ | **total** | **3457698** |
163
+
164
+ ### German
165
+
166
+ | Dataset file | Sample Count |
167
+ | ----------------------------------------------------------- | ------------ |
168
+ | de/bactrianx_DE_fastchat.jsonl | 67017 |
169
+ | de/freedomintelligence_alpaca_gpt4_deutsch_fastchat.jsonl | 49969 |
170
+ | de/freedomintelligence_evol_instruct_deutsch_fastchat.jsonl | 59022 |
171
+ | de/freedomintelligence_sharegpt_deutsch_fastchat.jsonl | 6101 |
172
+ | de/german_poems_fastchat.jsonl | 400 |
173
+ | de/german_songs_fastchat.jsonl | 1000 |
174
+ | de/ultrachat_de_1k_fastchat.jsonl | 959 |
175
+ | **total** | **184468** |
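
The **total** rows of both tables can be cross-checked against the per-file counts. The snippet below is a small sketch that copies the counts from the tables into dictionaries (the variable names are illustrative only) and recomputes the sums.

```python
# Per-file sample counts copied from the English table.
ENGLISH = {
    "en/bactrianx_EN_fastchat.jsonl": 66985,
    "en/code_alpaca_fastchat.jsonl": 19990,
    "en/evol_instruct_143k_fastchat.jsonl": 142968,
    "en/evol_instruct_70k_fastchat.jsonl": 69968,
    "en/lmsys_chat_1m_high_quality_train_en_fastchat.jsonl": 18651,
    "en/open_orca_fastchat_aa.jsonl": 599968,
    "en/open_orca_fastchat_ab.jsonl": 599968,
    "en/open_orca_fastchat_ac.jsonl": 599968,
    "en/open_orca_fastchat_ad.jsonl": 599968,
    "en/open_orca_fastchat_ag.jsonl": 599968,
    "en/open_orca_fastchat_ah.jsonl": 33891,
    "en/sharegpt_v3_unfiltered_fastchat.jsonl": 93880,
    "en/ultrachat_200k_fastchat.jsonl": 11525,
}

# Per-file sample counts copied from the German table.
GERMAN = {
    "de/bactrianx_DE_fastchat.jsonl": 67017,
    "de/freedomintelligence_alpaca_gpt4_deutsch_fastchat.jsonl": 49969,
    "de/freedomintelligence_evol_instruct_deutsch_fastchat.jsonl": 59022,
    "de/freedomintelligence_sharegpt_deutsch_fastchat.jsonl": 6101,
    "de/german_poems_fastchat.jsonl": 400,
    "de/german_songs_fastchat.jsonl": 1000,
    "de/ultrachat_de_1k_fastchat.jsonl": 959,
}

english_total = sum(ENGLISH.values())
german_total = sum(GERMAN.values())

# Both sums agree with the totals stated in the tables.
print(english_total, german_total)  # 3457698 184468
```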
176
 
177
  ### Training Procedure
178