danielsteinigen
committed on
Update README.md
README.md
CHANGED
@@ -32,7 +32,7 @@ library_name: transformers
base_model:
- openGPT-X/Teuken-7B-base-v0.4
---
- # Model Card for Teuken-7B-instruct-v0.4
+ # Model Card for Teuken-7B-instruct-research-v0.4


[Teuken-7B-base-v0.4](https://huggingface.co/openGPT-X/Teuken-7B-base-v0.4) is a 7B parameter multilingual large language model (LLM) pre-trained with 4T tokens within the research project OpenGPT-X.

@@ -52,7 +52,7 @@ Teuken-7B-instruct-v0.4 is an instruction-tuned version of [Teuken-7B-base-v0.4]
## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
- Teuken-7B-instruct-v0.4 is intended for
+ Teuken-7B-instruct-research-v0.4 is intended for research use in all official 24 European languages. Since Teuken-7B-instruct-research-v0.4 focuses on covering all 24 EU languages, it renders more stable results across these languages and better reflects European values in its answers than English-centric models. It is therefore specialized for use in multilingual tasks.

## Disclaimer Toxic Content:

@@ -69,7 +69,7 @@ The model is not intended for use in math and coding tasks.
<!-- This section is meant to convey both technical and sociotechnical limitations. -->

- Teuken-7B-instruct-v0.4 is an instruction-tuned version of [Teuken-7B-base-v0.4](https://huggingface.co/openGPT-X/Teuken-7B-base-v0.4) that is not completely free from biases and hallucinations.
+ Teuken-7B-instruct-research-v0.4 is an instruction-tuned version of [Teuken-7B-base-v0.4](https://huggingface.co/openGPT-X/Teuken-7B-base-v0.4) that is not completely free from biases and hallucinations.

## How to Get Started with the Model

@@ -142,37 +142,48 @@ More information are available in our [preprint](http://arxiv.org/abs/2410.08800
### Instruction-Tuning Data

+ For the dataset composition, we used a selection of English and German datasets from which we sampled our final dataset with an equal distribution between German and English, as shown in the following tables.

### English

+ * We only included a subsample of the OpenOrca dataset.
+ * For the LMSYS-Chat dataset, we selected only the conversations that satisfy the high-quality criteria of [LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset](https://arxiv.org/abs/2309.11998), i.e., the model answer stems from one of "GPT-3.5-turbo", "GPT-4", "Claude-1", "Claude-instant-1" or "Claude-2" and the conversation is in English.
+ * To select instruction-tuning examples based on their quality, we calculated reward scores for all English examples using [Starling-RM-7B-alpha](https://huggingface.co/berkeley-nest/Starling-RM-7B-alpha) (Apache-2.0 license).

+ For the English data, we performed the following steps for sample selection:
+ 1. Add all multi-turn examples.
+ 2. Add the entire `code_alpaca` dataset subset.
+ 3. Add the entire `lmsys_chat_1m_high_quality_train_en` dataset subset.
+ 4. For the remaining dataset subsets (`open_orca`, `evol_instruct_143k`, `evol_instruct_70k`, `sharegpt_v3`, `ultrachat_200k`, `bactrianx_EN`), add the samples with the highest reward scores so that each dataset subset contributes an equal number of high-quality examples.

+ | Dataset | Sample Count |
| ----------------------------------------------------- | ------------ |
- | en/open_orca_fastchat_ag.jsonl | 599968 |
- | en/open_orca_fastchat_ah.jsonl | 33891 |
- | en/sharegpt_v3_unfiltered_fastchat.jsonl | 93880 |
- | en/ultrachat_200k_fastchat.jsonl | 11525 |
- | **total** | **3457698** |
+ | anon8231489123/ShareGPT_Vicuna_unfiltered | 37.6K |
+ | MBZUAI/Bactrian-X | 26.9K |
+ | Open-Orca/OpenOrca | 26.9K |
+ | WizardLM/WizardLM_evol_instruct_70k | 26.9K |
+ | WizardLM/WizardLM_evol_instruct_V2_196k | 26.8K |
+ | sahil2801/CodeAlpaca-20k | 12.1K |
+ | lmsys/lmsys-chat-1m | 11.2K |
+ | HuggingFaceH4/ultrachat_200k | 7.0K |
+ | **total** | **175.5K** |

### German

+ For the German data, we include the complete datasets listed in the following table:

+ | Dataset | Sample Count |
| ----------------------------------------------------------- | ------------ |
- | **total** | **
+ | MBZUAI/Bactrian-X DE | 63.7K |
+ | FreedomIntelligence/evol-instruct-deutsch | 55.9K |
+ | FreedomIntelligence/alpaca-gpt4-deutsch | 47.5K |
+ | FreedomIntelligence/sharegpt-deutsch | 5.8K |
+ | LeoLM/German_Songs | 943 |
+ | LeoLM/German_Poems | 378 |
+ | bjoernp/ultrachat_de | 909 |
+ | **total** | **175.13K** |

### Training Procedure

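The LMSYS-Chat filtering rule added above (keep a conversation only if it was answered by one of the listed GPT/Claude models and is in English) can be sketched in a few lines. This is an editorial illustration, not the OpenGPT-X preprocessing code; the field names follow the public `lmsys/lmsys-chat-1m` schema (`model`, `language`), and the exact model-name spellings used for matching are assumptions.

```python
# Illustrative sketch (not the OpenGPT-X pipeline): keep only LMSYS-Chat-1M
# conversations that meet the high-quality criteria described above.
# Field names follow the public lmsys/lmsys-chat-1m schema; the model-name
# spellings in the allow-list are an assumption here.
HIGH_QUALITY_MODELS = {
    "gpt-3.5-turbo", "gpt-4", "claude-1", "claude-instant-1", "claude-2",
}

def is_high_quality(record: dict) -> bool:
    """Keep a record only if it was answered by a strong model and is English."""
    return (
        record.get("model", "").lower() in HIGH_QUALITY_MODELS
        and record.get("language") == "English"
    )

def filter_lmsys(records: list[dict]) -> list[dict]:
    return [r for r in records if is_high_quality(r)]
```

A filter along these lines would produce the `lmsys_chat_1m_high_quality_train_en` subset referenced in the selection steps.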
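The four English selection steps can likewise be sketched. The snippet below is a simplified editorial illustration: it assumes every example already carries a subset label, a multi-turn flag, and a precomputed Starling-RM-7B-alpha reward score; the field names and the `per_subset_quota` parameter are hypothetical.

```python
# Illustrative sketch of the English sample-selection steps described above.
# Each example is assumed to be a dict with hypothetical fields:
#   "subset"        - name of the source dataset subset
#   "reward"        - precomputed Starling-RM-7B-alpha reward score
#   "is_multi_turn" - True if the conversation has more than one turn
REMAINING_SUBSETS = [
    "open_orca", "evol_instruct_143k", "evol_instruct_70k",
    "sharegpt_v3", "ultrachat_200k", "bactrianx_EN",
]

def select_english(examples: list[dict], per_subset_quota: int) -> list[dict]:
    selected = []

    # Steps 1-3: take all multi-turn examples plus the complete
    # code_alpaca and lmsys_chat_1m_high_quality_train_en subsets.
    for ex in examples:
        if ex["is_multi_turn"] or ex["subset"] in (
            "code_alpaca",
            "lmsys_chat_1m_high_quality_train_en",
        ):
            selected.append(ex)

    # Step 4: from each remaining subset, add the highest-reward examples so
    # that every subset contributes an equal number of high-quality samples.
    already_chosen = {id(ex) for ex in selected}
    for name in REMAINING_SUBSETS:
        pool = [
            ex for ex in examples
            if ex["subset"] == name and id(ex) not in already_chosen
        ]
        pool.sort(key=lambda ex: ex["reward"], reverse=True)
        selected.extend(pool[:per_subset_quota])

    return selected
```

With a quota of roughly 27K per remaining subset, a procedure of this kind yields per-subset counts in the range reported in the English table above.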
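Finally, the added composition paragraph states that the final instruction-tuning mix is sampled with an equal split between English and German (both pools end up at roughly 175K examples according to the tables). A minimal sketch of such a balancing step, with hypothetical pool variables, could look like this:

```python
import random

# Illustrative sketch: combine the English and German pools with an equal
# share per language, as described in the dataset-composition paragraph.
# The pool contents and the seed are assumptions; the actual OpenGPT-X
# sampling code is not part of this model card.
def balance_pools(english: list[dict], german: list[dict], seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    per_language = min(len(english), len(german))  # equal contribution per language
    mixed = rng.sample(english, per_language) + rng.sample(german, per_language)
    rng.shuffle(mixed)
    return mixed
```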