Text Generation · Transformers · Safetensors · llama · text-generation-inference · Inference Endpoints
danielsteinigen committed · Commit 39302de · verified · 1 Parent(s): 8e19e38

Update README.md

Files changed (1):
  1. README.md +38 -27

README.md CHANGED
@@ -32,7 +32,7 @@ library_name: transformers
  base_model:
  - openGPT-X/Teuken-7B-base-v0.4
  ---
- # Model Card for Teuken-7B-instruct-v0.4


  [Teuken-7B-base-v0.4](https://huggingface.co/openGPT-X/Teuken-7B-base-v0.4) is a 7B parameter multilingual large language model (LLM) pre-trained with 4T tokens within the research project OpenGPT-X.
@@ -52,7 +52,7 @@ Teuken-7B-instruct-v0.4 is an instruction-tuned version of [Teuken-7B-base-v0.4]
  ## Uses

  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
- Teuken-7B-instruct-v0.4 is intended for commercial and research use in all official 24 European languages. Since Teuken-7B-chat-v0.4 focuses on covering all 24 EU languages, it renders more stable results across these languages and better reflects European values in its answers than English-centric models. It is therefore specialized for use in multilingual tasks.

  ## Disclaimer Toxic Content:

@@ -69,7 +69,7 @@ The model is not intended for use in math and coding tasks.

  <!-- This section is meant to convey both technical and sociotechnical limitations. -->

- Teuken-7B-instruct-v0.4 is an instruction-tuned version of [Teuken-7B-base-v0.4](https://huggingface.co/openGPT-X/Teuken-7B-base-v0.4) that is not completely free from biases and hallucinations.

  ## How to Get Started with the Model

@@ -142,37 +142,48 @@ More information are available in our [preprint](http://arxiv.org/abs/2410.08800

  ### Instruction-Tuning Data

  ### English

- | Dataset file | Sample Count |
  | ----------------------------------------------------- | ------------ |
- | en/bactrianx_EN_fastchat.jsonl | 66985 |
- | en/code_alpaca_fastchat.jsonl | 19990 |
- | en/evol_instruct_143k_fastchat.jsonl | 142968 |
- | en/evol_instruct_70k_fastchat.jsonl | 69968 |
- | en/lmsys_chat_1m_high_quality_train_en_fastchat.jsonl | 18651 |
- | en/open_orca_fastchat_aa.jsonl | 599968 |
- | en/open_orca_fastchat_ab.jsonl | 599968 |
- | en/open_orca_fastchat_ac.jsonl | 599968 |
- | en/open_orca_fastchat_ad.jsonl | 599968 |
- | en/open_orca_fastchat_ag.jsonl | 599968 |
- | en/open_orca_fastchat_ah.jsonl | 33891 |
- | en/sharegpt_v3_unfiltered_fastchat.jsonl | 93880 |
- | en/ultrachat_200k_fastchat.jsonl | 11525 |
- | **total** | **3457698** |

  ### German

- | Dataset file | Sample Count |
  | ----------------------------------------------------------- | ------------ |
- | de/bactrianx_DE_fastchat.jsonl | 67017 |
- | de/freedomintelligence_alpaca_gpt4_deutsch_fastchat.jsonl | 49969 |
- | de/freedomintelligence_evol_instruct_deutsch_fastchat.jsonl | 59022 |
- | de/freedomintelligence_sharegpt_deutsch_fastchat.jsonl | 6101 |
- | de/german_poems_fastchat.jsonl | 400 |
- | de/german_songs_fastchat.jsonl | 1000 |
- | de/ultrachat_de_1k_fastchat.jsonl | 959 |
- | **total** | **184468** |

  ### Training Procedure

  base_model:
  - openGPT-X/Teuken-7B-base-v0.4
  ---
+ # Model Card for Teuken-7B-instruct-research-v0.4


  [Teuken-7B-base-v0.4](https://huggingface.co/openGPT-X/Teuken-7B-base-v0.4) is a 7B parameter multilingual large language model (LLM) pre-trained with 4T tokens within the research project OpenGPT-X.
 
  ## Uses

  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+ Teuken-7B-instruct-research-v0.4 is intended for research use in all 24 official European languages. Since Teuken-7B-instruct-research-v0.4 focuses on covering all 24 EU languages, it renders more stable results across these languages and better reflects European values in its answers than English-centric models. It is therefore specialized for use in multilingual tasks.

  ## Disclaimer Toxic Content:

 

  <!-- This section is meant to convey both technical and sociotechnical limitations. -->

+ Teuken-7B-instruct-research-v0.4 is an instruction-tuned version of [Teuken-7B-base-v0.4](https://huggingface.co/openGPT-X/Teuken-7B-base-v0.4) that is not completely free from biases and hallucinations.

  ## How to Get Started with the Model

 
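The commit leaves the body of this section unchanged, so the diff does not show it. For orientation only, here is a minimal sketch of loading the model with the Transformers library; the repository id, the `trust_remote_code` flag, the `"User"` role name, and the `chat_template="DE"` selection are assumptions about this model family, not content of this commit.

```python
# Minimal, illustrative sketch of loading the instruct model with Transformers.
# The repo id, trust_remote_code, the "User" role name, and chat_template="DE"
# are assumptions about this model family, not taken from the diff above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openGPT-X/Teuken-7B-instruct-research-v0.4"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,      # assumed: the model ships custom code on the Hub
    torch_dtype=torch.bfloat16,
).eval()

# Build a single-turn chat prompt; per-language templates ("DE", "EN", ...) are
# assumed here, not confirmed by this diff.
messages = [{"role": "User", "content": "Wer bist du?"}]
prompt_ids = tokenizer.apply_chat_template(
    messages,
    chat_template="DE",
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)

with torch.no_grad():
    output = model.generate(prompt_ids, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```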

  ### Instruction-Tuning Data

+ For the dataset composition, we used a selection of English and German datasets from which we sampled our final dataset with an equal distribution between German and English, as shown in the following tables.
+
  ### English

+ * We only included a subsample of the OpenOrca dataset.
+ * For the LMSYS-Chat dataset, we selected only conversations meeting the high-quality criteria of [LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset](https://arxiv.org/abs/2309.11998), i.e., the model answer stems from one of "GPT-3.5-turbo", "GPT-4", "Claude-1", "Claude-instant-1", or "Claude-2" and is in English.
+ * To select instruction-tuning examples based on their quality, we calculated reward scores for all English examples using [Starling-RM-7B-alpha](https://huggingface.co/berkeley-nest/Starling-RM-7B-alpha) (Apache-2.0 license).
+
+ For the English data, we performed the following steps for sample selection (see the sketch after this list):
+ 1. Add all multi-turn examples.
+ 2. Add the entire `code_alpaca` dataset subset.
+ 3. Add the entire `lmsys_chat_1m_high_quality_train_en` dataset subset.
+ 4. For the remaining dataset subsets (`open_orca`, `evol_instruct_143k`, `evol_instruct_70k`, `sharegpt_v3`, `ultrachat_200k`, `bactrianx_EN`), add the samples with the highest reward scores so that each dataset subset contributes an equal number of high-quality examples.
+
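Read together, the bullets and steps above describe a small selection pipeline. Below is a hedged sketch of that pipeline in Python; the field names (`turns`, `reward`, `model`, `language`), the helper functions, and the `per_subset_quota` parameter are illustrative assumptions rather than code from the OpenGPT-X repository. Reward scores are assumed to come from Starling-RM-7B-alpha as stated above.

```python
# Hedged sketch of the English sample-selection steps described above.
# Field names ("turns", "reward", "model", "language") and the per-subset
# quota are illustrative assumptions, not taken from the source repository.
from typing import Dict, List

# LMSYS-Chat high-quality criteria from the bullet above: the answering model
# must be one of the listed models and the conversation must be in English.
HIGH_QUALITY_MODELS = {"GPT-3.5-turbo", "GPT-4", "Claude-1", "Claude-instant-1", "Claude-2"}

def lmsys_high_quality(sample: dict) -> bool:
    return sample["model"] in HIGH_QUALITY_MODELS and sample["language"] == "English"

KEEP_WHOLE = ["code_alpaca", "lmsys_chat_1m_high_quality_train_en"]   # steps 2-3
REWARD_RANKED = [                                                     # step 4
    "open_orca", "evol_instruct_143k", "evol_instruct_70k",
    "sharegpt_v3", "ultrachat_200k", "bactrianx_EN",
]

def select_english(subsets: Dict[str, List[dict]], per_subset_quota: int) -> List[dict]:
    """Apply steps 1-4 to a dict mapping subset name -> list of samples."""
    selected: List[dict] = []

    # Step 1: keep every multi-turn example from every subset.
    for samples in subsets.values():
        selected += [s for s in samples if s["turns"] > 1]

    # Steps 2-3: keep the code_alpaca and high-quality LMSYS subsets in full.
    for name in KEEP_WHOLE:
        selected += [s for s in subsets[name] if s["turns"] == 1]

    # Step 4: from each remaining subset, take the single-turn samples with the
    # highest reward scores, so every subset contributes the same number.
    for name in REWARD_RANKED:
        single_turn = sorted(
            (s for s in subsets[name] if s["turns"] == 1),
            key=lambda s: s["reward"],
            reverse=True,
        )
        selected += single_turn[:per_subset_quota]

    return selected
```

In this reading, choosing `per_subset_quota` so that the English total roughly matches the German total is what realizes the equal German/English distribution mentioned above; the resulting per-dataset counts are listed in the following table.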
+ | Dataset | Sample Count |
  | ----------------------------------------------------- | ------------ |
+ | anon8231489123/ShareGPT_Vicuna_unfiltered | 37.6K |
+ | MBZUAI/Bactrian-X | 26.9K |
+ | Open-Orca/OpenOrca | 26.9K |
+ | WizardLM/WizardLM_evol_instruct_70k | 26.9K |
+ | WizardLM/WizardLM_evol_instruct_V2_196k | 26.8K |
+ | sahil2801/CodeAlpaca-20k | 12.1K |
+ | lmsys/lmsys-chat-1m | 11.2K |
+ | HuggingFaceH4/ultrachat_200k | 7.0K |
+ | **total** | **175.5K** |


  ### German

+ For the German data, we include the complete datasets listed in the following table:
+
+ | Dataset | Sample Count |
  | ----------------------------------------------------------- | ------------ |
+ | MBZUAI/Bactrian-X DE | 63.7K |
+ | FreedomIntelligence/evol-instruct-deutsch | 55.9K |
+ | FreedomIntelligence/alpaca-gpt4-deutsch | 47.5K |
+ | FreedomIntelligence/sharegpt-deutsch | 5.8K |
+ | LeoLM/German_Songs | 943 |
+ | LeoLM/German_Poems | 378 |
+ | bjoernp/ultrachat_de | 909 |
+ | **total** | **175.13K** |
+
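As a quick arithmetic cross-check of the two tables and of the equal German/English split stated at the top of this section, the per-dataset counts (in thousands of samples) can simply be summed; the numbers below are copied from the tables.

```python
# Sum the per-dataset sample counts from the English and German tables above,
# expressed in thousands of samples (943 samples = 0.943K, etc.).
english_k = [37.6, 26.9, 26.9, 26.9, 26.8, 12.1, 11.2, 7.0]
german_k = [63.7, 55.9, 47.5, 5.8, 0.943, 0.378, 0.909]

print(f"English: {sum(english_k):.1f}K")  # ~175.4K, the stated 175.5K up to per-row rounding
print(f"German:  {sum(german_k):.2f}K")   # 175.13K, matching the stated total
```

Both halves come to roughly 175K samples each, which is what the equal distribution between German and English above refers to.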

  ### Training Procedure