---
license: other
license_name: yi-license
license_link: LICENSE
language:
- en
- ko
pipeline_tag: text-generation
inference: false
tags:
- pytorch
- Yi-Ko
- 01-ai
- Yi
library_name: transformers
---
# Yi Ko 34B Instruct

## Training Process

1. Further pretrained on a Korean corpus.
2. Supervised fine-tuning (SFT).
3. DPO using this [preference dataset](https://huggingface.co/datasets/argilla/distilabel-capybara-dpo-7k-binarized); a minimal training sketch follows below.

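The DPO stage can be approximated with a standard TRL setup. This is a minimal sketch, not the exact recipe used for this model: the base checkpoint, hyperparameters, and column mapping are assumptions, and the `DPOTrainer`/`DPOConfig` arguments vary slightly across `trl` versions.

```python
# Minimal DPO sketch with TRL (hypothetical settings, not the exact recipe used here).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "beomi/Yi-Ko-34B"  # assumed SFT starting point; replace with the actual checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preference dataset referenced in the training process above.
dataset = load_dataset("argilla/distilabel-capybara-dpo-7k-binarized", split="train")
# DPOTrainer expects "prompt"/"chosen"/"rejected" columns; remap if the dataset uses other names.

config = DPOConfig(
    output_dir="yi-ko-34b-dpo",
    beta=0.1,                       # assumed KL penalty strength
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=5e-7,             # assumed
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,     # `tokenizer=` in older trl releases
)
trainer.train()
```
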
## Model Info

| Context Length | Parameters | Prompt Template | MMLU (5-shot) |
| --- | --- | --- | --- |
| 4k (4096) | 34B | ChatML (partly) | 49.03 |

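The model is intended to be prompted in ChatML format. A minimal sketch of building such a prompt with the tokenizer's chat template; the repository ID below is a placeholder, and it assumes the tokenizer config ships a ChatML chat template:

```python
# Build a ChatML prompt via the tokenizer's chat template and generate a reply.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-org/Yi-Ko-34B-Instruct"  # placeholder for this repository's ID
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful Korean-speaking assistant."},
    {"role": "user", "content": "안녕하세요, 오늘 날씨에 어울리는 활동을 추천해 주세요."},
]
# apply_chat_template renders the ChatML markers (<|im_start|> ... <|im_end|>) defined in the
# tokenizer config; add_generation_prompt appends the assistant header for generation.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
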
# Original Model Card by [beomi](https://huggingface.co/beomi)

Yi-Ko series models serve as advanced iterations of the 01-ai/Yi models,
benefiting from an expanded vocabulary and the inclusion of a Korean/English corpus in their further pretraining.
Just like their predecessors, Yi-Ko series models operate within the broad range of generative text models, stretching from 6 billion to 34 billion parameters.
This repository focuses on the **34B** pretrained version,
which is tailored to fit the Hugging Face Transformers format.
For access to the other models, feel free to consult the index provided below.

## Model Details

**Model Developers** Junbum Lee (Beomi)

**Variations** Yi-Ko models come in a range of parameter sizes (6B and 34B), all further trained on Korean + English (Ko) data.

**Input** Models input text only.

**Output** Models generate text only.

**Model Architecture**

Yi-Ko series models are auto-regressive language models that use an optimized transformer architecture based on Llama-2*.

<small>*The Yi model architecture is based on Llama 2, so it can be loaded via the `LlamaForCausalLM` class on Hugging Face Transformers, as shown below.</small>

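A minimal loading sketch. The repository ID is a placeholder, and a 34B model needs roughly 68 GB in bf16, so quantized or sharded loading may be required on a single GPU:

```python
# Load the model with plain Transformers; LlamaForCausalLM works because the
# Yi architecture follows Llama 2.
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

repo_id = "beomi/Yi-Ko-34B"  # placeholder; substitute the checkpoint you want to load
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = LlamaForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # use device_map="auto" to shard across available devices
    device_map="auto",
)

prompt = "대한민국의 수도는"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
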
|Model Name|Training Data|Params|Context Length|GQA|Trained Tokens|LR|Train tokens (per batch)|
|---|---|---|---|---|---|---|---|
|Yi-Ko-34B|*A mix of Korean + English online data*|34B|4k|O|40B+|5e-5|4M|

**Vocab Expansion**

| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original Yi-Series | 64000 | SentencePiece BPE |
| **Expanded Yi-Ko Series** | 78464 | SentencePiece BPE, with added Korean vocab and merges |

**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요.ㅎㅎ"**

| Model | # of tokens | Tokens |
| --- | --- | --- |
| Original Yi-Series | 47 | `['<0xEC>', '<0x95>', '<0x88>', '<0xEB>', '<0x85>', '<0x95>', '하', '<0xEC>', '<0x84>', '<0xB8>', '<0xEC>', '<0x9A>', '<0x94>', ',', '▁', '<0xEC>', '<0x98>', '<0xA4>', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '<0xEC>', '<0x9A>', '<0x94>', '.', '<0xE3>', '<0x85>', '<0x8E>', '<0xE3>', '<0x85>', '<0x8E>']` |
| **Expanded Yi-Ko Series** | 10 | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요', '.', 'ㅎ', 'ㅎ']` |
|<small>*Same Korean vocab as the Llama-2-Ko series</small>||

**Tokenizing "The Yi series models are large language models trained from scratch by developers at 01.AI."**

| Model | # of tokens | Tokens |
| --- | --- | --- |
| Original Yi-Series | 21 | `['The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language', '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers', '▁at', '▁', '0', '1', '.', 'AI', '.']` |
| **Expanded Yi-Ko Series** | 21 | `['▁The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language', '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers', '▁at', '▁', '0', '1', '.', 'AI', '.']` |
|<small>*Same Korean vocab as the Llama-2-Ko series</small>| | <small>*Since the **Expanded Yi-Ko Series** tokenizer prepends `▁` at the beginning of the text (to ensure identical tokenization of Korean sentences), the only difference on English text is a negligible change in the first token.</small>|

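The comparison above can be reproduced directly with the two tokenizers. A minimal sketch; the repository IDs are assumptions (any 01-ai/Yi checkpoint with the original 64k SentencePiece vocab should behave similarly):

```python
# Compare token counts between the original Yi tokenizer and the expanded Yi-Ko tokenizer.
from transformers import AutoTokenizer

yi_tok = AutoTokenizer.from_pretrained("01-ai/Yi-34B")        # original 64k vocab (assumed checkpoint)
yi_ko_tok = AutoTokenizer.from_pretrained("beomi/Yi-Ko-34B")  # expanded 78,464-token vocab (assumed checkpoint)

for text in [
    "안녕하세요, 오늘은 날씨가 좋네요.ㅎㅎ",
    "The Yi series models are large language models trained from scratch by developers at 01.AI.",
]:
    for name, tok in [("Original Yi", yi_tok), ("Expanded Yi-Ko", yi_ko_tok)]:
        tokens = tok.tokenize(text)
        print(f"{name}: {len(tokens)} tokens -> {tokens}")
```
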
# **Model Benchmark**

## LM Eval Harness - Korean Benchmarks

| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|----------------|------:|------|-----:|--------|-----:|---|------|
|**kmmlu_direct**|N/A |none | 5|exact_match|**0.5027**|± |0.1019|
|kobest_boolq | 1|none | 5|acc |0.9202|± |0.0072|
| | |none | 5|f1 |0.9202|± |N/A |
|kobest_copa | 1|none | 5|acc |0.8480|± |0.0114|
| | |none | 5|f1 |0.8479|± |N/A |
|kobest_hellaswag| 1|none | 5|acc |0.5320|± |0.0223|
| | |none | 5|f1 |0.5281|± |N/A |
| | |none | 5|acc_norm|0.6340|± |0.0216|
|kobest_sentineg | 1|none | 5|acc |0.9874|± |0.0056|
| | |none | 5|f1 |0.9874|± |N/A |
|haerae |N/A |none | 5|acc |0.7965|± |0.0116|
| | |none | 5|acc_norm|0.7965|± |0.0116|
| - haerae_general_knowledge | 1|none | 5|acc |0.5114|± |0.0378|
| | |none | 5|acc_norm|0.5114|± |0.0378|
| - haerae_history | 1|none | 5|acc |0.8511|± |0.0260|
| | |none | 5|acc_norm|0.8511|± |0.0260|
| - haerae_loan_word | 1|none | 5|acc |0.8402|± |0.0283|
| | |none | 5|acc_norm|0.8402|± |0.0283|
| - haerae_rare_word | 1|none | 5|acc |0.8642|± |0.0170|
| | |none | 5|acc_norm|0.8642|± |0.0170|
| - haerae_standard_nomenclature| 1|none | 5|acc |0.8301|± |0.0305|
| | |none | 5|acc_norm|0.8301|± |0.0305|

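Scores in this style can be regenerated with EleutherAI's `lm-evaluation-harness`. A minimal sketch using its Python API; the checkpoint ID and batch size are assumptions, and task availability depends on the harness version:

```python
# Run the Korean benchmarks above with lm-evaluation-harness (v0.4+ Python API).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=beomi/Yi-Ko-34B,dtype=bfloat16",  # assumed checkpoint
    tasks=["kmmlu_direct", "kobest_boolq", "kobest_copa",
           "kobest_hellaswag", "kobest_sentineg", "haerae"],
    num_fewshot=5,
    batch_size=4,  # assumed; tune to fit GPU memory
)
print(results["results"])
```
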
## LICENSE

Follows the Yi License.

## Citation

## Acknowledgement

The training was supported by the [TPU Research Cloud](https://sites.research.google/trc/) program.