Committed by renillhuang
Commit 4871737 · 1 Parent(s): 301ba04
readme: Modify benchmark tables
Signed-off-by: eric <[email protected]>
- README.md +27 -27
- README_zh.md +29 -27
README.md
CHANGED
@@ -68,25 +68,26 @@ Model release and download links are provided in the table below:
 
 ## 3.1. Base Model Orion-MOE8x7B-Base Benchmarks
 ### 3.1.1. LLM evaluation results on examination and professional knowledge
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+|TestSet | Mixtral 8*7B | Qwen1.5-32b | Qwen2.5-32b | Orion 14B | Orion 8*7B|
+| -- | -- | -- | -- | -- | -- |
+|ceval | 54.0861 | 83.5 | 87.7414 | 72.8 | 89.74|
+|cmmlu | 53.21 | 82.3 | 89.0088 | 70.57 | 89.1555|
+|mmlu | 70.4 | 73.4 | 82.9 | 69.94 | 85.9|
+|mmlu_pro | 38.5 | 45.25 | 58.01 | 33.95 | 58.31|
+|ARC_c | 85.0847 | 90.1695 | 94.2373 | 79.66 | 91.8644|
+|hellaswag | 81.9458 | 81.9757 | 82.5134 | 78.53 | 89.19|
+|lambada | 76.7902 | 73.7434 | 75.3736 | 78.83 | 79.7399|
+|bbh | 50.87 | 57.28 | 67.69 | 50.35 | 55.82|
+|musr | 43.21 | 42.65 | 49.78 | 43.61 | 49.93|
+|piqa | 83.41 | 82.15 | 80.05 | 79.54 | 87.32|
+|commonsense_qa | 69.62 | 74.69 | 72.97 | 66.91 | 73.05|
+|IFEval | 24.15 | 32.97 | 41.59 | 29.08 | 30.06|
+|GQPA | 30.9 | 33.49 | 49.5 | 28.53 | 52.17|
+|human-eval | 33.5366 | 35.9756 | 46.9512 | 20.12 | 44.5122|
+|MBPP | 60.7 | 49.4 | 71 | 30 | 43.4|
+|math lv5 | 9 | 25 | 31.72 | 2.54 | 5.07|
+|gsm8k | 47.5 | 77.4 | 80.363 | 52.01 | 59.82|
+|math | 28.4 | 36.1 | 48.88 | 7.84 | 23.68|
 
 ### 3.1.2. Comparison of LLM performances on Japanese testsets
 | Model | jsquad | jcommonsenseqa | jnli | marc_ja | jaqket_v2 | paws_ja | avg |
@@ -95,7 +96,7 @@ Model release and download links are provided in the table below:
 |Qwen1.5-32B | 0.8986 | 0.8454 | 0.5099 | 0.9708 | 0.8214 | 0.4380 | 0.7474 |
 |Qwen2.5-32B | 0.8909 | 0.9383 | 0.7214 | 0.9786 | 0.8927 | 0.4215 | 0.8073 |
 |Orion-14B-Base | 0.7422 | 0.8820 | 0.7285 | 0.9406 | 0.6620 | 0.4990 | 0.7424 |
-
+|Orion 8x7B |0.9177 |0.9043 |0.9046 |0.9640 |0.8119 |0.4735 |0.8293 |
 
 ### 3.1.3. Comparison of LLM performances on Korean testsets
 |Model | haerae | kobest boolq | kobest copa | kobest hellaswag | kobest sentineg | kobest wic | paws_ko | avg |
@@ -104,7 +105,7 @@ Model release and download links are provided in the table below:
 |Qwen1.5-32B | 46.38 | 76.28 | 60.4 | 53 | 78.34 | 52.14 | 43.4 | 58.56285714 |
 |Qwen2.5-32B | 70.67 | 80.27 | 76.7 | 61.2 | 96.47 | 77.22 | 37.05 | 71.36857143 |
 |Orion-14B-Base | 69.66 | 80.63 | 77.1 | 58.2 | 92.44 | 51.19 | 44.55 | 67.68142857 |
-
+|Orion 8x7B |65.17 |85.4 |80.4 |56 |96.98 |73.57 |46.35 |71.98142857 |
 
 ### 3.1.4. Comparison of LLM performances on Arabic, German, French, and Spanish testsets
 | Lang | ar | | de | | fr | | es | |
@@ -114,21 +115,20 @@ Model release and download links are provided in the table below:
 |Qwen1.5-32B | 50.07 | 39.95 | 63.77 | 50.81 | 68.86 | 55.95 | 70.5 | 55.13 |
 |Qwen2.5-32B | 59.76 | 52.87 | 69.82 | 61.76 | 74.15 | 62.7 | 75.04 | 65.3 |
 |Orion-14B-Base | 42.26 | 33.88 | 54.65 | 38.92 | 60.21 | 42.34 | 62 | 44.62 |
-
+|Orion 8x7B |69.39 |54.32 |80.6 |63.47 |85.56 |68.78 |87.41 |70.09 |
 
 ### 3.1.5. Leakage Detection Benchmark
 The proportion of leakage data(from various evaluation benchmarks) in the pre-trained corpus; the higher the proportion, the more leakage it indicates.
 - Code: https://github.com/nishiwen1214/Benchmark-leakage-detection
 - Paper: https://web3.arxiv.org/pdf/2409.01790
-- Blog: https://mp.weixin.qq.com/s/BtcJmDEUyzAYG-fqCal2lA
 - English Test: mmlu
 - Chinese Test: ceval, cmmlu
 
 |Threshold 0.2 | qwen2.5 32b | qwen1.5 32b | orion 8x7b | orion 14b | mixtral 8x7b |
-
-|mmlu | 0.3 | 0.27
-|ceval | 0.39 | 0.38
-|cmmlu | 0.38 | 0.39
+|------|------|------|------|------|------|
+|mmlu | 0.3 | 0.27 | 0.22 | 0.28 | 0.25 |
+|ceval | 0.39 | 0.38 | 0.27 | 0.26 | 0.26 |
+|cmmlu | 0.38 | 0.39 | 0.23 | 0.27 | 0.22 |
 
 ### 3.1.6. Inference speed
 Based on 8x Nvidia RTX3090, in unit of tokens per second.
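The `avg` column of the Korean table (3.1.3) in the README.md diff above looks like an unweighted mean of the seven per-task scores. The sketch below checks that reading for the four rows visible in this diff. The variable names are illustrative only, and the equal-weight formula is an assumption inferred from the numbers, not something the README states.

```python
from statistics import fmean

# Per-task Korean scores copied from the 3.1.3 rows shown in the diff, in column order:
# haerae, kobest boolq, kobest copa, kobest hellaswag, kobest sentineg, kobest wic, paws_ko
korean_scores = {
    "Qwen1.5-32B": [46.38, 76.28, 60.4, 53, 78.34, 52.14, 43.4],
    "Qwen2.5-32B": [70.67, 80.27, 76.7, 61.2, 96.47, 77.22, 37.05],
    "Orion-14B-Base": [69.66, 80.63, 77.1, 58.2, 92.44, 51.19, 44.55],
    "Orion 8x7B": [65.17, 85.4, 80.4, 56, 96.98, 73.57, 46.35],
}

# Values of the `avg` column as they appear in the same rows.
reported_avg = {
    "Qwen1.5-32B": 58.56285714,
    "Qwen2.5-32B": 71.36857143,
    "Orion-14B-Base": 67.68142857,
    "Orion 8x7B": 71.98142857,
}

# Assumption: avg is the plain arithmetic mean with equal weight per task.
computed_avg = {model: fmean(scores) for model, scores in korean_scores.items()}

for model, value in computed_avg.items():
    delta = abs(value - reported_avg[model])
    print(f"{model}: computed {value:.8f}, reported {reported_avg[model]:.8f}, |delta| = {delta:.1e}")
```

The same check could be applied to the Japanese table (3.1.2), where the `avg` values are rounded to four decimals, so a looser tolerance would be needed there.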
README_zh.md
CHANGED
@@ -65,25 +65,27 @@
 ## 3.1. Base Model Orion-MOE8x7B-Base Evaluation
 
 ### 3.1.1. Base model benchmark comparison
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+|TestSet | Mixtral 8*7B | Qwen1.5-32b | Qwen2.5-32b | Orion 14B | Orion 8*7B|
+| -- | -- | -- | -- | -- | -- |
+|ceval | 54.0861 | 83.5 | 87.7414 | 72.8 | 89.74|
+|cmmlu | 53.21 | 82.3 | 89.0088 | 70.57 | 89.1555|
+|mmlu | 70.4 | 73.4 | 82.9 | 69.94 | 85.9|
+|mmlu_pro | 38.5 | 45.25 | 58.01 | 33.95 | 58.31|
+|ARC_c | 85.0847 | 90.1695 | 94.2373 | 79.66 | 91.8644|
+|hellaswag | 81.9458 | 81.9757 | 82.5134 | 78.53 | 89.19|
+|lambada | 76.7902 | 73.7434 | 75.3736 | 78.83 | 79.7399|
+|bbh | 50.87 | 57.28 | 67.69 | 50.35 | 55.82|
+|musr | 43.21 | 42.65 | 49.78 | 43.61 | 49.93|
+|piqa | 83.41 | 82.15 | 80.05 | 79.54 | 87.32|
+|commonsense_qa | 69.62 | 74.69 | 72.97 | 66.91 | 73.05|
+|IFEval | 24.15 | 32.97 | 41.59 | 29.08 | 30.06|
+|GQPA | 30.9 | 33.49 | 49.5 | 28.53 | 52.17|
+|human-eval | 33.5366 | 35.9756 | 46.9512 | 20.12 | 44.5122|
+|MBPP | 60.7 | 49.4 | 71 | 30 | 43.4|
+|math lv5 | 9 | 25 | 31.72 | 2.54 | 5.07|
+|gsm8k | 47.5 | 77.4 | 80.363 | 52.01 | 59.82|
+|math | 28.4 | 36.1 | 48.88 | 7.84 | 23.68|
+
 
 
 ### 3.1.2. Minor languages: Japanese
@@ -93,7 +95,7 @@
 |Qwen1.5-32B | 0.8986 | 0.8454 | 0.5099 | 0.9708 | 0.8214 | 0.4380 | 0.7474 |
 |Qwen2.5-32B | 0.8909 | 0.9383 | 0.7214 | 0.9786 | 0.8927 | 0.4215 | 0.8073 |
 |Orion-14B-Base | 0.7422 | 0.8820 | 0.7285 | 0.9406 | 0.6620 | 0.4990 | 0.7424 |
-
+|Orion 8x7B |0.9177 |0.9043 |0.9046 |0.9640 |0.8119 |0.4735 |0.8293 |
 
 
 ### 3.1.3. Minor languages: Korean
@@ -103,33 +105,33 @@
 |Qwen1.5-32B | 46.38 | 76.28 | 60.4 | 53 | 78.34 | 52.14 | 43.4 | 58.56285714 |
 |Qwen2.5-32B | 70.67 | 80.27 | 76.7 | 61.2 | 96.47 | 77.22 | 37.05 | 71.36857143 |
 |Orion-14B-Base | 69.66 | 80.63 | 77.1 | 58.2 | 92.44 | 51.19 | 44.55 | 67.68142857 |
-
+|Orion 8x7B |65.17 |85.4 |80.4 |56 |96.98 |73.57 |46.35 |71.98142857 |
+
 
 
 ### 3.1.4. Minor languages: Arabic, German, French, Spanish
 | Lang | ar | | de | | fr | | es | |
-
+|------|----|--|----|--|----|--|----|--|
 |**model**|**hellaswag**|**arc**|**hellaswag**|**arc**|**hellaswag**|**arc**|**hellaswag**|**arc**|
 |Mixtral-8x7B | 47.93 | 36.27 | 69.17 | 52.35 | 73.9 | 55.86 | 74.25 | 54.79 |
 |Qwen1.5-32B | 50.07 | 39.95 | 63.77 | 50.81 | 68.86 | 55.95 | 70.5 | 55.13 |
 |Qwen2.5-32B | 59.76 | 52.87 | 69.82 | 61.76 | 74.15 | 62.7 | 75.04 | 65.3 |
 |Orion-14B-Base | 42.26 | 33.88 | 54.65 | 38.92 | 60.21 | 42.34 | 62 | 44.62 |
-
+|Orion 8x7B |69.39 |54.32 |80.6 |63.47 |85.56 |68.78 |87.41 |70.09 |
 
 
 ### 3.1.5. Leakage detection results
 Measures the degree of leakage of test questions; the larger the value, the more severe the leakage.
 - Detection code: https://github.com/nishiwen1214/Benchmark-leakage-detection
 - Paper: https://web3.arxiv.org/pdf/2409.01790
-- Blog: https://mp.weixin.qq.com/s/BtcJmDEUyzAYG-fqCal2lA
 - English test: mmlu
 - Chinese test: ceval, cmmlu
 
 |Threshold 0.2 | qwen2.5 32b | qwen1.5 32b | orion 8x7b | orion 14b | mixtral 8x7b |
 |----|----|----|----|----|----|
-|mmlu | 0.3 | 0.27
-|ceval | 0.39 | 0.38
-|cmmlu | 0.38 | 0.39
+|mmlu | 0.3 | 0.27 | 0.22 | 0.28 | 0.25 |
+|ceval | 0.39 | 0.38 | 0.27 | 0.26 | 0.26 |
+|cmmlu | 0.38 | 0.39 | 0.23 | 0.27 | 0.22 |
 
 
 ### 3.1.6. Inference speed