Update README.md

README.md
<a name="model-introduction"></a><br>
# 1. Model Introduction

- Orion-MOE8x7B is a pretrained foundation large language model with a sparse Mixture of Experts (MoE) architecture. The model is trained from scratch on a multilingual corpus comprising approximately 5 trillion tokens, including languages such as Chinese, English, Japanese, Korean, and more.

- Key Features of Orion-MoE8x7B
  - The model demonstrates exceptional performance in comprehensive evaluations compared to other models of the same parameter scale.
  - The model excels in multilingual benchmarks, significantly outperforming comparable models on Japanese and Korean test sets, and also delivering strong results in Arabic, German, French, and Spanish evaluations.
  - Leveraging its sparse MoE structure, the model achieves faster inference than dense models of similar scale; a minimal routing sketch follows this list.

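To make the last point concrete, here is a minimal, self-contained PyTorch sketch of top-k expert routing in a sparse MoE feed-forward block. It is not the Orion-MoE8x7B implementation: the expert count of 8 is taken from the model name, while `top_k = 2`, the hidden sizes, and the class and module names (`SparseMoEBlock`, `router`, `experts`) are assumptions chosen purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEBlock(nn.Module):
    """Minimal top-k mixture-of-experts feed-forward block (illustration only)."""

    def __init__(self, hidden=512, ffn=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, ffn), nn.SiLU(), nn.Linear(ffn, hidden))
            for _ in range(n_experts)
        )

    def forward(self, x):
        # x: (num_tokens, hidden). Score every expert for every token,
        # but keep only the top-k experts per token.
        scores = self.router(x)                                   # (tokens, n_experts)
        weights, chosen = torch.topk(scores, self.top_k, dim=-1)  # (tokens, top_k)
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (chosen == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue  # this expert was not selected by any token
            # Only the selected tokens are pushed through this expert's MLP.
            w = weights[token_idx, slot].unsqueeze(-1)
            out[token_idx] += w * expert(x[token_idx])
        return out

tokens = torch.randn(16, 512)            # 16 token embeddings
print(SparseMoEBlock()(tokens).shape)    # torch.Size([16, 512])
```

Because only `top_k` of the 8 expert MLPs are evaluated for each token, the per-token feed-forward compute is roughly `top_k / 8` of what a dense model with the same total parameter count would need, which is where the inference-speed advantage noted above comes from.
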
- Model Architecture

|Configuration |OrionMOE 8x7B|
|-------------------|-------------|
| ... | ... |
|seq_len | 8192 |
|Vocabulary Size | 1136664 |

- Training hyper-parameters
  - We use the AdamW optimizer with hyperparameters set to 𝛽1 = 0.9, 𝛽2 = 0.95, and a weight decay of 0.1.
  - Training begins with a learning-rate warm-up phase over 2000 iterations, during which the learning rate is linearly increased to a peak of 3e-4. Afterward, a cosine schedule gradually reduces the learning rate to 3e-5 over the course of training.
  - The model is trained using BF16/FP32 mixed precision, with a batch size of 2600, processing approximately 22 million tokens per step, as sketched below.

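For readers who want to see how these settings fit together, below is a schematic PyTorch version of the optimizer and schedule. It is a sketch under stated assumptions, not the project's training code: the total step count and the stand-in model are placeholders, and only the AdamW hyperparameters, warm-up length, peak and final learning rates, batch size, and sequence length come from the description above.

```python
import math
import torch

# Values from the description above; total_steps and the stand-in model are
# placeholders, since neither is reported here.
peak_lr, final_lr = 3e-4, 3e-5
warmup_steps, total_steps = 2000, 200_000        # total_steps is an assumption
batch_size, seq_len = 2600, 8192
tokens_per_step = batch_size * seq_len           # ≈ 21.3M, matching the ~22M reported

model = torch.nn.Linear(8, 8)                    # stand-in for the real network

# AdamW with beta1 = 0.9, beta2 = 0.95 and weight decay 0.1.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_factor(step: int) -> float:
    """Linear warm-up to peak_lr over warmup_steps, then cosine decay to final_lr."""
    if step < warmup_steps:
        return step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return (final_lr + (peak_lr - final_lr) * cosine) / peak_lr

# scheduler.step() would be called once per optimizer step in the training loop.
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)

# Mixed precision as described: the forward/backward pass would run under BF16
# autocast (torch.autocast(device_type="cuda", dtype=torch.bfloat16)) while the
# optimizer keeps FP32 master weights.

# Quick check of the schedule at a few points.
for step in (0, 1000, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{peak_lr * lr_factor(step):.2e}")
```

At these settings a single optimizer step covers 2600 × 8192 ≈ 21.3 million tokens, consistent with the roughly 22 million tokens per step quoted above.
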
- Data Distribution
  - The training data is primarily composed of English and Chinese, which together account for over 75% of the total. The remaining data includes other languages, programming code, mathematical data, etc. A detailed breakdown of the topic distribution is provided in the figure below.

<div align="center">
<img src="./assets/imgs/data_src_dist.png" alt="logo" width="50%" />
</div>