hankun11 committed
Commit 771977f (verified) · Parent(s): dc9ef78

Update README.md

Files changed (1)
  1. README.md +11 -9
README.md CHANGED
@@ -48,13 +48,14 @@ tags:
<a name="model-introduction"></a><br>
# 1. Model Introduction

- - Orion-MOE8x7B-Base Large Language Model(LLM) is a pretrained generative Sparse Mixture of Experts, trained from scratch by OrionStarAI. The base model is trained on multilingual corpus, including Chinese, English, Japanese, Korean, etc, and it exhibits superior performance in these languages.
+ - Orion-MOE8x7B is a pretrained foundation large language model with a sparse Mixture of Experts (MoE) architecture. The model is trained from scratch on a multilingual corpus comprising approximately 5 trillion tokens, including languages such as Chinese, English, Japanese, Korean, and more.

- - The Orion-MOE8x7B series models exhibit the following features
- - The model demonstrates excellent performance in comprehensive evaluations compared to other base models of the same parameter scale.
- - It has strong multilingual capabilities, significantly leading in Japanese and Korean test sets, and also performing comprehensively better in Arabic, German, French, and Spanish test sets.
- - Model Hyper-Parameters
- - The architecture of the OrionMOE 8x7B models closely resembles that of Mixtral 8x7B, with specific details shown in the table below.
+ - Key Features of Orion-MOE8x7B
+   - The model demonstrates exceptional performance in comprehensive evaluations compared to other models of the same parameter scale.
+   - The model excels in multilingual benchmarks, significantly outperforming other models on Japanese and Korean test sets, and also delivering strong results on Arabic, German, French, and Spanish evaluations.
+   - Leveraging its sparse MoE structure, the model achieves faster inference than dense models of similar scale.
+
+ - Model Architecture

|Configuration |OrionMOE 8x7B|
|-------------------|-------------|
@@ -70,12 +71,13 @@ tags:
|seq_len | 8192 |
|Vocabulary Size | 1136664 |

- - Model pretrain hyper-parameters
+ - Training hyper-parameters
  - We use the AdamW optimizer with hyperparameters set to 𝛽1 = 0.9, 𝛽2 = 0.95, and a weight decay of 0.1.
  - Training begins with a learning rate warm-up phase over 2000 iterations, where the learning rate is linearly increased to a peak of 3e-4. Afterward, a cosine schedule is applied to gradually reduce the learning rate to 3e-5 over the course of training.
  - The model is trained using BF16/FP32 mixed precision, with a batch size of 2600, processing approximately 22 million tokens per step.
- - Model pretrain data distribution
- - The training dataset is primarily composed of English, Chinese, and other languages, accounting for 50%, 25%, and 12% of the data, respectively. Additionally, code makes up 9%, while mathematical text accounts for 4%. The distribution by topics is detailed in the table below.
+
+ - Data Distribution
+   - The training dataset is primarily composed of English and Chinese, which together account for over 75% of the total data. The remaining data includes other languages, programming code, mathematical data, etc. A detailed breakdown of the topic distribution is provided in the figure below.
<div align="center">
<img src="./assets/imgs/data_src_dist.png" alt="logo" width="50%" />
  </div>
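
As editorial context for the architecture notes in the diff above (a sparse Mixture of Experts closely resembling Mixtral 8x7B), here is a minimal, illustrative PyTorch sketch of top-k expert routing, the mechanism by which only a few of the eight experts run per token and which underlies the faster-inference claim. The layer sizes, router, and expert definitions are invented for illustration; this is not the Orion-MOE8x7B implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Toy sparse-MoE feed-forward layer: a router picks the top-k of
    n_experts experts for every token, so only a fraction of the layer's
    parameters is active per token. All dimensions here are made up."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model), i.e. batch and sequence dims already flattened.
        scores = self.router(x)                              # (tokens, n_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)  # choose k experts per token
        gate = F.softmax(top_vals, dim=-1)                   # mixing weights over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e                 # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += gate[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Quick smoke test with made-up dimensions (not the real model's).
layer = TopKMoELayer(d_model=64, d_ff=256)
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64])
```

Each token runs through only 2 of the 8 expert MLPs, which is why a sparse MoE of this kind can be cheaper at inference time than a dense model with the same total parameter count.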
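The training hyper-parameters above specify AdamW (𝛽1 = 0.9, 𝛽2 = 0.95, weight decay 0.1), a linear warm-up over 2000 iterations to a peak learning rate of 3e-4, and a cosine decay down to 3e-5. The sketch below shows one way such a schedule could be expressed with PyTorch's LambdaLR; TOTAL_STEPS is a hypothetical placeholder, and this is illustrative rather than OrionStarAI's actual training code. As a sanity check on the reported throughput, a batch of 2600 sequences at a sequence length of 8192 works out to 2600 × 8192 ≈ 21.3 million tokens, consistent with the stated "approximately 22 million tokens per step".

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

# Values taken from the README text above; TOTAL_STEPS is a made-up placeholder,
# since the actual number of training steps is not stated.
PEAK_LR = 3e-4       # learning rate reached at the end of warm-up
MIN_LR = 3e-5        # floor reached at the end of the cosine decay
WARMUP_STEPS = 2000  # linear warm-up iterations
TOTAL_STEPS = 100_000

model = torch.nn.Linear(16, 16)  # stand-in model; only the schedule matters here
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR,
                              betas=(0.9, 0.95), weight_decay=0.1)

def lr_factor(step: int) -> float:
    """Multiplier applied to PEAK_LR at a given optimizer step."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS  # linear ramp from 0 up to the peak
    progress = min(1.0, (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0 over training
    return (MIN_LR + (PEAK_LR - MIN_LR) * cosine) / PEAK_LR

scheduler = LambdaLR(optimizer, lr_factor)

# Each optimizer step consumes roughly batch_size * seq_len tokens:
# 2600 * 8192 = 21,299,200, i.e. the "approximately 22 million tokens per step" above.
for _ in range(5):       # training loop body elided; this only exercises the schedule
    optimizer.step()
    scheduler.step()
print(scheduler.get_last_lr())
```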