Update README.md
README.md CHANGED
@@ -4,8 +4,32 @@ language:
- en
pipeline_tag: text-generation
---
-Model
-
-First attempt to build GPT from scratch. Used RoPE and GeGLU
# Maximum Language Model (218M)

A transformer-based language model inspired by the GPT architecture, incorporating RoPE (Rotary Position Embeddings) and GeGLU (GELU-gated linear unit) activations for enhanced performance.

## Model Specifications

- **Parameters**: 218M
- **Training Data**: 3M tokens
- **Key Features**:
  - RoPE (Rotary Position Embeddings) for better position encoding
  - GeGLU activation function for improved gradient flow
  - Transformer-based architecture
### Position Embeddings

The model uses RoPE (Rotary Position Embeddings) instead of traditional absolute positional encodings (a brief code sketch follows this list). RoPE enables:
- Better modeling of relative positions in attention
- Enhanced extrapolation to longer sequences
- A theoretically grounded formulation of position-aware attention
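
As an illustration only, here is a minimal PyTorch sketch of applying rotary embeddings to query and key tensors; the function name, tensor layout, and frequency base of 10000 are assumptions rather than this model's actual code.

```python
# Illustrative RoPE sketch; shapes and names are assumptions, not this repo's code.
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x by position-dependent angles.
    x: (batch, seq_len, n_heads, head_dim), head_dim must be even."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2

    # One rotation frequency per channel pair, decaying geometrically (RoFormer-style).
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos = angles.cos()[None, :, None, :]   # (1, seq_len, 1, half)
    sin = angles.sin()[None, :, None, :]

    # Rotate each (x1, x2) channel pair by its position-dependent angle.
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Queries and keys are rotated before the attention dot product, so each score
# depends on the relative offset between positions rather than absolute indices.
q = torch.randn(1, 16, 4, 64)
k = torch.randn(1, 16, 4, 64)
q_rot, k_rot = apply_rope(q), apply_rope(k)
```

Because only the queries and keys are rotated, no positional information needs to be added to the token embeddings themselves.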
### Activation Function

GeGLU (a gated linear unit with a GELU-activated gate) is used as the feed-forward activation function (a brief code sketch follows this list). It:
- Provides better gradient flow during training
- Combines the benefits of a gating mechanism with GELU's smooth nonlinearity
- Helps mitigate vanishing gradient problems
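
As an illustration only, here is a short PyTorch sketch of a GeGLU feed-forward block; the class name, projection layout, and dimensions are assumptions, not this model's configuration.

```python
# Illustrative GeGLU feed-forward sketch; names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLUFeedForward(nn.Module):
    """GeGLU feed-forward block: GELU(x W) * (x V), then an output projection."""

    def __init__(self, d_model: int, d_hidden: int) -> None:
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden)   # gate branch, passed through GELU
        self.value_proj = nn.Linear(d_model, d_hidden)  # plain linear "value" branch
        self.out_proj = nn.Linear(d_hidden, d_model)    # project back to model width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The GELU-activated gate modulates the value branch elementwise.
        return self.out_proj(F.gelu(self.gate_proj(x)) * self.value_proj(x))

ffn = GeGLUFeedForward(d_model=512, d_hidden=2048)
out = ffn(torch.randn(2, 16, 512))   # (batch, seq_len, d_model)
```

Compared with a plain GELU MLP, the gate adds a third projection, so GeGLU blocks often use a somewhat smaller hidden width to keep the parameter count comparable.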