---
license: mit
language:
- en
pipeline_tag: text-generation
---

# Maximum Language Model (218M)

A transformer-based language model inspired by the GPT architecture, incorporating RoPE (Rotary Position Embeddings) and GeGLU (a GELU-based gated linear unit) activations for enhanced performance.

## Model Specifications

- **Parameters**: 218M
- **Training Data**: 3M tokens
- **Key Features**:
  - RoPE (Rotary Position Embeddings) for better position encoding
  - GeGLU activation function for improved gradient flow
  - Transformer-based architecture

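If the checkpoint is published in the standard Hugging Face format, it can be loaded through the `transformers` text-generation API. The snippet below is a minimal sketch, not a verified recipe: the repository id `your-username/maximum-lm-218m` is a placeholder, and a custom architecture may require `trust_remote_code=True` or model-specific loading code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id -- substitute the actual Hub id of this model.
repo_id = "your-username/maximum-lm-218m"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Sampled continuation of a short prompt.
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```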

### Position Embeddings

The model uses RoPE (Rotary Position Embeddings) in place of traditional absolute position embeddings (see the sketch after this list). RoPE enables:

- Better relative position modeling
- Enhanced extrapolation to longer sequences
- Theoretical backing for position-aware attention

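The sketch below shows how rotary embeddings are commonly applied to the query and key tensors just before attention. It is illustrative rather than a description of this model's exact code: the rotation base of 10000 and the half-split channel layout follow the widespread GPT-NeoX-style convention and are assumptions here.

```python
import torch

def rotate_half(x):
    # Swap the two halves of the channel dimension: (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, base=10000.0):
    # q, k: (batch, heads, seq_len, head_dim); head_dim must be even.
    seq_len, head_dim = q.shape[-2], q.shape[-1]
    # One rotation frequency per channel pair.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq_len, head_dim/2)
    angles = torch.cat((angles, angles), dim=-1)                   # (seq_len, head_dim)
    cos, sin = angles.cos(), angles.sin()
    # Rotate each channel pair of q and k by a position-dependent angle.
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin

# Example shapes: batch=1, heads=4, seq_len=8, head_dim=16.
q = torch.randn(1, 4, 8, 16)
k = torch.randn(1, 4, 8, 16)
q_rot, k_rot = apply_rope(q, k)
```

Because the rotation angle depends only on token position, the dot product between a rotated query and key reduces to a function of their relative offset, which is what gives RoPE its relative-position behavior.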

### Activation Function

GeGLU (a GELU-based gated linear unit) is used as the activation function in the feed-forward layers (see the sketch after this list), which:

- Provides better gradient flow during training
- Combines the benefits of gating mechanisms with GELU's smooth, non-monotonic properties
- Helps mitigate vanishing gradient problems

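For reference, here is a minimal sketch of a GeGLU feed-forward block in PyTorch. The hidden sizes are arbitrary examples; the real model's projection shapes, biases, and any dropout may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLUFeedForward(nn.Module):
    """GeGLU feed-forward: GELU(x W) * (x V), then a projection back to d_model."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # One linear layer produces both the gate branch and the value branch.
        self.proj_in = nn.Linear(d_model, 2 * d_ff)
        self.proj_out = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.proj_in(x).chunk(2, dim=-1)
        return self.proj_out(F.gelu(gate) * value)

# Example: a feed-forward pass over a batch of token representations.
ff = GeGLUFeedForward(d_model=512, d_ff=2048)
tokens = torch.randn(2, 16, 512)   # (batch, seq_len, d_model)
out = ff(tokens)                   # (2, 16, 512)
```

Relative to a plain GELU feed-forward layer with the same `d_ff`, the gate adds an extra `d_model x d_ff` projection (roughly 1.5x the feed-forward parameters), so GeGLU models often shrink `d_ff` to keep parameter counts comparable.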