---
license: mit
language:
  - en
pipeline_tag: text-generation
---

# Maximum Language Model (218M)

A transformer-based language model inspired by the GPT architecture, incorporating RoPE (Rotary Position Embeddings) and GeGLU (GELU-gated linear unit) feed-forward activations for enhanced performance.

## Model Specifications

- Parameters: 218M
- Training data: 3M tokens
- Key features:
  - RoPE (Rotary Position Embeddings) for better position encoding
  - GeGLU activation in the feed-forward layers for improved gradient flow
  - Transformer-based architecture

## Position Embeddings

The model uses RoPE (Rotary Position Embeddings) instead of learned or sinusoidal absolute position embeddings. RoPE rotates query and key vectors by position-dependent angles (sketched below), which enables:

- Relative position information injected directly into the attention dot product
- Better extrapolation to sequences longer than those seen during training
- A principled, parameter-free way to make attention position-aware
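A minimal PyTorch sketch of how rotary embeddings are typically applied to query/key tensors is shown below. The tensor layout, the interleaved pairing of channels, and the `base` value are common conventions assumed here, not necessarily this model's exact implementation.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to a tensor of shape
    (batch, seq_len, n_heads, head_dim). head_dim must be even.
    Note: layout and base are illustrative assumptions."""
    _, seq_len, _, head_dim = x.shape
    # One frequency per pair of channels: (head_dim // 2,)
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    # Rotation angle for every (position, frequency) pair: (seq_len, head_dim // 2)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    # Treat consecutive (even, odd) channels as 2D points and rotate each pair.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)  # interleave the pairs back to head_dim
```

In the attention layer, this rotation is applied to queries and keys before the dot product, so the resulting attention score depends only on the relative offset between positions.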

## Activation Function

The feed-forward blocks use GeGLU (a GELU-gated linear unit; see the sketch below) as the activation, which:

- Provides better gradient flow during training
- Combines a multiplicative gating mechanism with the smooth, non-monotonic GELU nonlinearity
- Helps mitigate vanishing-gradient problems
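The following is a minimal PyTorch sketch of a GeGLU feed-forward block. The module names and the hidden width `d_ff` are illustrative assumptions, not this model's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLUFeedForward(nn.Module):
    """Feed-forward block with a GeGLU gate: GELU(x W) * (x V), then a projection back."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.proj_in = nn.Linear(d_model, 2 * d_ff)  # produces gate and value halves
        self.proj_out = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.proj_in(x).chunk(2, dim=-1)
        return self.proj_out(F.gelu(gate) * value)
```

Compared with a plain GELU feed-forward layer, the gate/value split doubles the input projection width; in practice this is often offset by shrinking `d_ff` so the parameter count stays comparable.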