Maximum Language Model (218M)

A transformer-based language model inspired by the GPT architecture, incorporating RoPE (Rotary Position Embeddings) and GeGLU (GELU-based Gated Linear Unit) activations for enhanced performance.

Model Specifications

  • Parameters: 218M
  • Training Data: 3M tokens
  • Key Features:
    • RoPE (Rotary Position Embeddings) for better position encoding
    • GeGLU activation function for improved gradient flow
    • Transformer-based architecture

Position Embeddings

The model uses RoPE (Rotary Position Embeddings) instead of traditional absolute positional encodings. RoPE enables the following (a short code sketch follows this list):

  • Better relative position modeling, since attention scores depend only on the offset between tokens
  • Enhanced extrapolation to sequences longer than those seen during training
  • A principled, rotation-based formulation of position-aware attention
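
Conceptually, RoPE rotates each pair of query/key feature dimensions by an angle proportional to the token's position, so relative offsets are encoded directly in the attention dot product. The repository's own implementation is not reproduced here, so the snippet below is a minimal PyTorch sketch of the standard formulation; the function names, head dimension, and base frequency of 10000 are illustrative assumptions, not values taken from this model.

```python
import torch

def build_rope_cache(seq_len, head_dim, base=10000.0):
    """Precompute per-position rotation angles (cos/sin) for RoPE.

    NOTE: head_dim and base are illustrative defaults, not this model's values.
    """
    # Geometrically spaced frequencies across the even feature dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)        # (seq_len, head_dim // 2)
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x, cos, sin):
    """Rotate (x1, x2) feature pairs: (x1*cos - x2*sin, x1*sin + x2*cos)."""
    # x: (batch, seq_len, n_heads, head_dim); pairs are interleaved along head_dim.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos = cos[None, :, None, :]                      # broadcast over batch and heads
    sin = sin[None, :, None, :]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)                       # interleave pairs back into head_dim

# Example: rotate query vectors before computing attention scores.
cos, sin = build_rope_cache(seq_len=128, head_dim=64)
q = torch.randn(2, 128, 8, 64)                       # (batch, seq, heads, head_dim)
q_rotated = apply_rope(q, cos, sin)
```

The same rotation is applied to the keys, so the dot product between a rotated query and key depends only on the positional offset between the two tokens.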

Activation Function

GeGLU (GELU-based Gated Linear Unit) is used as the activation in the feed-forward layers, which (a short code sketch follows this list):

  • Provides better gradient flow during training
  • Combines the benefits of gating mechanisms with the smooth GELU nonlinearity
  • Helps mitigate vanishing-gradient problems
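
For reference, the following is a minimal PyTorch sketch of a GeGLU feed-forward block in the standard GLU-variant style, GELU(xW) ⊙ (xV) followed by an output projection; the class name, layer layout, and dimensions are illustrative assumptions and are not taken from this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLUFeedForward(nn.Module):
    """Feed-forward block with a GELU-gated linear unit: (GELU(x W_g) * (x W_v)) W_o.

    NOTE: names and dimensions are illustrative, not taken from this model.
    """

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # One input projection produces both the gate and the value halves.
        self.proj_in = nn.Linear(d_model, 2 * d_ff)
        self.proj_out = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.proj_in(x).chunk(2, dim=-1)
        return self.proj_out(F.gelu(gate) * value)

# Example: a (batch, seq_len, d_model) activation keeps its shape through the block.
ffn = GeGLUFeedForward(d_model=512, d_ff=2048)
out = ffn(torch.randn(2, 128, 512))                  # -> (2, 128, 512)
```

The multiplicative gate lets the network scale features per token rather than clipping them, which is the mechanism behind the gradient-flow benefit noted in the list above.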

Additional Info:

https://github.com/KarthikDevalla/Maximum-218M
