🌟 Overview

This is a slightly smaller model trained on half of the FastText dataset. Since Sinhala is a low-resource language, there’s a noticeable lack of pre-trained models available for it. 😕 This gap makes it harder to represent the language properly in the world of NLP.

But hey, that’s where this model comes in! 🚀 It opens up exciting opportunities to improve tasks like sentiment analysis, machine translation, named entity recognition, or even question answering—tailored just for Sinhala. 🇱🇰✨


🛠 Model Specs

Here’s what powers this model (we went with RoBERTa):

1️⃣ vocab_size = 25,000
2️⃣ max_position_embeddings = 514
3️⃣ num_attention_heads = 12
4️⃣ num_hidden_layers = 6
5️⃣ type_vocab_size = 1
🎯 Perplexity Value: 3.5
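The specs above map directly onto a Hugging Face `RobertaConfig`. Here's a minimal sketch of building one with those values (other fields are left at library defaults, which may differ from the exact training configuration):

```python
from transformers import RobertaConfig

# Build a config matching the listed specs; unspecified fields
# (hidden size, intermediate size, etc.) keep RoBERTa defaults.
config = RobertaConfig(
    vocab_size=25_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

print(config.num_hidden_layers)  # 6
```

With 6 hidden layers instead of RoBERTa-base's 12, this is a lighter model that trains and runs faster, at the cost of some capacity.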


🚀 How to Use

You can jump right in and use this model for masked language modeling! 🧩

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("ashenR/AshenBERTo")
tokenizer = AutoTokenizer.from_pretrained("ashenR/AshenBERTo")

# Create a fill-mask pipeline
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

# Try it out with a Sinhala sentence! 🇱🇰
fill_mask("මම ගෙදර <mask>.")