Model Card: banT5-Base

Model Details

The banT5-Base model is a Bangla adaptation of the T5 (Text-To-Text Transfer Transformer) model, originally introduced by researchers at Google. T5 is a unified language model designed to frame all natural language processing (NLP) tasks as text-to-text problems. This allows the model to handle a variety of tasks by simply altering the input and output formats.

banT5-Base is trained specifically on a curated Bangla text corpus to deliver state-of-the-art performance on tasks such as Named Entity Recognition (NER), Part-of-Speech (POS) tagging, Question Answering, and Paraphrase Identification.

Training Data

The banT5-Base model was pre-trained on a large-scale Bangla text corpus comprising 27 GB of raw data. After cleaning and normalization, the processed dataset grew to 36 GB. The corpus statistics are summarized below:

Total Words: 1,646,252,743 (1.65 billion)
Unique Words: 15,223,848 (15.23 million)
Total Sentences: 131,412,177 (131.4 million)
Total Documents: 7,670,661 (7.67 million)
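
As a quick sanity check on these figures, the sketch below derives a few averages from the counts listed above. The variable names are illustrative, and the derived ratios are plain arithmetic on the published counts, not statistics reported by the authors.

```python
# Corpus counts exactly as listed in the model card.
total_words = 1_646_252_743
unique_words = 15_223_848
total_sentences = 131_412_177
total_documents = 7_670_661

# Derived averages (simple ratios of the counts above).
words_per_sentence = total_words / total_sentences
sentences_per_document = total_sentences / total_documents
words_per_document = total_words / total_documents

# Share of distinct word forms among all word tokens (type/token ratio).
type_token_ratio = unique_words / total_words

print(f"words per sentence:     {words_per_sentence:.2f}")
print(f"sentences per document: {sentences_per_document:.2f}")
print(f"words per document:     {words_per_document:.2f}")
print(f"type/token ratio:       {type_token_ratio:.4%}")
```

The counts work out to roughly 12.5 words per sentence and about 17 sentences per document.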

Model Architecture and Training

The banT5 model was trained with the Hugging Face Transformers library using the T5ForConditionalGeneration class. The model has a vocabulary of 50,100 tokens, 12 layers in each of the encoder and decoder, a hidden size of 768, 12 attention heads, and a feed-forward inner dimension of 3,072. Positional information is handled through relative attention with 32 buckets and a maximum distance of 128. Dropout is applied at a rate of 0.1 for regularization.

Training uses 16-bit floating-point precision (fp16) for faster computation, a maximum sequence length of 256, and a batch size of 108 per device for both training and evaluation. Optimization uses AdamW with β1 = 0.9, β2 = 0.98, ε = 1e-6, and a weight decay of 0.01. The learning rate is 5e-5 (0.00005) with a warmup ratio of 10%, and the gradient accumulation step count is 1, meaning no accumulation across batches. Training runs for 1,000,000 steps, with memory pinning and last-batch dropping enabled in the data loaders for efficient data handling.
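
The hyperparameters above can be collected into a compact form, as in the following sketch. The dictionary keys follow the parameter names of `T5Config` and `TrainingArguments` in Transformers, but the authors' actual training script is not shown in this card, so the grouping is an assumption. Plain dictionaries are used so the snippet runs without any third-party package. The per-head dimension is also an assumption: the card does not state it, and the value here simply divides the hidden size by the number of heads.

```python
# Architecture values stated in the model card, keyed by T5Config parameter names.
model_config = {
    "vocab_size": 50_100,
    "d_model": 768,
    "d_ff": 3_072,
    "num_layers": 12,
    "num_decoder_layers": 12,
    "num_heads": 12,
    "dropout_rate": 0.1,
    "relative_attention_num_buckets": 32,
    "relative_attention_max_distance": 128,
}

# Training values stated in the model card, keyed by TrainingArguments names.
training_args = {
    "fp16": True,
    "per_device_train_batch_size": 108,
    "per_device_eval_batch_size": 108,
    "learning_rate": 5e-5,
    "adam_beta1": 0.9,
    "adam_beta2": 0.98,
    "adam_epsilon": 1e-6,
    "weight_decay": 0.01,
    "warmup_ratio": 0.1,
    "gradient_accumulation_steps": 1,
    "max_steps": 1_000_000,
    "dataloader_pin_memory": True,
    "dataloader_drop_last": True,
}

# Maximum sequence length from the card; assumed to be enforced at tokenization time.
max_seq_length = 256

# Derived quantities (arithmetic only, not reported by the authors).
# Assumption: the per-head dimension equals d_model / num_heads.
head_dim = model_config["d_model"] // model_config["num_heads"]
warmup_steps = round(training_args["max_steps"] * training_args["warmup_ratio"])
sequences_per_update = (
    training_args["per_device_train_batch_size"]
    * training_args["gradient_accumulation_steps"]
)
max_tokens_per_update = sequences_per_update * max_seq_length

print("head_dim:", head_dim)
print("warmup_steps:", warmup_steps)
print("max tokens per device per update:", max_tokens_per_update)
```

With Transformers installed, `T5Config(**model_config)` should accept the first dictionary as written, and the second can be unpacked into `TrainingArguments` together with an `output_dir` of your choosing.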

Using this model in `transformers`

# T5Tokenizer is SentencePiece-based and requires the `sentencepiece` package.
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the tokenizer and the pre-trained weights from the Hugging Face Hub
model_name = "banglagov/banT5-Base"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Example input text
input_text = "এর ফলে আগামী বছর বেকারত্বের হার বৃদ্ধি এবং অর্থনৈতিক মন্দার আশঙ্কায় ইউরোপীয় ইউনিয়ন ।"

# Encode the text into a PyTorch tensor of token IDs
input_ids = tokenizer.encode(input_text, return_tensors="pt")

print("input_ids :", input_ids)

Experimental Results

On downstream evaluation, the banT5 model reached F1 scores of 0.8686 on NER and 0.8791 on POS tagging, as summarized below:

Task	Precision	Recall	F1
Named Entity Recognition (NER)	0.8882	0.8563	0.8686
Part-of-Speech (POS) Tagging	0.8813	0.8813	0.8791
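
The table does not say how the scores were averaged across classes. The sketch below compares each reported F1 with the harmonic mean of the reported precision and recall. The helper name `harmonic_f1` is illustrative. In both rows the reported F1 is slightly lower than the harmonic mean, which would be consistent with per-class F1 scores being computed first and then averaged (for example macro or weighted averaging), although that is an assumption and is not stated in this card.

```python
def harmonic_f1(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Scores copied from the table above.
reported = {
    "NER": {"precision": 0.8882, "recall": 0.8563, "f1": 0.8686},
    "POS": {"precision": 0.8813, "recall": 0.8813, "f1": 0.8791},
}

# Gap between the harmonic mean of the reported P/R and the reported F1.
gaps = {}
for task, scores in reported.items():
    derived = harmonic_f1(scores["precision"], scores["recall"])
    gaps[task] = derived - scores["f1"]
    print(f"{task}: harmonic mean = {derived:.4f}, "
          f"reported F1 = {scores['f1']:.4f}, gap = {gaps[task]:+.4f}")
```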