# Model Card: banELECTRA-Base

## Model Details

The **banELECTRA** model is a Bangla adaptation of **ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately)**, a pre-training method for language models introduced by researchers at Google. ELECTRA uses a training strategy called **replaced token detection**, which differs from the masked language modeling (MLM) objective used by models such as BERT: a small generator network replaces some input tokens, and a discriminator network learns to predict, for every token, whether it was replaced. After pre-training, only the discriminator is fine-tuned on downstream tasks, making **ELECTRA** a more compute-efficient alternative to BERT that reaches strong performance with fewer pre-training resources.

The **banELECTRA-Base** model is tailored for Bangla text and fine-tuned for tasks such as Named Entity Recognition (NER), Part-of-Speech (POS) tagging, Sentence Similarity, and Paraphrase Identification. The model was trained on two NVIDIA A40 GPUs.

## Training Data

The **banELECTRA-Base** model was pre-trained on a **32 GB** Bangla text dataset. Below are the dataset statistics:

- Total Words: ~1.996 billion
- Unique Words: ~21.24 million
- Total Sentences: ~165.38 million
- Total Documents: ~15.62 million

## Model Architecture and Training

The **banELECTRA** model was trained using the [**ELECTRA**](https://huggingface.co/docs/transformers/en/model_doc/electra) framework with carefully selected hyperparameters to optimize performance for Bangla text. The model uses a vocabulary size of 50,000 tokens, and the discriminator consists of 12 hidden layers with 768 hidden dimensions and 12 attention heads. The generator is scaled to one-third the size of the discriminator, and training is conducted with a maximum sequence length of 256. The training employed a batch size of 96, a learning rate of 0.0004 with 10,000 warm-up steps, and a total of 1,000,000 training steps. Regularization techniques, such as a dropout rate of 0.1 and a weight decay of 0.01, were applied to improve generalization. A configuration sketch corresponding to these values is given at the end of this card.

## How to Use

```python
from transformers import ElectraTokenizer, ElectraForSequenceClassification

model_name = "banglagov/banELECTRA-Base"

# Load the tokenizer and the pre-trained discriminator with a classification head
tokenizer = ElectraTokenizer.from_pretrained(model_name)
model = ElectraForSequenceClassification.from_pretrained(model_name)

# Tokenize an example Bangla sentence
text = "এর ফলে আগামী বছর বেকারত্বের হার বৃদ্ধি এবং অর্থনৈতিক মন্দার আশঙ্কায় ইউরোপীয় ইউনিয়ন ।"
inputs = tokenizer(text, return_tensors="pt")
print("Input token IDs:", inputs["input_ids"])
```

## Experimental Results

The **banELECTRA-Base** model demonstrates strong performance on downstream tasks, as shown below:

| **Task** | **Precision** | **Recall** | **F1** |
|------------------------------------|--------|--------|--------|
| **Named Entity Recognition (NER)** | 0.8842 | 0.7930 | 0.8249 |
| **Part-of-Speech (POS) Tagging**   | 0.8757 | 0.8717 | 0.8706 |

For these results, the **banELECTRA-Base** model was fine-tuned with a **Noisy Label** model architecture; a sketch of loading the checkpoint for token-level tasks follows below.
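Since the reported results are for token-level tasks, the following is a minimal sketch of how the published checkpoint could be loaded for token classification with the `transformers` library. The label set and `num_labels` are illustrative assumptions; the fine-tuned NER/POS heads and the Noisy Label training code are not part of this base checkpoint.

```python
from transformers import ElectraTokenizer, ElectraForTokenClassification

model_name = "banglagov/banELECTRA-Base"

# Hypothetical label set for illustration only; the actual tag set used for
# the reported NER results is not specified in this card.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = ElectraTokenizer.from_pretrained(model_name)
model = ElectraForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),  # adds a randomly initialized token-classification head
)

# Tokenize a Bangla sentence and run a forward pass
text = "এর ফলে আগামী বছর বেকারত্বের হার বৃদ্ধি এবং অর্থনৈতিক মন্দার আশঙ্কায় ইউরোপীয় ইউনিয়ন ।"
inputs = tokenizer(text, return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, sequence_length, num_labels)

# The head is untrained here; fine-tune on labelled NER/POS data before
# expecting results comparable to the table above.
print(logits.argmax(dim=-1))
```

The same pattern with `ElectraForSequenceClassification` (as in the How to Use section) applies to sentence-level tasks such as Sentence Similarity and Paraphrase Identification.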
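For reference, the discriminator hyperparameters listed in the Model Architecture and Training section roughly correspond to the following `transformers` configuration. This is a sketch reconstructed from the numbers stated above, not the exact configuration shipped with the checkpoint; in particular, `intermediate_size` and `max_position_embeddings` are assumed defaults.

```python
from transformers import ElectraConfig

# Discriminator configuration implied by the card; values not stated above are assumptions.
discriminator_config = ElectraConfig(
    vocab_size=50_000,            # stated vocabulary size
    hidden_size=768,              # 768 hidden dimensions
    num_hidden_layers=12,         # 12 hidden layers
    num_attention_heads=12,       # 12 attention heads
    intermediate_size=3072,       # assumed: standard base-size feed-forward width
    max_position_embeddings=512,  # assumed: at least the 256-token training sequence length
    hidden_dropout_prob=0.1,      # dropout rate from the card
    attention_probs_dropout_prob=0.1,
)
print(discriminator_config)
```

The one-third-sized generator is part of the pre-training setup only; as described above, the discriminator is what is fine-tuned on downstream tasks.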