# Model Card: banELECTRA-Base

## Model Details

The **banELECTRA** model is a Bangla adaptation of **ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately)**, a pre-training method for language models introduced by researchers at Google. ELECTRA uses a training strategy called **replaced token detection**, which differs from the masked language modeling (MLM) objective used by models such as BERT: a small generator network replaces some input tokens, and a discriminator network learns to predict, for every token, whether it was replaced. After pre-training, only the discriminator is fine-tuned on downstream tasks, making **ELECTRA** a more compute-efficient alternative to BERT that reaches strong performance with fewer pre-training resources.

The **banELECTRA-Base** model is tailored for Bangla text and fine-tuned for tasks such as Named Entity Recognition (NER), Part-of-Speech (POS) tagging, Sentence Similarity, and Paraphrase Identification. The model was trained on two NVIDIA A40 GPUs.

## Training Data

The **banELECTRA-Base** model was pre-trained on a **32 GB** Bangla text dataset. Below are the dataset statistics:

- Total Words: ~1.996 billion
- Unique Words: ~21.24 million
- Total Sentences: ~165.38 million
- Total Documents: ~15.62 million

## Model Architecture and Training

The **banELECTRA** model was trained using the [**ELECTRA**](https://huggingface.co/docs/transformers/en/model_doc/electra) framework with carefully selected hyperparameters to optimize performance for Bangla text. The model uses a vocabulary size of 50,000 tokens, and the discriminator consists of 12 hidden layers with 768 hidden dimensions and 12 attention heads. The generator is scaled to one-third the size of the discriminator, and training is conducted with a maximum sequence length of 256. The training employed a batch size of 96, a learning rate of 0.0004 with 10,000 warm-up steps, and a total of 1,000,000 training steps. Regularization techniques, such as a dropout rate of 0.1 and a weight decay of 0.01, were applied to improve generalization. A configuration sketch corresponding to these values is given at the end of this card.

## How to Use

```python
from transformers import ElectraTokenizer, ElectraForSequenceClassification

model_name = "banglagov/banELECTRA-Base"

# Load the tokenizer and the pre-trained discriminator with a classification head
tokenizer = ElectraTokenizer.from_pretrained(model_name)
model = ElectraForSequenceClassification.from_pretrained(model_name)

# Tokenize an example Bangla sentence
text = "এর ফলে আগামী বছর বেকারত্বের হার বৃদ্ধি এবং অর্থনৈতিক মন্দার আশঙ্কায় ইউরোপীয় ইউনিয়ন ।"
inputs = tokenizer(text, return_tensors="pt")
print("Input token IDs:", inputs["input_ids"])
```

## Experimental Results

The **banELECTRA-Base** model demonstrates strong performance on downstream tasks, as shown below:

| **Task** | **Precision** | **Recall** | **F1** |
|------------------------------------|--------|--------|--------|
| **Named Entity Recognition (NER)** | 0.8842 | 0.7930 | 0.8249 |
| **Part-of-Speech (POS) Tagging**   | 0.8757 | 0.8717 | 0.8706 |

For these results, the **banELECTRA-Base** model was fine-tuned with a **Noisy Label** model architecture; a sketch of loading the checkpoint for token-level tasks follows below.
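Since the reported results are for token-level tasks, the following is a minimal sketch of how the published checkpoint could be loaded for token classification with the `transformers` library. The label set and `num_labels` are illustrative assumptions; the fine-tuned NER/POS heads and the Noisy Label training code are not part of this base checkpoint.

```python
from transformers import ElectraTokenizer, ElectraForTokenClassification

model_name = "banglagov/banELECTRA-Base"

# Hypothetical label set for illustration only; the actual tag set used for
# the reported NER results is not specified in this card.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = ElectraTokenizer.from_pretrained(model_name)
model = ElectraForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),  # adds a randomly initialized token-classification head
)

# Tokenize a Bangla sentence and run a forward pass
text = "এর ফলে আগামী বছর বেকারত্বের হার বৃদ্ধি এবং অর্থনৈতিক মন্দার আশঙ্কায় ইউরোপীয় ইউনিয়ন ।"
inputs = tokenizer(text, return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, sequence_length, num_labels)

# The head is untrained here; fine-tune on labelled NER/POS data before
# expecting results comparable to the table above.
print(logits.argmax(dim=-1))
```

The same pattern with `ElectraForSequenceClassification` (as in the How to Use section) applies to sentence-level tasks such as Sentence Similarity and Paraphrase Identification.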
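For reference, the discriminator hyperparameters listed in the Model Architecture and Training section roughly correspond to the following `transformers` configuration. This is a sketch reconstructed from the numbers stated above, not the exact configuration shipped with the checkpoint; in particular, `intermediate_size` and `max_position_embeddings` are assumed defaults.

```python
from transformers import ElectraConfig

# Discriminator configuration implied by the card; values not stated above are assumptions.
discriminator_config = ElectraConfig(
    vocab_size=50_000,            # stated vocabulary size
    hidden_size=768,              # 768 hidden dimensions
    num_hidden_layers=12,         # 12 hidden layers
    num_attention_heads=12,       # 12 attention heads
    intermediate_size=3072,       # assumed: standard base-size feed-forward width
    max_position_embeddings=512,  # assumed: at least the 256-token training sequence length
    hidden_dropout_prob=0.1,      # dropout rate from the card
    attention_probs_dropout_prob=0.1,
)
print(discriminator_config)
```

The one-third-sized generator is part of the pre-training setup only; as described above, the discriminator is what is fine-tuned on downstream tasks.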