---
library_name: transformers
license: apache-2.0
base_model: answerdotai/ModernBERT-base
model-index:
- name: KoModernBERT-base-mlm-v02-ckp02
  results: []
language:
- ko
---
# KoModernBERT-base-mlm-v02

This model is a fine-tuned version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) for Korean masked language modeling, trained with:

- Flash-Attention 2 (see the loading sketch below)
- StableAdamW
- Unpadding & Sequence Packing

It achieves the following results on the evaluation set:
- Loss: 1.6437
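
Because the model was trained with Flash-Attention 2, you can request the same attention backend at load time. This is a minimal loading sketch, assuming the `flash-attn` package is installed and a CUDA GPU is available; the repo id is the one used in the example below:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "x2bee/KoModernBERT-base-mlm-v01"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # assumption: bf16 inference on a recent GPU
    attn_implementation="flash_attention_2",  # requires flash-attn; use "sdpa" or "eager" otherwise
).to("cuda")
```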
## Example Use

```python
import random

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
from huggingface_hub import HfApi, login

# Optional: authenticate with the Hugging Face Hub (token stored in a local file).
with open('./api_key/HGF_TOKEN.txt', 'r') as hgf:
    login(token=hgf.read())
api = HfApi()

model_id = "x2bee/KoModernBERT-base-mlm-v01"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id).to("cuda")

def modern_bert_convert_with_multiple_masks(text: str, top_k: int = 1, select_method: str = "Logit") -> str:
    """Fill every [MASK] token in the text, one at a time, from left to right."""
    if "[MASK]" not in text:
        raise ValueError("MLM model input should include '[MASK]' in the sentence")

    while "[MASK]" in text:
        inputs = tokenizer(text, return_tensors="pt").to("cuda")
        with torch.no_grad():
            outputs = model(**inputs)

        # Position of the first remaining [MASK] token.
        input_ids = inputs["input_ids"][0].tolist()
        mask_indices = [i for i, token_id in enumerate(input_ids) if token_id == tokenizer.mask_token_id]
        current_mask_index = mask_indices[0]

        # Top-k candidate tokens for that position.
        logits = outputs.logits[0, current_mask_index]
        top_k_logits, top_k_indices = logits.topk(top_k)
        top_k_tokens = top_k_indices.tolist()

        if select_method == "Logit":
            # Sample among the top-k candidates, weighted by softmax probability.
            probabilities = torch.softmax(top_k_logits, dim=0).tolist()
            predicted_token_id = random.choices(top_k_tokens, weights=probabilities, k=1)[0]
        elif select_method == "Random":
            # Sample uniformly among the top-k candidates.
            predicted_token_id = random.choice(top_k_tokens)
        elif select_method == "Best":
            # Always take the highest-scoring candidate.
            predicted_token_id = top_k_tokens[0]
        else:
            raise ValueError("select_method should be one of ['Logit', 'Random', 'Best']")

        predicted_token = tokenizer.decode([predicted_token_id]).strip()

        # Replace only the first [MASK] and continue with the updated text.
        text = text.replace("[MASK]", predicted_token, 1)
        print(f"Predicted: {predicted_token} | Current text: {text}")

    return text
```
text = "30์ผ ์ ๋จ ๋ฌด์๊ตญ์ [MASK] ํ์ฃผ๋ก์ ์ ๋ ๋ฐ์ํ ์ ์ฃผํญ๊ณต [MASK] ๋น์ ๊ธฐ์ฒด๊ฐ [MASK]์ฐฉ๋ฅํ๋ฉด์ ๊ฐํ ๋ง์ฐฐ๋ก ์๊ธด ํ์ ์ด ๋จ์ ์๋ค. ์ด ์ฐธ์ฌ๋ก [MASK]๊ณผ ์น๋ฌด์ 181๋ช
์ค 179๋ช
์ด ์จ์ง๊ณ [MASK]๋ ํ์ฒด๋ฅผ ์์๋ณผ ์ ์์ด [MASK]๋๋ค. [MASK] ๊ท๋ชจ์ [MASK] ์์ธ ๋ฑ์ ๋ํด ๋ค์ํ [MASK]์ด ์ ๊ธฐ๋๊ณ ์๋ ๊ฐ์ด๋ฐ [MASK]์ ์ค์น๋ [MASK](์ฐฉ๋ฅ ์ ๋ ์์ ์์ค)๊ฐ [MASK]๋ฅผ ํค์ ๋ค๋ [MASK]์ด ๋์ค๊ณ ์๋ค."
result = mbm.modern_bert_convert_with_multiple_masks(text, top_k=1)
'30์ผ ์ ๋จ ๋ฌด์๊ตญ์ ํฐ๋ฏธ๋ ํ์ฃผ๋ก์ ์ ๋ ๋ฐ์ํ ์ ์ฃผํญ๊ณต ์ฌ๊ณ ๋น์ ๊ธฐ์ฒด๊ฐ ๋ฌด๋จ์ฐฉ๋ฅํ๋ฉด์ ๊ฐํ ๋ง์ฐฐ๋ก ์๊ธด ํ์ ์ด ๋จ์ ์๋ค. ์ด ์ฐธ์ฌ๋ก ์น๊ฐ๊ณผ ์น๋ฌด์ 181๋ช
์ค 179๋ช
์ด ์จ์ง๊ณ ์ผ๋ถ๋ ํ์ฒด๋ฅผ ์์๋ณผ ์ ์์ด ์ค์ข
๋๋ค. ์ฌ๊ณ ๊ท๋ชจ์ ์ฌ๊ณ ์์ธ ๋ฑ์ ๋ํด ๋ค์ํ ์ํน์ด ์ ๊ธฐ๋๊ณ ์๋ ๊ฐ์ด๋ฐ ๊ธฐ๋ด์ ์ค์น๋ ESC(์ฐฉ๋ฅ ์ ๋ ์์ ์์ค)๊ฐ ์ฌ๊ณ ๋ฅผ ํค์ ๋ค๋ ์ฃผ์ฅ์ด ๋์ค๊ณ ์๋ค.'
```python
text = "중국의 수도는 [MASK]이다"
result = modern_bert_convert_with_multiple_masks(text, top_k=1)
# '중국의 수도는 베이징이다'

text = "일본의 수도는 [MASK]이다"
result = modern_bert_convert_with_multiple_masks(text, top_k=1)
# '일본의 수도는 도쿄이다'

text = "대한민국의 가장 큰 도시는 [MASK]이다"
result = modern_bert_convert_with_multiple_masks(text, top_k=1)
# '대한민국의 가장 큰 도시는 인천이다'
```
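
For quick single-mask checks, the standard Transformers `fill-mask` pipeline can be used instead of the helper above. A minimal sketch, reusing the `model_id` defined earlier and assuming a CUDA device:

```python
from transformers import pipeline

# Fill-mask pipeline over the same checkpoint; device=0 assumes a CUDA GPU.
fill_mask = pipeline("fill-mask", model=model_id, device=0)

for candidate in fill_mask("중국의 수도는 [MASK]이다", top_k=3):
    print(candidate["token_str"], round(candidate["score"], 4))
```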
## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-06
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 8
- total_train_batch_size: 512
- total_eval_batch_size: 64
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
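
For reference, these settings map roughly onto `transformers.TrainingArguments` as in the sketch below. This is a hypothetical reconstruction, not the author's training script; the dataset, MLM data collator, and `Trainer` setup are omitted:

```python
from transformers import TrainingArguments

# Per-device batch size 8 on 8 GPUs with gradient accumulation 8
# gives the reported total train batch size of 512.
training_args = TrainingArguments(
    output_dir="KoModernBERT-base-mlm-v02-ckp02",
    learning_rate=1e-6,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    optim="adamw_torch",
    seed=42,
)
```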
### Training results

| Training Loss | Epoch  | Step  | Validation Loss |
|:-------------:|:------:|:-----:|:---------------:|
| 14.3633       | 0.0986 | 3000  | 1.7944          |
| 14.0205       | 0.1973 | 6000  | 1.7638          |
| 14.0391       | 0.2959 | 9000  | 1.7430          |
| 13.8014       | 0.3946 | 12000 | 1.7255          |
| 13.6803       | 0.4932 | 15000 | 1.7118          |
| 13.5763       | 0.5919 | 18000 | 1.6961          |
| 13.4827       | 0.6905 | 21000 | 1.6824          |
| 13.3855       | 0.7892 | 24000 | 1.6700          |
| 13.2238       | 0.8878 | 27000 | 1.6558          |
| 13.0954       | 0.9865 | 30000 | 1.6437          |
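
Since masked-LM training minimizes token-level cross-entropy, the final validation loss corresponds to a pseudo-perplexity of roughly exp(1.6437) ≈ 5.17:

```python
import math

# Perplexity implied by the reported validation cross-entropy loss.
validation_loss = 1.6437
print(round(math.exp(validation_loss), 2))  # ≈ 5.17
```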
### Framework versions

- Transformers 4.48.0
- Pytorch 2.5.1+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0