bert-base-chinese-finetuned-cmrc2018

This model is a fine-tuned version of bert-base-chinese on the CMRC2018 (Chinese Machine Reading Comprehension) dataset.

Model Description

This is a BERT-based extractive question answering model for Chinese text. Given a question and a context passage, it predicts the start and end positions of the answer span within the context and returns that span as the answer.

Key Features:

  • Base Model: bert-base-chinese
  • Task: Extractive Question Answering
  • Language: Chinese
  • Training Dataset: CMRC2018

Performance Metrics

Evaluation results on the test set:

  • Exact Match: 59.708
  • F1 Score: 60.0723
  • Number of evaluation samples: 6,254
  • Evaluation speed: 283.054 samples/second
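
The card does not say which scoring script produced these numbers. As a rough illustration only, here is a minimal sketch of how Exact Match and an overlap-based F1 are commonly defined for span extraction, assuming each Chinese character counts as one token. The function names `exact_match` and `char_f1` are made up for this example, and official evaluation scripts may normalize or tokenize answers differently.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the stripped strings are identical, else 0.0."""
    return float(prediction.strip() == reference.strip())


def char_f1(prediction: str, reference: str) -> float:
    """Character-overlap F1, treating every non-space character as a token."""
    pred_chars = [c for c in prediction if not c.isspace()]
    ref_chars = [c for c in reference if not c.isspace()]
    if not pred_chars or not ref_chars:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_chars == ref_chars)
    # Multiset intersection counts characters shared by both strings.
    overlap = sum((Counter(pred_chars) & Counter(ref_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)


# A prediction that recovers only part of the reference span.
print(exact_match("2万公里", "超过2万公里"))
print(round(char_f1("2万公里", "超过2万公里"), 3))
```

In this toy case the prediction misses two of the six reference characters, so Exact Match is zero while F1 still gives partial credit.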

Intended Uses & Limitations

Intended Uses

  • Chinese reading comprehension tasks
  • Answer extraction from given documents
  • Context-based question answering systems

Limitations

  • Only supports extractive QA (cannot generate new answers)
  • Answers must be present in the context
  • Does not support multi-hop reasoning
  • Cannot handle unanswerable questions

Training Details

Training Hyperparameters

  • Learning rate: 3e-05
  • Train batch size: 12
  • Eval batch size: 8
  • Seed: 42
  • Optimizer: AdamW (betas=(0.9,0.999), epsilon=1e-08)
  • LR scheduler: linear
  • Number of epochs: 5.0
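
To make the "linear" scheduler concrete, the sketch below computes the learning rate at a few steps in plain Python. It takes the learning rate, batch size, and epoch count from the list above and the training sample count (18,960) from the training results. It also assumes zero warmup steps, a single device, and no gradient accumulation, none of which the card states.

```python
import math

LEARNING_RATE = 3e-5     # from the card
TRAIN_BATCH_SIZE = 12    # from the card
NUM_EPOCHS = 5           # from the card
TRAIN_SAMPLES = 18_960   # from the card's training results
WARMUP_STEPS = 0         # assumption: the card does not mention warmup

# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / TRAIN_BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS


def linear_lr(step: int) -> float:
    """Learning rate at `step`: linear warmup (if any), then linear decay to zero."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / max(1, WARMUP_STEPS)
    remaining = max(0, total_steps - step)
    return LEARNING_RATE * remaining / max(1, total_steps - WARMUP_STEPS)


print("total optimizer steps:", total_steps)
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step))
```

Under these assumptions the rate starts at the configured value, halves by the midpoint of training, and reaches zero at the final step.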

Training Results

  • Training time: 892.86 seconds
  • Training samples: 18,960
  • Training speed: 106.175 samples/second
  • Training loss: 0.5625

Framework Versions

  • Transformers: 4.47.0.dev0
  • PyTorch: 2.5.1+cu124
  • Datasets: 3.1.0
  • Tokenizers: 0.20.3

Usage

import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Load model and tokenizer
model = AutoModelForQuestionAnswering.from_pretrained("real-jiakai/bert-base-chinese-finetuned-cmrc2018")
tokenizer = AutoTokenizer.from_pretrained("real-jiakai/bert-base-chinese-finetuned-cmrc2018")

# Prepare inputs
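# Question: "How long is the Great Wall?"
question = "长城有多长?"
# Context: "The Great Wall is a great construction project of ancient China, more than
# 20,000 km long in total, spanning several provinces in northern China."
context = "长城是中国古代的伟大建筑工程,全长超过2万公里,横跨中国北部多个省份。"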

# Tokenize inputs
inputs = tokenizer(
    question,
    context,
    return_tensors="pt",
    max_length=384,
    truncation=True
)

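# Get answer (inference only, so gradient tracking is disabled)
with torch.no_grad():
    outputs = model(**inputs)
# Most likely start token and (exclusive) end token of the answer span
answer_start = torch.argmax(outputs.start_logits)
answer_end = torch.argmax(outputs.end_logits) + 1
answer = tokenizer.decode(inputs["input_ids"][0][answer_start:answer_end], skip_special_tokens=True)
# The BERT tokenizer decodes Chinese characters separated by spaces; removing them
# assumes the answer contains no meaningful spaces (e.g. no English phrases).
answer = answer.replace(" ", "")
print("Answer:", answer)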
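
Taking the argmax of the start and end logits independently can yield an end position before the start position, which decodes to an empty string. The following is a minimal sketch of a joint search over valid spans, not part of the original card: `best_span` is a hypothetical helper, and the `max_answer_len=30` default is an arbitrary assumption.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) of the highest-scoring span with start <= end.

    `end` is inclusive, and a span covers at most `max_answer_len` tokens.
    """
    best = (0, 0, float("-inf"))
    for start, start_score in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length cap.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best[2]:
                best = (start, end, score)
    return best


# Toy logits: independent argmax would pick start=3 and end=1, an invalid span.
start_logits = [0.1, 2.0, 0.3, 2.5]
end_logits = [0.2, 3.0, 1.0, 0.5]
print(best_span(start_logits, end_logits))
```

With the model above, one could pass `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`, then decode `inputs["input_ids"][0][start:end + 1]`. This sketch does not exclude question or special tokens; a fuller implementation would restrict the search to context positions.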

Citation

If you use this model, please cite the CMRC2018 dataset:

@inproceedings{cui-emnlp2019-cmrc2018,
    title = "A Span-Extraction Dataset for {C}hinese Machine Reading Comprehension",
    author = "Cui, Yiming  and
      Liu, Ting  and
      Che, Wanxiang  and
      Xiao, Li  and
      Chen, Zhipeng  and
      Ma, Wentao  and
      Wang, Shijin  and
      Hu, Guoping",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1600",
    doi = "10.18653/v1/D19-1600",
    pages = "5886--5891",
}