tags:
  - software engineering
  - ner
  - named-entity recognition
  - token-classification
widget:
  - text: >-
      In the field of computer graphics, a graphics processing unit (GPU)
      utilizes algorithms such as ray tracing, a rendering technique, to create
      realistic lighting effects in applications like Adobe Acrobat and
      Microsoft Excel.
    example_title: example 1
  - text: >-
      By utilizing the TensorFlow and FastAPI libraries with Python, we are
      optimizing neural network training on devices like the Samsung Gear S2 and
      Intel T5300 processor.
    example_title: example 2
language:
  - en
datasets:
  - wikiser
license: apache-2.0

Software Entity Recognition with Noise-robust Learning

We train a BERT model for the task of software entity recognition (SER). The training data comes from WikiSER, a corpus of 1.7M sentences extracted from Wikipedia. The model is finetuned with self-regularization, which makes it more robust to noise in software-domain text, such as misannotations and inconsistent naming conventions.
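
The card does not spell out the self-regularization objective. As a rough illustration only, one form used in noise-robust learning trains several copies of a model jointly and adds an agreement term that pulls each copy's predicted label distribution toward the average of all copies. The sketch below computes such a term for a single token in plain Python. The function names and the direction of the KL divergence are assumptions made for this example, not the authors' implementation; see the paper and code linked under Model details for the actual method.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def agreement_loss(logits_per_model):
    """Hypothetical agreement term for one token.

    Averages the label distributions of all model copies, then returns the
    mean KL divergence from that average to each copy's distribution.
    """
    probs = [softmax(logits) for logits in logits_per_model]
    n_labels = len(probs[0])
    mean = [sum(p[k] for p in probs) / len(probs) for k in range(n_labels)]
    return sum(kl_divergence(mean, p) for p in probs) / len(probs)


# Two copies that agree incur no penalty; two that disagree incur a positive one.
agree = agreement_loss([[2.0, 0.1, -1.0], [2.0, 0.1, -1.0]])
disagree = agreement_loss([[2.0, 0.1, -1.0], [-1.0, 0.1, 2.0]])
print(agree, disagree)
```

In a setup like this, the agreement term would be added to the usual cross-entropy loss, so that labels on which the copies cannot agree (often noisy ones) contribute less to what the model memorizes.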

The model recognizes 12 fine-grained named entities: Algorithm, Application, Architecture, Data_Structure, Device, Error_Name, General_Concept, Language, Library, License, Operating_System, and Protocol.

| Type | Examples |
|---|---|
| Algorithm | Auction algorithm, Collaborative filtering |
| Application | Adobe Acrobat, Microsoft Excel |
| Architecture | Graphics processing unit, Wishbone |
| Data_Structure | Array, Hash table, mXOR linked list |
| Device | Samsung Gear S2, iPad, Intel T5300 |
| Error_Name | Buffer overflow, Memory leak |
| General_Concept | Memory management, Nouvelle AI |
| Language | C++, Java, Python, Rust |
| Library | Beautiful Soup, FastAPI |
| License | Cryptix General License, MIT License |
| Operating_System | Linux, Ubuntu, Red Hat OS, MorphOS |
| Protocol | TLS, FTPS, HTTP 404 |

Model details

Paper: https://arxiv.org/abs/2308.10564

Code: https://github.com/taidnguyen/software_entity_recognition

Finetuned from model: bert-base-cased

Checkpoint for large version: https://huggingface.co/taidng/wikiser-bert-large

How to use

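from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the tokenizer and the finetuned token-classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained("taidng/wikiser-bert-base")
model = AutoModelForTokenClassification.from_pretrained("taidng/wikiser-bert-base")

# Build a NER pipeline and run it on an example sentence
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Windows XP was originally bundled with Internet Explorer 6."

# Each result is a dict with the predicted label, score, token, and character offsets
ner_results = nlp(example)
print(ner_results)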
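
Without further options, the pipeline returns one prediction per token, so a multi-word entity such as "Internet Explorer 6" arrives as several separate items. The sketch below merges adjacent tokens into entity spans using the `entity`, `start`, and `end` keys that the pipeline emits. It assumes the labels carry BIO prefixes (for example `B-Application`, `I-Application`), which this card does not confirm. The `group_entities` helper is hypothetical, and the sample predictions are hand-written to show the input shape; they are not actual model output.

```python
def group_entities(text, token_predictions):
    """Merge token-level predictions into entity spans.

    Assumes BIO-style labels: a token continues the current span when its
    label is "I-<type>" with the same type, or when it is a "##" word piece.
    """
    spans = []
    current = None
    for tok in token_predictions:
        prefix, sep, label = tok["entity"].partition("-")
        if not sep:  # label without a BIO prefix: treat as a new span
            prefix, label = "B", tok["entity"]
        is_word_piece = tok.get("word", "").startswith("##")
        continues = current is not None and (
            is_word_piece or (prefix == "I" and label == current["type"])
        )
        if continues:
            current["end"] = tok["end"]
        else:
            if current is not None:
                spans.append(current)
            current = {"type": label, "start": tok["start"], "end": tok["end"]}
    if current is not None:
        spans.append(current)
    # Recover the surface text of each span from its character offsets
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


text = "Windows XP was originally bundled with Internet Explorer 6."
# Hand-written predictions in the pipeline's output shape (illustrative only)
mock_predictions = [
    {"entity": "B-Operating_System", "word": "Windows", "start": 0, "end": 7},
    {"entity": "I-Operating_System", "word": "XP", "start": 8, "end": 10},
    {"entity": "B-Application", "word": "Internet", "start": 39, "end": 47},
    {"entity": "I-Application", "word": "Explorer", "start": 48, "end": 56},
    {"entity": "I-Application", "word": "6", "start": 57, "end": 58},
]
spans = group_entities(text, mock_predictions)
for span in spans:
    print(span["type"], "->", span["text"])
```

The pipeline can also do this grouping itself: passing `aggregation_strategy="simple"` to `pipeline(...)` returns merged entities with an `entity_group` key, which is usually the simpler choice when custom merging rules are not needed.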

Citation

@inproceedings{nguyen2023software,
  title={Software Entity Recognition with Noise-Robust Learning},
  author={Nguyen, Tai and Di, Yifeng and Lee, Joohan and Chen, Muhao and Zhang, Tianyi},
  booktitle={Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE'23)},
  year={2023},
  organization={IEEE/ACM}
}