Software Entity Recognition with Noise-robust Learning

We train a BERT model for the task of software entity recognition (SER). The training data comes from WikiSER, a corpus of 1.7M sentences extracted from Wikipedia. The model is fine-tuned with self-regularization, which makes it more robust to noise in software-domain text, such as misannotations and inconsistent naming conventions.
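This card does not spell out the exact form of the self-regularization term. As a rough illustration only, one common family of such methods adds a consistency penalty between two stochastic predictions for the same token. The sketch below assumes a symmetric KL penalty and a weight `alpha`; both are assumptions for illustration, not the authors' confirmed formulation, and the function names are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_penalty(p, q):
    """Symmetric KL between two predicted label distributions for one token."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

def regularized_loss(p, q, gold_index, alpha=1.0):
    """Mean cross-entropy of both predictions plus alpha times the penalty.

    alpha is a hypothetical weighting hyperparameter.
    """
    cross_entropy = -0.5 * (math.log(p[gold_index]) + math.log(q[gold_index]))
    return cross_entropy + alpha * consistency_penalty(p, q)

# Two made-up predictions for the same token over three labels
# (for example, two forward passes with different dropout masks).
pass_a = [0.7, 0.2, 0.1]
pass_b = [0.6, 0.3, 0.1]
print(regularized_loss(pass_a, pass_b, gold_index=0))
```

The intuition is that a noisy gold label tends to pull the two predictions apart, so penalizing disagreement discourages the model from memorizing it.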

The model recognizes 12 fine-grained entity types: Algorithm, Application, Architecture, Data_Structure, Device, Error_Name, General_Concept, Language, Library, License, Operating_System, and Protocol.

Type              Examples
Algorithm         Auction algorithm, Collaborative filtering
Application       Adobe Acrobat, Microsoft Excel
Architecture      Graphics processing unit, Wishbone
Data_Structure    Array, Hash table, XOR linked list
Device            Samsung Gear S2, iPad, Intel T5300
Error_Name        Buffer overflow, Memory leak
General_Concept   Memory management, Nouvelle AI
Language          C++, Java, Python, Rust
Library           Beautiful Soup, FastAPI
License           Cryptix General License, MIT License
Operating_System  Linux, Ubuntu, Red Hat OS, MorphOS
Protocol          TLS, FTPS, HTTP 404
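Token-classification models usually expose these types as per-token tags. The sketch below builds a tag set from the 12 types, assuming a BIO scheme (`B-` for the first token of an entity, `I-` for continuation tokens, `O` for everything else). The BIO scheme, the tag ordering, and the helper name `build_bio_labels` are assumptions; the mapping the checkpoint actually uses is stored in `model.config.id2label`.

```python
ENTITY_TYPES = [
    "Algorithm", "Application", "Architecture", "Data_Structure", "Device",
    "Error_Name", "General_Concept", "Language", "Library", "License",
    "Operating_System", "Protocol",
]

def build_bio_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given entity types (assumed BIO scheme)."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    return labels

labels = build_bio_labels(ENTITY_TYPES)

# Hypothetical id mappings; the real ones may use a different order.
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(len(labels))  # 1 "O" tag + 2 tags per type = 25
```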

Model details

Paper: https://arxiv.org/abs/2308.10564

Code: https://github.com/taidnguyen/software_entity_recognition

Finetuned from model: bert-base-cased

Checkpoint for large version: https://huggingface.co/taidng/wikiser-bert-large

How to use

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the tokenizer and the fine-tuned token-classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained("taidng/wikiser-bert-base")
model = AutoModelForTokenClassification.from_pretrained("taidng/wikiser-bert-base")

# Wrap both in a named-entity-recognition pipeline
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Windows XP was originally bundled with Internet Explorer 6."

# Each result is a dict describing one tagged token
ner_results = nlp(example)
print(ner_results)
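By default the pipeline returns one entry per subword token. Passing `aggregation_strategy="simple"` to `pipeline(...)` merges tokens into entity spans, each a dict with `entity_group`, `score`, `word`, `start`, and `end` keys. The sketch below groups such spans by type. The helper `group_by_type` is hypothetical, and the sample list is hand-written to show the expected shape. It is not real model output, and the label strings in it are assumed.

```python
from collections import defaultdict

def group_by_type(entities, min_score=0.0):
    """Map each entity type to the list of span texts scoring at least min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)

# Hand-written stand-in for aggregated pipeline output (not real predictions).
sample = [
    {"entity_group": "Operating_System", "score": 0.99,
     "word": "Windows XP", "start": 0, "end": 10},
    {"entity_group": "Application", "score": 0.98,
     "word": "Internet Explorer 6", "start": 39, "end": 58},
]

print(group_by_type(sample, min_score=0.5))
```

With a loaded pipeline, the same helper could be called on the list returned by `nlp(example)`, provided the pipeline was created with an aggregation strategy.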

Citation

@inproceedings{nguyen2023software,
  title={Software Entity Recognition with Noise-Robust Learning},
  author={Nguyen, Tai and Di, Yifeng and Lee, Joohan and Chen, Muhao and Zhang, Tianyi},
  booktitle={Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE'23)},
  year={2023},
  organization={IEEE/ACM}
}