Software Entity Recognition with Noise-robust Learning
We train a BERT model for the task of software entity recognition (SER). The training data comes from WikiSER, a corpus of 1.7M sentences extracted from Wikipedia. The model is finetuned with self-regularization, which makes it robust to noise in software-domain text, such as misannotations and inconsistent naming conventions, among others.
The model recognizes 12 fine-grained named entity types: Algorithm, Application, Architecture, Data_Structure, Device, Error_Name, General_Concept, Language, Library, License, Operating_System, and Protocol.
Type | Examples |
---|---|
Algorithm | Auction algorithm, Collaborative filtering |
Application | Adobe Acrobat, Microsoft Excel |
Architecture | Graphics processing unit, Wishbone |
Data_Structure | Array, Hash table, XOR linked list |
Device | Samsung Gear S2, iPad, Intel T5300 |
Error_Name | Buffer overflow, Memory leak |
General_Concept | Memory management, Nouvelle AI |
Language | C++, Java, Python, Rust |
Library | Beautiful Soup, FastAPI |
License | Cryptix General License, MIT License |
Operating_System | Linux, Ubuntu, Red Hat OS, MorphOS |
Protocol | TLS, FTPS, HTTP 404 |
Model details
Paper: https://arxiv.org/abs/2308.10564
Code: https://github.com/taidnguyen/software_entity_recognition
Finetuned from model: bert-base-cased
Checkpoint for large version: https://huggingface.co/taidng/wikiser-bert-large
How to use
# Load the tokenizer and the finetuned token-classification model
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("taidng/wikiser-bert-base")
model = AutoModelForTokenClassification.from_pretrained("taidng/wikiser-bert-base")

# Build a named entity recognition pipeline and tag an example sentence
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Windows XP was originally bundled with Internet Explorer 6."
ner_results = nlp(example)
print(ner_results)
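The pipeline above returns one prediction per token, so a multi-word entity such as "Internet Explorer 6" arrives in pieces. The following is a minimal sketch of merging those pieces into entity spans. It assumes the labels follow a BIO scheme (for example `B-Application`, `I-Application`), which this card does not state, and that each prediction carries `entity`, `start`, and `end` keys. The helper `group_entities` and the hand-written `fake_predictions` are hypothetical illustrations, not real model output.

```python
from typing import Dict, List


def group_entities(tokens: List[Dict], text: str) -> List[Dict]:
    """Merge token-level predictions into entity spans.

    Assumption: tags look like "B-<Type>" / "I-<Type>" / "O", and each
    token dict has "entity", "start", and "end" (character offsets).
    """
    spans: List[Dict] = []
    for tok in tokens:
        tag = tok["entity"]
        if tag == "O":
            continue
        prefix, _, label = tag.partition("-")
        label = label or prefix  # tolerate tags without a B-/I- prefix
        last = spans[-1] if spans else None
        # Treat tokens as contiguous if they touch or are one character apart
        adjacent = last is not None and tok["start"] - last["end"] <= 1
        if prefix == "I" and adjacent and last["label"] == label:
            last["end"] = tok["end"]  # extend the current span
        else:
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


example_text = "Windows XP was originally bundled with Internet Explorer 6."

# Hand-written predictions for illustration only (NOT actual model output)
fake_predictions = [
    {"entity": "B-Operating_System", "start": 0, "end": 7},
    {"entity": "I-Operating_System", "start": 8, "end": 10},
    {"entity": "B-Application", "start": 39, "end": 47},
    {"entity": "I-Application", "start": 48, "end": 56},
    {"entity": "I-Application", "start": 57, "end": 58},
]

spans = group_entities(fake_predictions, example_text)
for span in spans:
    print(span["label"], "->", span["text"])
```

In practice, passing `aggregation_strategy="simple"` to `pipeline("ner", ...)` asks the library to do this grouping itself; the manual version is mainly useful when custom merging rules are needed.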
Citation
@inproceedings{nguyen2023software,
title={Software Entity Recognition with Noise-Robust Learning},
author={Nguyen, Tai and Di, Yifeng and Lee, Joohan and Chen, Muhao and Zhang, Tianyi},
booktitle={Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE'23)},
year={2023},
organization={IEEE/ACM}
}